Configure a Semantic Router
Define backend model pools, signal rules, and routing decisions. The router exposes a single /v1/chat/completions endpoint your agents use — no code changes needed.
Give the router a name and configure the listener. Agents connect to the listener endpoint; they never talk to backends directly.
host.containers.internal when running inside a container to reach host services.
http://host:8899/v1/chat/completions.
Add the model backends this router can forward to. At least one is required. The default model receives requests when no decision rule matches.
/v1 — the router appends the path automatically.
Signals detect patterns in incoming requests. Decisions consume signals and pick a model. Start with keyword signals; add embedding or domain signals for more precision.
Decisions are evaluated by priority (highest first). The first matching rule selects the backend model for the request.
When on, the router adds reasoning parameters to the request before forwarding it to the target model.
Catches all unmatched requests and forwards to the default model: mlx-community/Qwen3-Coder-Next-4bit. Change the default model in Step 2.
Optional capabilities that layer on top of signal-decision routing. Each is independently toggleable.
Read-only preview of the config.yaml this router configuration would produce. Deploy it with vllm-sr serve --config config.yaml.
version: v0.1 listeners: - name: "http-8899" address: "0.0.0.0" port: 8899 timeout: "300s" providers: models: - name: "mlx-community/Qwen3-Coder-Next-4bit" param_size: "80b" endpoints: - name: "mlx-local" weight: 100 endpoint: "host.containers.internal:8000" protocol: "http" capabilities: ["coding", "debugging", "refactoring"] quality_score: 0.85 pricing: currency: "USD" prompt_per_1m: 0.0 completion_per_1m: 0.0 - name: "gemini-2.5-pro" param_size: "400b" endpoints: - name: "gemini-primary" weight: 100 endpoint: "generativelanguage.googleapis.com/v1beta/openai" protocol: "https" access_key: "••••••••••••••••" capabilities: ["reasoning", "math", "analysis", "coding", "creative"] quality_score: 0.95 pricing: currency: "USD" prompt_per_1m: 1.25 completion_per_1m: 10.00 default_model: "mlx-community/Qwen3-Coder-Next-4bit" signals: keywords: - name: "reasoning_keywords" operator: "OR" keywords: ["prove", "derive", "theorem", "induction", "research", "formal verification"] case_sensitive: false - name: "coding_keywords" operator: "OR" keywords: ["implement", "refactor", "debug", "function", "class", "build"] case_sensitive: false decisions: - name: "reasoning-route" description: "Route complex reasoning tasks to Gemini 2.5 Pro" priority: 100 rules: operator: "OR" conditions: - type: "keyword" name: "reasoning_keywords" modelRefs: - model: "gemini-2.5-pro" use_reasoning: true - name: "coding-route" description: "Route coding tasks to local Qwen3-Coder-Next" priority: 80 rules: operator: "OR" conditions: - type: "keyword" name: "coding_keywords" modelRefs: - model: "mlx-community/Qwen3-Coder-Next-4bit" use_reasoning: false - name: "default-route" description: "Default route to local model for cost savings" priority: 1 rules: operator: "AND" conditions: [] modelRefs: - model: "mlx-community/Qwen3-Coder-Next-4bit" use_reasoning: false advanced: semantic_cache: true jailbreak_detection: false pii_detection: false complexity_scoring: false