New Semantic Router

Configure a Semantic Router

Define backend model pools, signal rules, and routing decisions. The router exposes a single /v1/chat/completions endpoint your agents use — no code changes needed.

Basic setup

Give the router a name and configure the listener. Agents connect to the listener endpoint; they never talk to backends directly.

Router name * Used as the identifier when selecting this router in workspace creation.

Description

Listener address Use host.containers.internal when running inside a container to reach host services.

Listener port Default: 8899. Agents connect to http://host:8899/v1/chat/completions.

Timeout Maximum time before the request is cancelled.

Backend models

Add the model backends this router can forward to. At least one is required. The default model receives requests when no decision rule matches.

Backend model 1

Default

Model name * Must match exactly what the backend expects (Hugging Face model ID or alias).

Endpoint URL * Do not add /v1 — the router appends the path automatically.

Protocol

Weight

Param size

Quality score

Pricing (USD / 1M tokens)

Capabilities

coding debugging refactoring reasoning math analysis creative

Backend model 2

Model name *

Endpoint URL *

Protocol

Weight

Param size

Quality score

Pricing (USD / 1M tokens)

API access key Stored encrypted on this device. For shared deployments, reference a Secret Vault entry instead.

Capabilities

reasoning math analysis coding creative debugging

Signals

Signals detect patterns in incoming requests. Decisions consume signals and pick a model. Start with keyword signals; add embedding or domain signals for more precision.

Neural domain classifier (mmBERT)

Automatically detects request domains (math, code, creative writing) using a 307M-parameter multilingual model. No keyword lists needed. Adds ~40 ms classification latency.

Keyword reasoning_keywords

Signal name

Operator

Keywords (press Enter to add)

prove derive theorem induction research formal verification

Case-sensitive matching

Keyword coding_keywords

Signal name

Operator

Keywords (press Enter to add)

implement refactor debug function class build

Case-sensitive matching

Decisions

Decisions are evaluated by priority (highest first). The first matching rule selects the backend model for the request.

Priority 100 reasoning-route

Decision name

Priority

Route operator

Condition — signal reference

Target model

Enable reasoning mode on target model

When on, the router adds reasoning parameters to the request before forwarding it to the target model.

Priority 80 coding-route

Decision name

Priority

Route operator

Condition — signal reference

Target model

Enable reasoning mode on target model

Priority 1 default-route auto-generated

Catches all unmatched requests and forwards to the default model: mlx-community/Qwen3-Coder-Next-4bit. Change the default model in Step 2.

Advanced features

Optional capabilities that layer on top of signal-decision routing. Each is independently toggleable.

Semantic cache (HNSW)

Deduplicates paraphrased identical requests using vector similarity. Cache hits skip model inference entirely.

Jailbreak detection

Blocks prompt injection attempts before they reach models. Uses a neural safety classifier.

PII detection & redaction

Redacts emails, phone numbers, and API tokens from requests before they reach the model backend.

Complexity scoring

Neural pipeline estimates request complexity and adjusts routing thresholds accordingly. Improves cost efficiency for mixed workloads.

Generated config

Read-only preview of the config.yaml this router configuration would produce. Deploy it with vllm-sr serve --config config.yaml.

config.yaml

version: v0.1

listeners:
  - name: "http-8899"
    address: "0.0.0.0"
    port: 8899
    timeout: "300s"

providers:
  models:
    - name: "mlx-community/Qwen3-Coder-Next-4bit"
      param_size: "80b"
      endpoints:
        - name: "mlx-local"
          weight: 100
          endpoint: "host.containers.internal:8000"
          protocol: "http"
      capabilities: ["coding", "debugging", "refactoring"]
      quality_score: 0.85
      pricing:
        currency: "USD"
        prompt_per_1m: 0.0
        completion_per_1m: 0.0

    - name: "gemini-2.5-pro"
      param_size: "400b"
      endpoints:
        - name: "gemini-primary"
          weight: 100
          endpoint: "generativelanguage.googleapis.com/v1beta/openai"
          protocol: "https"
      access_key: "••••••••••••••••"
      capabilities: ["reasoning", "math", "analysis", "coding", "creative"]
      quality_score: 0.95
      pricing:
        currency: "USD"
        prompt_per_1m: 1.25
        completion_per_1m: 10.00

  default_model: "mlx-community/Qwen3-Coder-Next-4bit"

signals:
  keywords:
    - name: "reasoning_keywords"
      operator: "OR"
      keywords: ["prove", "derive", "theorem", "induction", "research", "formal verification"]
      case_sensitive: false
    - name: "coding_keywords"
      operator: "OR"
      keywords: ["implement", "refactor", "debug", "function", "class", "build"]
      case_sensitive: false

decisions:
  - name: "reasoning-route"
    description: "Route complex reasoning tasks to Gemini 2.5 Pro"
    priority: 100
    rules:
      operator: "OR"
      conditions:
        - type: "keyword"
          name: "reasoning_keywords"
    modelRefs:
      - model: "gemini-2.5-pro"
        use_reasoning: true

  - name: "coding-route"
    description: "Route coding tasks to local Qwen3-Coder-Next"
    priority: 80
    rules:
      operator: "OR"
      conditions:
        - type: "keyword"
          name: "coding_keywords"
    modelRefs:
      - model: "mlx-community/Qwen3-Coder-Next-4bit"
        use_reasoning: false

  - name: "default-route"
    description: "Default route to local model for cost savings"
    priority: 1
    rules:
      operator: "AND"
      conditions: []
    modelRefs:
      - model: "mlx-community/Qwen3-Coder-Next-4bit"
        use_reasoning: false

advanced:
  semantic_cache: true
  jailbreak_detection: false
  pii_detection: false
  complexity_scoring: false