enterprise-router

Running

Routes by cost, data residency & latency — local · OpenShift AI · Vertex AI · listening on 0.0.0.0:8901

Backend models: 3 (local · OpenShift AI · Vertex AI)
Routing decisions: 4 (incl. default-route)
Signals: 2 (keyword + semantic)
Routing latency: P50 55 ms · P99 120 ms
Local requests: 61% (stayed on qwen3-code)
Est. cost savings: 74% (vs. all-cloud baseline)
Routing flow
Agents (incoming requests) → enterprise-router (localhost:8901)
Routes:
sensitive_data → ibm-granite-3.3-8b (RHOAI)
high_complexity → claude-sonnet-4-5 (Vertex)
default → qwen3-code (local)
Backends:
local · qwen3-code · :11434 (Ollama) · 61% of requests · Free
OpenShift AI · ibm-granite-3.3-8b · rhoai.corp.example/… · 28% of requests · $0.50/1M
Vertex AI · claude-sonnet-4-5 · aiplatform.googleapis.com/… · 11% of requests · $3–15/1M
Cost savings — this week
With enterprise-router: $7.82
Local: $0.00 (61% of requests) · OpenShift AI: $1.58 (28% of requests) · Vertex AI: $6.24 (11% of requests)
Without router (all requests sent to Vertex AI): $29.78
$21.96 saved this week · 74% reduction
Routing 61% of requests to the local backend and 28% to on-prem OpenShift AI saves a net $21.96 compared with sending everything to Vertex AI
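The weekly figures in this panel can be re-derived with a few lines of arithmetic. The sketch below uses only the dollar amounts shown above; the variable names are illustrative and not part of the router.

```python
# Weekly spend per backend with enterprise-router, as shown in the panel (USD).
with_router = {"local": 0.00, "openshift_ai": 1.58, "vertex_ai": 6.24}

# Estimated weekly spend if every request had gone to Vertex AI (USD).
all_vertex_baseline = 29.78

total_with_router = round(sum(with_router.values()), 2)
saved = round(all_vertex_baseline - total_with_router, 2)
reduction_pct = round(100 * saved / all_vertex_baseline)

print(f"with router: ${total_with_router:.2f}")
print(f"saved: ${saved:.2f} ({reduction_pct}% reduction)")
```

The 74% figure is the rounded ratio of the saving to the all-Vertex baseline, so it already nets out the $1.58 spent on OpenShift AI.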
Router configuration
Listener: 0.0.0.0:8901 (timeout: 300s)
Last verified: 1 minute ago
Description: Enterprise hybrid router that optimises for cost, data residency, and latency across local, on-prem, and cloud backends
Default model: qwen3-code (local · free)
Advanced features: Cost optimizer · Data residency enforcement · Latency scoring
Created: May 29, 2026
Backend models
Model name | Tier | Endpoint | Weight | Quality | Capabilities | Pricing (USD/1M, prompt / completion)
qwen3-code (32B · http · default) | local | localhost:11434 | 100 | 0.82 | coding, debugging | Free (local)
ibm-granite-3.3-8b-instruct (8B · https · on-prem) | OpenShift AI | rhoai.corp.example/… | 100 | 0.88 | enterprise, analysis, secure | $0.50 / $0.50
claude-sonnet-4-5@20250929 (Vertex) (200K ctx · https · GCP enterprise) | Vertex AI | aiplatform.googleapis.com/… | 100 | 0.96 | reasoning, coding, analysis, creative | $3.00 / $15.00
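The cost optimizer is described on this page as preferring the cheapest backend that meets quality requirements. A minimal sketch of that kind of selection, using the quality scores and prices from the table above, might look like the following. The `Backend` class, the `cheapest_meeting_quality` helper, and the blended price (a plain mean of prompt and completion price) are assumptions made for illustration; the router's actual scoring is not documented here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Backend:
    name: str
    quality: float
    prompt_per_1m: float      # USD per 1M prompt tokens
    completion_per_1m: float  # USD per 1M completion tokens

    @property
    def blended_price(self) -> float:
        # Assumption: weigh prompt and completion tokens equally.
        return (self.prompt_per_1m + self.completion_per_1m) / 2


# Values copied from the backend models table.
BACKENDS = [
    Backend("qwen3-code", 0.82, 0.00, 0.00),
    Backend("ibm-granite-3.3-8b-instruct", 0.88, 0.50, 0.50),
    Backend("claude-sonnet-4-5@20250929", 0.96, 3.00, 15.00),
]


def cheapest_meeting_quality(min_quality: float) -> Optional[Backend]:
    """Return the cheapest backend whose quality score is at least min_quality."""
    candidates = [b for b in BACKENDS if b.quality >= min_quality]
    return min(candidates, key=lambda b: b.blended_price, default=None)


for floor in (0.80, 0.85, 0.90):
    print(floor, "->", cheapest_meeting_quality(floor).name)
```

With these numbers, raising the quality floor walks up the tiers from local to OpenShift AI to Vertex AI, which is the behaviour the cost figures on this page imply.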
Signals
Keyword signal: sensitive_data (operator: OR · case-insensitive)
Keywords: confidential, internal only, GDPR, PII, personal data, classified, proprietary, trade secret, NDA, restricted
Semantic signal: high_complexity (threshold: 0.75 · embedding: nomic-embed-v2)
Matches requests semantically similar to: multi-step reasoning, formal analysis, architectural design, complex refactoring, system design at scale
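The two signal types above can be sketched in a few lines. This is an illustration under stated assumptions, not the router's implementation: whole-word matching for keywords is an assumption (the page only says OR and case-insensitive), the semantic check assumes cosine similarity against the best-matching reference, and the vectors are toy values rather than real nomic-embed-v2 embeddings.

```python
import math
import re

SENSITIVE_KEYWORDS = [
    "confidential", "internal only", "GDPR", "PII", "personal data",
    "classified", "proprietary", "trade secret", "NDA", "restricted",
]

# Assumption: match whole words only, so "NDA" does not fire on "Monday".
_KEYWORD_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in SENSITIVE_KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def keyword_signal(text: str) -> bool:
    """OR operator, case-insensitive: fire if any keyword occurs in the text."""
    return _KEYWORD_RE.search(text) is not None


def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_signal(query_vec, reference_vecs, threshold: float = 0.75) -> bool:
    """Fire if the best cosine similarity to any reference reaches the threshold."""
    return max(cosine(query_vec, r) for r in reference_vecs) >= threshold


# Toy 3-dimensional "embeddings" standing in for the reference prompts.
references = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

print(keyword_signal("Please summarise this CONFIDENTIAL memo"))
print(semantic_signal([0.9, 0.1, 0.0], references))
```

In a real deployment the query and reference vectors would come from the embedding model named in the signal definition.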
Routing decisions — evaluated highest priority first
Priority 100 · residency-route
Sensitive or classified data must stay on-premise; routed to OpenShift AI
Condition: sensitive_data (OR) → Model: ibm-granite-3.3-8b-instruct · Data residency: on-prem enforced
Priority 80 · complexity-route
High-complexity requests benefit from frontier reasoning; escalated to Vertex AI
Condition: high_complexity ≥ 0.75 → Model: claude-sonnet-4-5 (Vertex) · GCP enterprise contract
Priority 50 · latency-route
P99 latency budget exceeded on local backend; overflow to OpenShift AI
Condition: latency_p99 > 400 ms → Model: ibm-granite-3.3-8b-instruct · SLA: 300 ms target
Priority 1 · default-route (auto-generated)
All remaining requests are served locally at zero cost
Condition: none (catches all remaining requests) → Model: qwen3-code
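Because decisions are evaluated highest priority first, the first matching rule wins and default-route acts as the catch-all. A minimal sketch of that evaluation order follows. The `Decision` tuple, the `route` function, and the shape of the `signals` dict are hypothetical; only the priorities, names, models, and thresholds are taken from the list above.

```python
from typing import Callable, NamedTuple


class Decision(NamedTuple):
    priority: int
    name: str
    model: str
    matches: Callable[[dict], bool]


DECISIONS = [
    Decision(100, "residency-route", "ibm-granite-3.3-8b-instruct",
             lambda s: bool(s.get("sensitive_data", False))),
    Decision(80, "complexity-route", "claude-sonnet-4-5@20250929",
             lambda s: s.get("high_complexity", 0.0) >= 0.75),
    Decision(50, "latency-route", "ibm-granite-3.3-8b-instruct",
             lambda s: s.get("latency_p99_ms", 0.0) > 400),
    Decision(1, "default-route", "qwen3-code", lambda s: True),
]


def route(signals: dict) -> Decision:
    """Return the highest-priority decision whose condition matches."""
    for decision in sorted(DECISIONS, key=lambda d: d.priority, reverse=True):
        if decision.matches(signals):
            return decision
    # Not reachable while default-route is present.
    raise RuntimeError("no decision matched")


# A request that is both sensitive and complex stays on-prem:
# residency-route outranks complexity-route.
print(route({"sensitive_data": True, "high_complexity": 0.9}).name)
print(route({}).model)
```

The ordering matters most for requests that trip several signals at once, since a sensitive but complex request must not be escalated to the cloud backend.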
Advanced features
Cost optimizer
Prefers the cheapest backend that meets quality and latency requirements — estimated 74% cost saving vs. all-cloud baseline
Data residency enforcement
Detects sensitive / classified content and hard-routes it to on-premise OpenShift AI — data never leaves the corporate network
Latency scoring
Tracks per-backend P50/P99 latency in a rolling window and overflows to a faster backend when SLA budgets are exceeded
PII detection & redaction
Strips personally identifiable information before any request leaves the on-prem perimeter
Jailbreak detection
Blocks prompt injection and policy-violation attempts before they reach any backend
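The latency scoring feature above tracks per-backend percentiles in a rolling window and overflows when a budget is exceeded. One way to sketch that idea is shown below. The `RollingLatency` class and `should_overflow` helper are hypothetical, the percentile uses the nearest-rank method as an assumption, and the 60 s window and 400 ms budget are the values that appear elsewhere on this page.

```python
import math
import time
from collections import deque


class RollingLatency:
    """Keeps latency samples from the last window_s seconds."""

    def __init__(self, window_s: float = 60.0):
        self.window_s = window_s
        self._samples = deque()  # (timestamp, latency_ms), oldest first

    def _evict(self, now: float) -> None:
        while self._samples and now - self._samples[0][0] > self.window_s:
            self._samples.popleft()

    def record(self, latency_ms: float, now=None) -> None:
        now = time.monotonic() if now is None else now
        self._samples.append((now, latency_ms))
        self._evict(now)

    def percentile(self, p: float, now=None):
        """Nearest-rank percentile over the current window, or None if empty."""
        now = time.monotonic() if now is None else now
        self._evict(now)
        if not self._samples:
            return None
        values = sorted(v for _, v in self._samples)
        rank = max(1, math.ceil(p / 100 * len(values)))
        return values[rank - 1]


def should_overflow(tracker: RollingLatency, budget_ms: float = 400.0, now=None) -> bool:
    """True when the window's P99 exceeds the budget."""
    p99 = tracker.percentile(99, now)
    return p99 is not None and p99 > budget_ms


tracker = RollingLatency(window_s=60.0)
for latency in [100.0] * 9 + [450.0]:
    tracker.record(latency, now=0.0)

print(tracker.percentile(50, now=0.0), tracker.percentile(99, now=0.0))
print(should_overflow(tracker, now=0.0))
```

Passing `now` explicitly keeps the sketch deterministic; in live use the monotonic clock default would apply.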
config.yaml — enterprise-router
version: v0.1

listeners:
  - name: "http-8901"
    address: "0.0.0.0"
    port: 8901
    timeout: "300s"

providers:
  models:
    - name: "qwen3-code"
      tier: "local"
      param_size: "32b"
      endpoints:
        - name: "ollama-local"
          weight: 100
          endpoint: "localhost:11434"
          protocol: "http"
      capabilities: ["coding", "debugging"]
      quality_score: 0.82
      pricing:
        prompt_per_1m: 0.0
        completion_per_1m: 0.0

    - name: "ibm-granite-3.3-8b-instruct"
      tier: "openshift"
      param_size: "8b"
      endpoints:
        - name: "rhoai-primary"
          weight: 100
          endpoint: "rhoai.corp.example/v1/chat/completions"
          protocol: "https"
      data_residency: "on-prem"
      capabilities: ["enterprise", "analysis", "secure"]
      quality_score: 0.88
      pricing:
        prompt_per_1m: 0.50
        completion_per_1m: 0.50

    - name: "claude-sonnet-4-5@20250929"
      tier: "vertex"
      endpoints:
        - name: "vertex-primary"
          weight: 100
          endpoint: "us-east5-aiplatform.googleapis.com/v1/projects/my-gcp-project/locations/us-east5/publishers/anthropic/models/claude-sonnet-4-5@20250929"
          protocol: "https"
      gcp_project: "my-gcp-project"
      gcp_region: "us-east5"
      capabilities: ["reasoning", "coding", "analysis", "creative"]
      quality_score: 0.96
      pricing:
        prompt_per_1m: 3.00
        completion_per_1m: 15.00

  default_model: "qwen3-code"

signals:
  keywords:
    - name: "sensitive_data"
      operator: "OR"
      keywords: ["confidential", "internal only", "GDPR", "PII", "personal data", "classified", "proprietary", "trade secret", "NDA", "restricted"]
      case_sensitive: false
  semantic:
    - name: "high_complexity"
      embedding_model: "nomic-embed-v2"
      threshold: 0.75
      reference_prompts:
        - "multi-step reasoning and formal analysis"
        - "architectural design at scale"
        - "complex system-level refactoring"

decisions:  # evaluated highest priority first
  - name: "residency-route"
    description: "Sensitive data stays on-premise"
    priority: 100
    rules:
      operator: "OR"
      conditions:
        - type: "keyword"
          name: "sensitive_data"
    modelRefs:
      - model: "ibm-granite-3.3-8b-instruct"

  - name: "complexity-route"
    description: "High-complexity tasks escalated to Vertex AI"
    priority: 80
    rules:
      operator: "OR"
      conditions:
        - type: "semantic"
          name: "high_complexity"
    modelRefs:
      - model: "claude-sonnet-4-5@20250929"

  - name: "latency-route"
    description: "Overflow to OpenShift AI when local P99 exceeds 400ms"
    priority: 50
    rules:
      conditions:
        - type: "latency"
          backend: "qwen3-code"
          metric: "p99"
          threshold_ms: 400
    modelRefs:
      - model: "ibm-granite-3.3-8b-instruct"

  - name: "default-route"
    description: "All remaining requests — served locally at zero cost"
    priority: 1
    rules:
      conditions: []  # empty: catches all remaining requests
    modelRefs:
      - model: "qwen3-code"

advanced:
  cost_optimizer: true
  data_residency: true
  latency_scoring: true
  latency_window_s: 60  # rolling window for latency scoring, in seconds
  pii_detection: true
  jailbreak_detection: false
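A config like the one above is easy to break by renaming a model without updating the decisions that reference it. The sketch below shows a small consistency check. The `validate` function is hypothetical, and the dict is a hand-copied subset of the YAML; loading the file itself would need a YAML parser, which is left out to keep the example dependency-free.

```python
# Hand-copied subset of config.yaml: model names, default model, and decisions.
config = {
    "providers": {
        "models": [
            {"name": "qwen3-code"},
            {"name": "ibm-granite-3.3-8b-instruct"},
            {"name": "claude-sonnet-4-5@20250929"},
        ],
        "default_model": "qwen3-code",
    },
    "decisions": [
        {"name": "residency-route", "priority": 100,
         "modelRefs": [{"model": "ibm-granite-3.3-8b-instruct"}]},
        {"name": "complexity-route", "priority": 80,
         "modelRefs": [{"model": "claude-sonnet-4-5@20250929"}]},
        {"name": "latency-route", "priority": 50,
         "modelRefs": [{"model": "ibm-granite-3.3-8b-instruct"}]},
        {"name": "default-route", "priority": 1,
         "modelRefs": [{"model": "qwen3-code"}]},
    ],
}


def validate(cfg: dict) -> list:
    """Return a list of human-readable problems; empty means consistent."""
    errors = []
    known = {m["name"] for m in cfg["providers"]["models"]}

    default = cfg["providers"].get("default_model")
    if default not in known:
        errors.append(f"default_model {default!r} is not a defined model")

    seen_priorities = set()
    for decision in cfg["decisions"]:
        for ref in decision.get("modelRefs", []):
            if ref["model"] not in known:
                errors.append(
                    f"{decision['name']}: unknown model {ref['model']!r}"
                )
        if decision["priority"] in seen_priorities:
            errors.append(f"{decision['name']}: duplicate priority")
        seen_priorities.add(decision["priority"])
    return errors


print(validate(config) or "config is consistent")
```

Treating duplicate priorities as an error is a design choice of this sketch; whether the router itself allows ties is not stated on this page.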