Routing benchmark
Real calls. Real models. Real savings.

We ran 574 unique task prompts across 11 categories and 40 subcategories. Every prompt was sent to both the adaptively routed specialist and the GPT-4o generalist baseline. No synthetic data, no cherry-picking — just transparent head-to-head results from real API calls.

Routing Win Rate: 75% (784 of 1,090 tasks routed better than baseline)
Cost Savings: 68% ($0.87 routed vs $2.71 baseline per 1k tasks)
Quality Delta: +4.4 pts (routed outputs scored higher, not just cheaper)
Total Evaluations: 1,090 (across 2 training runs with adaptive learning)

Where routing delivers the most value

Performance varies by task type. Routing excels at high-level reasoning tasks (analysis, planning, explanation) where choosing the right specialist model matters most. Even in competitive categories like code generation, routing delivers cost savings while maintaining quality.

Category          Tasks   Win Rate   Δ Quality
Analysis             21       100%   +20.5 pts
Planning             15       100%   +21.2 pts
Explanation          21        95%   +15.4 pts
Summarization        40        78%    +9.8 pts
Creative Writing     43        69%    +9.0 pts
Math Reasoning      105        66%        —
Data Extraction      45        60%    +6.8 pts
Email Writing        20        55%    +4.5 pts
Code Generation     166        56%        —
Translation          54        52%    +2.7 pts
Comparison           14        67%   +11.1 pts

One API call. The network does the rest.

Replace your existing model calls with a single Orob endpoint. The protocol classifies your prompt, selects the optimal model from the network, executes it, and returns the result — while the routing graph learns from every outcome.

your-app.ts

// Before — single generalist, fixed cost
const res = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: prompt }]
});

// After — intelligent routing across the network
// 68% cheaper · +4.4 pts quality · same interface
const res = await orob.route({ prompt });
// Mistral Small for summaries · GPT-4.1 Nano for extraction
// DeepSeek for reasoning · Codestral for code

The router discovers which models win at which tasks

These specializations weren't manually programmed. The Thompson sampling bandit explored the model space and converged on optimal assignments through real performance data. The graph updates continuously from live traffic.

Mistral Small 3.2 · Volume champion: fastest cost-effective generalist
Win rate: 89% · Tasks: 161 · Avg cost: $0.21 · Specialties: summarization, math, translation

GPT-4.1 Nano · Ultra-low cost with strong NLP performance
Win rate: 72% · Tasks: 137 · Avg cost: $0.26 · Specialties: extraction, email, general NLP

Mistral Small 4 · Quality-cost balance for analytical tasks
Win rate: 88% · Tasks: 88 · Avg cost: $0.53 · Specialties: analysis, explanation, comparison

DeepSeek V3.2 · Deep reasoning at a fraction of frontier cost
Win rate: 92% · Tasks: 36 · Avg cost: $1.13 · Specialties: code, planning, creative

GPT-5.4 Mini · Premium quality for complex, nuanced tasks
Win rate: 82% · Tasks: 33 · Avg cost: $4.29 · Specialties: creative writing, complex code

Mistral Medium · Precision specialist for structured output
Win rate: 90% · Tasks: 31 · Avg cost: $2.87 · Specialties: planning, analysis, comparison

How we measure quality

Every benchmark result is validated through multiple independent signals. No single metric determines the outcome — the protocol blends them to build a complete picture of model performance for each task type.

Pairwise LLM Judging
An independent judge model (GPT-4.1-mini) compares routed vs baseline outputs blind. Position is randomized to prevent bias. Both outputs scored independently on relevance, completeness, accuracy, and clarity.
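The position-randomization step above can be sketched in a few lines. This is an illustrative sketch, not the protocol's actual code: `Judge` and `blindCompare` are hypothetical names, and the `judge` callback stands in for a real request to the judge model.

```typescript
// Hypothetical sketch of position-randomized pairwise judging.
// The judge sees only "first" and "second" — never which output was routed.
type Judge = (first: string, second: string) => Promise<"first" | "second" | "tie">;

async function blindCompare(
  routed: string,
  baseline: string,
  judge: Judge
): Promise<"routed" | "baseline" | "tie"> {
  // Randomize which output appears first to prevent position bias.
  const routedFirst = Math.random() < 0.5;
  const verdict = routedFirst
    ? await judge(routed, baseline)
    : await judge(baseline, routed);
  if (verdict === "tie") return "tie";
  // Map the positional verdict back to the real labels.
  const firstWon = verdict === "first";
  return firstWon === routedFirst ? "routed" : "baseline";
}
```

Because the order flips at random per comparison, a judge that systematically favors the first position adds noise rather than a directional bias.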
Deterministic Testing
Code generation tasks are executed in a sandboxed environment with real test cases. Pass/fail is determined by actual program output — not LLM opinion. Linting and static analysis provide additional quality signals.
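A minimal version of such a harness might look like the following. This is a sketch under stated assumptions, not the protocol's implementation: `passRate` and `TestCase` are illustrative names, the generated code is assumed to be JavaScript graded on stdin/stdout, and a production sandbox would add real isolation beyond a plain subprocess.

```typescript
// Hypothetical sketch: run generated code in a subprocess and grade it
// on actual program output. A real deployment would isolate this further.
import { execFileSync } from "node:child_process";
import { writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

type TestCase = { input: string; expected: string };

function passRate(generatedCode: string, tests: TestCase[]): number {
  const file = join(tmpdir(), `candidate-${Date.now()}.js`);
  writeFileSync(file, generatedCode);
  let passed = 0;
  for (const t of tests) {
    try {
      // Feed the test input on stdin; compare trimmed stdout to the expectation.
      const out = execFileSync("node", [file], { input: t.input, timeout: 5000 });
      if (out.toString().trim() === t.expected) passed++;
    } catch {
      // Crashes and timeouts count as failures.
    }
  }
  return passed / tests.length;
}
```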
Thompson Sampling
Each model-arm maintains a Beta posterior distribution over quality. Uncertain arms are naturally explored more. The protocol balances quality (60%), cost (20%), and speed (15%) in a single value score per routing decision.
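The sampling step described above can be sketched as follows. All names here (`Arm`, `sampleBeta`, `valueScore`, `pickArm`) are illustrative, cost and speed are assumed pre-normalized to [0, 1], and the stated weights sum to 95%, so any remaining term is omitted.

```typescript
// Hypothetical sketch of Thompson sampling over model arms.
// Each arm keeps a Beta(wins + 1, losses + 1) posterior over win probability.
type Arm = { name: string; wins: number; losses: number; cost: number; speed: number };

// Sample Beta(a, b) for integer a, b via order statistics:
// the a-th smallest of (a + b - 1) uniforms is Beta(a, b)-distributed.
function sampleBeta(a: number, b: number): number {
  const u = Array.from({ length: a + b - 1 }, () => Math.random()).sort((x, y) => x - y);
  return u[a - 1];
}

// Blend sampled quality with cost and speed using the stated weights.
function valueScore(arm: Arm): number {
  const quality = sampleBeta(arm.wins + 1, arm.losses + 1); // exploration lives here
  return 0.6 * quality + 0.2 * (1 - arm.cost) + 0.15 * arm.speed;
}

function pickArm(arms: Arm[]): Arm {
  let best = arms[0];
  let bestScore = valueScore(best);
  for (const arm of arms.slice(1)) {
    const score = valueScore(arm);
    if (score > bestScore) {
      best = arm;
      bestScore = score;
    }
  }
  return best;
}
```

Because quality enters as a posterior draw rather than a point estimate, arms with few observations produce high-variance samples and occasionally win the argmax, which is exactly the "uncertain arms are explored more" behavior described above.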
574 unique prompts · 1,090 API calls · 2,180 pairwise comparisons · 11 categories scored
Every task was sent to both the routed specialist and the GPT-4o baseline. Real prompts, real API calls, real token costs.