Routing benchmark
Real calls. Real models. Real savings.

We ran 574 unique task prompts across 11 categories and 40 subcategories. Every prompt was sent to both the adaptively routed specialist and the GPT-4o generalist baseline. No synthetic data, no cherry-picking — just transparent head-to-head results from real API calls.

Routing Win Rate: 75% (784 of 1,090 tasks routed better than baseline)
Cost Savings: 68% ($0.87 routed vs $2.71 baseline per 1k tasks)
Quality Delta: +4.4 pts (routed outputs scored higher, not just cheaper)
Total Evaluations: 1,090 (across 2 training runs with adaptive learning)

Where routing delivers the most value

Performance varies by task type. Routing excels at high-level reasoning tasks (analysis, planning, explanation) where choosing the right specialist model matters most. Even in competitive categories like code generation, routing delivers cost savings while maintaining quality.

Category          Tasks   Win Rate   Δ Quality
Analysis             21       100%   +20.5 pts
Planning             15       100%   +21.2 pts
Explanation          21        95%   +15.4 pts
Summarization        40        78%    +9.8 pts
Creative Writing     43        69%    +9.0 pts
Math Reasoning      105        66%        —
Data Extraction      45        60%    +6.8 pts
Email Writing        20        55%    +4.5 pts
Code Generation     166        56%        —
Translation          54        52%    +2.7 pts
Comparison           14        67%   +11.1 pts

One API call. The network does the rest.

Replace your existing model calls with a single Orob endpoint. The protocol classifies your prompt, selects the optimal model from the network, executes it, and returns the result — while the routing graph learns from every outcome.

your-app.ts

// Before — single generalist, fixed cost
const res = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: prompt }]
});

// After — intelligent routing across the network
// 68% cheaper · +4.4 pts quality · same interface
const res = await orob.route({ prompt });
// Mistral Small for summaries · GPT-4.1 Nano for extraction
// DeepSeek for reasoning · Codestral for code

The router discovers which models win at which tasks

These specializations weren't manually programmed. The Thompson sampling bandit explored the model space and converged on optimal assignments through real performance data. The graph updates continuously from live traffic.

Mistral Small 3.2 · Volume champion: fastest cost-effective generalist
Win rate: 89% · Tasks: 161 · Avg cost: $0.21 · Specialties: summarization, math, translation

GPT-4.1 Nano · Ultra-low cost with strong NLP performance
Win rate: 72% · Tasks: 137 · Avg cost: $0.26 · Specialties: extraction, email, general NLP

Mistral Small 4 · Quality-cost balance for analytical tasks
Win rate: 88% · Tasks: 88 · Avg cost: $0.53 · Specialties: analysis, explanation, comparison

DeepSeek V3.2 · Deep reasoning at a fraction of frontier cost
Win rate: 92% · Tasks: 36 · Avg cost: $1.13 · Specialties: code, planning, creative

GPT-5.4 Mini · Premium quality for complex, nuanced tasks
Win rate: 82% · Tasks: 33 · Avg cost: $4.29 · Specialties: creative writing, complex code

Mistral Medium · Precision specialist for structured output
Win rate: 90% · Tasks: 31 · Avg cost: $2.87 · Specialties: planning, analysis, comparison

How we measure quality

Every benchmark result is validated through multiple independent signals. No single metric determines the outcome — the protocol blends them to build a complete picture of model performance for each task type.

Pairwise LLM Judging
An independent judge model (GPT-4.1-mini) compares routed vs baseline outputs blind. Position is randomized to prevent bias. Both outputs scored independently on relevance, completeness, accuracy, and clarity.
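The position-randomization step above can be sketched in a few lines. This is an illustrative sketch, not the protocol's actual code: `Judge` and `blindCompare` are hypothetical names, and the `judge` callback stands in for a real request to the judge model.

```typescript
// Hypothetical sketch of position-randomized pairwise judging.
// The judge sees only "first" and "second" — never which output was routed.
type Judge = (first: string, second: string) => Promise<"first" | "second" | "tie">;

async function blindCompare(
  routed: string,
  baseline: string,
  judge: Judge
): Promise<"routed" | "baseline" | "tie"> {
  // Randomize which output appears first to prevent position bias.
  const routedFirst = Math.random() < 0.5;
  const verdict = routedFirst
    ? await judge(routed, baseline)
    : await judge(baseline, routed);
  if (verdict === "tie") return "tie";
  // Map the positional verdict back to the real labels.
  const firstWon = verdict === "first";
  return firstWon === routedFirst ? "routed" : "baseline";
}
```

Because the order flips at random per comparison, a judge that systematically favors the first position adds noise rather than a directional bias.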
Deterministic Testing
Code generation tasks are executed in a sandboxed environment with real test cases. Pass/fail is determined by actual program output — not LLM opinion. Linting and static analysis provide additional quality signals.
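A minimal version of such a harness might look like the following. This is a sketch under stated assumptions, not the protocol's implementation: `passRate` and `TestCase` are illustrative names, the generated code is assumed to be JavaScript graded on stdin/stdout, and a production sandbox would add real isolation beyond a plain subprocess.

```typescript
// Hypothetical sketch: run generated code in a subprocess and grade it
// on actual program output. A real deployment would isolate this further.
import { execFileSync } from "node:child_process";
import { writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

type TestCase = { input: string; expected: string };

function passRate(generatedCode: string, tests: TestCase[]): number {
  const file = join(tmpdir(), `candidate-${Date.now()}.js`);
  writeFileSync(file, generatedCode);
  let passed = 0;
  for (const t of tests) {
    try {
      // Feed the test input on stdin; compare trimmed stdout to the expectation.
      const out = execFileSync("node", [file], { input: t.input, timeout: 5000 });
      if (out.toString().trim() === t.expected) passed++;
    } catch {
      // Crashes and timeouts count as failures.
    }
  }
  return passed / tests.length;
}
```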
Thompson Sampling
Each model-arm maintains a Beta posterior distribution over quality. Uncertain arms are naturally explored more. The protocol balances quality (60%), cost (20%), and speed (15%) in a single value score per routing decision.
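The sampling step described above can be sketched as follows. All names here (`Arm`, `sampleBeta`, `valueScore`, `pickArm`) are illustrative, cost and speed are assumed pre-normalized to [0, 1], and the stated weights sum to 95%, so any remaining term is omitted.

```typescript
// Hypothetical sketch of Thompson sampling over model arms.
// Each arm keeps a Beta(wins + 1, losses + 1) posterior over win probability.
type Arm = { name: string; wins: number; losses: number; cost: number; speed: number };

// Sample Beta(a, b) for integer a, b via order statistics:
// the a-th smallest of (a + b - 1) uniforms is Beta(a, b)-distributed.
function sampleBeta(a: number, b: number): number {
  const u = Array.from({ length: a + b - 1 }, () => Math.random()).sort((x, y) => x - y);
  return u[a - 1];
}

// Blend sampled quality with cost and speed using the stated weights.
function valueScore(arm: Arm): number {
  const quality = sampleBeta(arm.wins + 1, arm.losses + 1); // exploration lives here
  return 0.6 * quality + 0.2 * (1 - arm.cost) + 0.15 * arm.speed;
}

function pickArm(arms: Arm[]): Arm {
  let best = arms[0];
  let bestScore = valueScore(best);
  for (const arm of arms.slice(1)) {
    const score = valueScore(arm);
    if (score > bestScore) {
      best = arm;
      bestScore = score;
    }
  }
  return best;
}
```

Because quality enters as a posterior draw rather than a point estimate, arms with few observations produce high-variance samples and occasionally win the argmax, which is exactly the "uncertain arms are explored more" behavior described above.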
574 unique prompts · 1,090 API calls · 2,180 pairwise comparisons · 11 categories scored
Every task was sent to both the routed specialist and the GPT-4o baseline. Real prompts, real API calls, real token costs.