We had three agents grade our own router

Two weeks ago we shipped hibrid, an open-source router that sits under your AI tools and keeps the cheap, repetitive calls on a local model so the frontier tier only gets the work that earns it. Last week we metered a single agent session and saw 42% of frontier tokens avoided. Fair, but small — and graded by us.

So this week we did something harsher. We pointed three independent agents at the router and told them to measure it, not flatter it: one counted tokens with and without it, one judged answer quality, one scanned the competition. They ran on the worst box we have — two CPU cores, 8GB, no GPU — and they wrote down everything, including the parts that don't look good. Here's the scorecard.

87%

of frontier tokens avoided

~89%

of frontier answer quality kept (LLM-judged)

11 / 12

routing rules correct

1. how much it saved

The first agent ran 31 calls twice each — once through hibrid (let it route) and once forced to the frontier (what you'd pay without a router) — and compared the token bills. Three workloads: a long refactor loop, a mixed agent session, and a reasoning-heavy run.

workload	calls	kept local	frontier tokens avoided
refactor loop	12	100%	100%
mixed agent session	10	20%	61%
reasoning-heavy	9	100%	100%
overall	31	74%	87.5%

Across the lot: 1,069 frontier tokens with hibrid versus 8,531 without — 87.5% avoided. The refactor loop, the bread and butter of any coding agent, stayed entirely local. The number is bigger than last week's 42% because this mix is more loop-heavy, which is exactly the point: the more your agent grinds, the more of it runs for free.

2. did the savings cost quality?

This is the question that actually matters, and the one most "we cut your costs" posts skip. Cheap is easy if you don't mind worse answers. So the second agent answered eight tasks twice — local and frontier — and had the frontier model act as a blind judge, scoring each answer 0 to 1 against a fixed rubric without being told which model wrote it.

Local answers scored 0.81 on average; the frontier scored 0.91. So the work hibrid kept off the paid tier held about 89% of frontier quality, and 75% of local answers were at parity (within a hair of the frontier score). On easy code fixes, classification, extraction and short explanations, the small local model was genuinely indistinguishable from the big one.

And the honest part: two tasks were clearly worse on local — one code fix (0.5 vs 1.0) and a proof that comparison sorts need Ω(n log n) comparisons (0.5 vs 0.9). Those are precisely the hard items a 1–3B model should not be trusted with. The router is supposed to send those up; on this run, with the box idle, it kept them local and the judge caught the drop. That's a real limitation we'll name in the caveats — hibrid escalates on what a task looks like, not yet on whether the local answer came back good.

3. does it route correctly?

The third check was a 12-rule probe: do the routing guarantees actually hold? Eleven passed. Every hard rule was correct — a prompt with an email, an SSN or a credit-card number stays local even when it's tagged as hard reasoning; allow_cloud=false keeps everything on the machine; a forced destination is always honored; the refactor loop never jumps to the expensive tier. Privacy and the offline switch behave exactly as promised.

The one miss is soft, and we're flagging it rather than hiding it: under load, simple tasks routed to the cheap paid model (haiku) instead of staying local. They never hit the expensive frontier, so it's a tiering nuance, not a cost blowout — but "simple stays local" isn't strictly true in this build, and we'd rather you hear it from us.

4. how it compares

The fourth agent went out and read the field. Routing isn't new — the honest question is where hibrid is actually different, and where it's behind.

tool	local-first	no API key	measures your machine	open source
hibrid	yes	yes	yes	Apache-2.0
RouteLLM	no	no	no	yes
OpenRouter	no	no	no	no
Martian	no	no	no	no
LiteLLM + Ollama (DIY)	yes	no	no	yes
semantic-router	partial	n/a	no	yes

Two things no router we found does together: hibrid reaches the strong tier through the agent CLI you're already signed into (claude, codex, opencode, copilot), so there's no metered per-token key for frontier calls; and it benchmarks your actual hardware at startup and routes on measured speed, not a spec sheet. The popular DIY LiteLLM-plus-Ollama setup is local-first too, but it's manual and still needs a cloud key.

Where the others are plainly ahead, because pretending otherwise would be silly: RouteLLM has peer-reviewed numbers we haven't matched — about 85% cost reduction on MT-Bench while holding 95% of GPT-4 quality. OpenRouter and Portkey cover hundreds to thousands of models with observability and guardrails we don't have. Martian (Accenture-backed, reportedly near a billion-dollar valuation) and Not Diamond have funding and large private eval sets we can't touch. The field's headline savings run from ~32% at no accuracy loss (LLMRouterBench, 2026) to FrugalGPT's contested ~98%. We're a two-week-old project with an 8-task quality run; treat our numbers as a floor with its method attached, not a trophy.

what's soft — on purpose

Small samples (31 calls, an 8-task judged set), a single LLM judge, and short tasks with crisp right answers — indicative, not a leaderboard. CPU-only with 1–3B models is the floor; a GPU or Apple-silicon box runs bigger local models and lifts both coverage and quality. The strong tier here is a flat-rate subscription, so we report frontier tokens avoided, not dollars. And the big one: the router decides before it sees the answer, so it can't yet catch a confidently-wrong local result — the two quality misses above are exactly that gap, and closing it (an output-aware escalation signal) is next on the list. Every script and raw result is in the repo under docs/benchmarks/study2/.

try it, and grade it yourself

pip install git+https://github.com/vfalbor/hibrid.git
hibrid serve
curl localhost:8095/v1/node     # what it measured about your machine
curl localhost:8095/v1/metrics  # calls kept local vs escalated, live

Point your agent at hibrid, run your real loop, and watch the split. If it does worse on your box than on ours, we want the numbers — that's how the routing gets better for the next machine like yours.

The router that knows your machine. Open source. Yours.

We had three agents grade our own router.