An agent's refactor loop: eight calls, zero paid

You point a coding agent at a frontier model and let it run. It writes a function, runs the test, the test fails, it reads the error, fixes a typo, runs the test again. Ten, twenty, fifty rounds. Every one of them billed at frontier prices — and almost none of them is hard. Fixing retrun to return does not need the smartest model on earth.

That's the bill hibrid was built to cut. It's our open-source router that sits under your AI tools and decides, per call, what runs free on your own machine and what's worth handing up to a stronger model. Last week we benchmarked how fast local models run on CPU servers. This week we asked the more honest question: across a real agent session, how much of the work never has to reach the frontier at all? So we metered one.

what we ran

We took a sixteen-call session shaped like an actual agent loop and sent it through hibrid on the smallest box we have — two CPU cores, 8GB of RAM, no GPU. Eight of the calls were the bread and butter of any coding loop: fix a typo, refactor to a comprehension, add a docstring, add type hints, guard a divide-by-zero. Then four light ones — translate, classify sentiment, extract an email, summarize. Two general questions. And two genuinely hard one-shots: design a backpressure strategy for a queue, prove the n log n sorting bound. Those last two are the ones that should cost.

The strong tier here is reached through a Claude subscription via the agent CLI — no API key, no per-token meter. So we're not going to quote you a dollar figure we can't back up. What we can count exactly, from each response's own usage block, is frontier tokens and calls avoided: the work that stayed on the local machine instead of going up the ladder.

9 / 16

calls kept local and free (56%)

42%

of frontier tokens avoided

8 / 8

refactor-loop calls stayed local

the loop ran free

Here's the result that matters. All eight code-loop calls were handled locally — split automatically between llama3.2:3b and the qwen2.5-coder:1.5b specialist — and not one of them touched a paid model. The typo fix, the comprehension, the docstring, the type hints: that's the high-frequency churn of an agent session, and on this tiny box it cost nothing but local compute. The router kept the busywork where busywork belongs.

The two hard reasoning prompts went up the ladder, exactly as they should. And the lighter NLP tasks split: one stayed local, the rest escalated to a cheap frontier model rather than the most expensive one — the router doesn't reach for the top of the ladder when the middle will do.

task type	calls	stayed local	escalated
refactor loop	8	8	0
simple NLP	4	1	3
general Q&A	2	0	2
hard reasoning	2	0	2

Counterfactual: if every one of those sixteen calls had gone to the frontier — the default when there's no router — that's 2,882 frontier tokens. Through hibrid, only 1,681 did. 1,201 frontier tokens, 42% of the total, simply never happened. And the share that did escalate is concentrated in the two prompts that genuinely earned it.

why the loop is where the saving lives

This isn't a coincidence of our task mix. A real agent session is mostly loop — the same small edit-and-retest cycle, over and over, with a few hard decisions sprinkled in. That shape is exactly what the headline number rewards: the more your agent loops, the larger the fraction that hibrid keeps local, because loops are cheap calls and cheap calls stay home. The hard one-shots are rare and they're the ones you actually want a frontier model for.

It also means the saving grows with the boring work, not the exciting work. The flashier the demo — one big clever prompt — the less hibrid does for you. The grindier the session — fifty rounds of fix-and-retest — the more of it runs for free.

where this is soft — on purpose

Sixteen calls is one session, not a leaderboard; treat these as indicative. The strong tier here is a flat-rate subscription reached with no API key, so the honest unit is frontier tokens and calls avoided, not dollars saved — your dollar figure depends on what you'd otherwise have paid per token. This ran on a two-core, 8GB, GPU-less box with 1–3B local models, which is the floor: a GPU or an Apple-silicon machine runs larger local models and keeps even more off the paid tier. And the local models have to be good enough — last week's study showed a 0.5B model inventing email addresses, which is exactly why hibrid escalates instead of trusting a too-small model blindly. The run script and its raw output are in the repo, under docs/benchmarks/.

try it on your own session

pip install git+https://github.com/vfalbor/hibrid.git
hibrid serve
curl localhost:8095/v1/node     # what it measured about your machine
curl localhost:8095/v1/metrics  # calls kept local vs escalated, live

Point your agent at hibrid instead of straight at the frontier, run your normal loop, and watch /v1/metrics. The number you care about is the share that stayed local — and on a grindy session, it's bigger than you'd guess.

The router that knows your machine. Open source. Yours.

An agent's refactor loop: eight calls, zero paid.

what we ran

the loop ran free

why the loop is where the saving lives

where this is soft — on purpose

try it on your own session