The most common mistake when you set up a local model is treating the choice as a memory problem. You check how much VRAM your GPU has, find the biggest model that fits, download it and go. It works… until you ask it to reason through a math problem and it hands you an answer that's confident, fluent and wrong.
hibrid's new task→model mapping starts from a different premise: competence first, hardware second. A model isn't "good" or "bad" in the abstract. It's good on an axis. And the right question isn't whether it fits, but whether it's competent on the axis your task needs and fits on this machine.
a model doesn't have one grade, it has five
hibrid sorts every task into one of five axes:
- general — short instructions, extraction, summaries
- writing — drafting, copy, long-form
- code — generating and fixing code
- reasoning — math, proofs, deep debugging
- multilingual — translation and work in Spanish
Why splitting them matters: a model trained hard on code learns to predict the next line, not to hold a long chain of deduction. Qwen2.5-Coder writes excellent functions and still gets lost in a multi-step reasoning problem. They're different skills, trained on different data. Treating "it's good" as a single grade is what leads you to use the wrong model with total confidence.
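To make the five axes concrete, here is a minimal sketch of a task classifier. It is not hibrid's classifier: the `Axis` enum, the `KEYWORD_HINTS` table and the `classify` function are hypothetical names, and substring matching is only a stand-in for whatever the real router does.

```python
from enum import Enum


class Axis(Enum):
    GENERAL = "general"
    WRITING = "writing"
    CODE = "code"
    REASONING = "reasoning"
    MULTILINGUAL = "multilingual"


# Hypothetical keyword hints, for illustration only. A real classifier
# would need something more robust than substring matching.
KEYWORD_HINTS = [
    (Axis.REASONING, ("prove", "proof", "solve", "theorem")),
    (Axis.CODE, ("function", "bug", "refactor", "stack trace")),
    (Axis.MULTILINGUAL, ("translate", "spanish")),
    (Axis.WRITING, ("draft", "blog post", "essay")),
]


def classify(prompt: str) -> Axis:
    """Return the first axis whose hint appears in the prompt, else GENERAL."""
    text = prompt.lower()
    for axis, hints in KEYWORD_HINTS:
        if any(hint in text for hint in hints):
            return axis
    return Axis.GENERAL
```

The order of `KEYWORD_HINTS` matters in this sketch: the first matching axis wins, and anything that matches nothing falls back to `Axis.GENERAL`.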
who wins each axis locally (≤14B)
From benchmark data, not intuition:
| Axis | Local winner | Signal |
|---|---|---|
| Code | Qwen2.5-Coder | 7B: HumanEval+ 84.1 · 14B: 87.2, BigCodeBench 48.4 |
| Writing / general | Qwen3-8B | IFEval 85, MMLU 85.9 |
| Multilingual (Spanish) | Aya-Expanse-8B | m-ArenaHard 76.6% |
| Reasoning / math | Phi-4-reasoning-plus | AIME-24 81%, GPQA 69% |
Three quick reads. In code, size wins: Qwen2.5-Coder's 14B climbs to 87.2 on HumanEval+ against the 7B's 84.1. If the task is code and you have the VRAM, the bigger model pays off. In writing, it doesn't: Qwen3-8B beats Qwen2.5-14B with a little over half the parameters. Bigger isn't better; better-trained is.
In reasoning, mind the fine print. Phi-4-reasoning-plus and the DeepSeek-R1 distills are thinking models: they think out loud before answering, and that costs 2–5× the tokens and latency. They shine on a hard problem; they're a waste for classifying an email.
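The cost of thinking out loud is easy to put in numbers. The sketch below is back-of-the-envelope arithmetic, not a measurement: `response_seconds` is a hypothetical helper, and the 200-token answer and 20 tok/s speed are illustrative values chosen for the example.

```python
def response_seconds(
    answer_tokens: int,
    tok_per_s: float,
    thinking_multiplier: float = 1.0,
) -> float:
    """Estimate generation time when a model emits `thinking_multiplier`
    times the tokens of a direct answer (1.0 means no visible thinking)."""
    if tok_per_s <= 0:
        raise ValueError("tok_per_s must be positive")
    return answer_tokens * thinking_multiplier / tok_per_s


# Illustrative numbers: a 200-token answer on a node doing 20 tok/s.
direct = response_seconds(200, 20.0)                                  # 10 s
thinking_low = response_seconds(200, 20.0, thinking_multiplier=2.0)   # 20 s
thinking_high = response_seconds(200, 20.0, thinking_multiplier=5.0)  # 50 s
```

Under those assumptions, a reply that takes 10 seconds from a direct model takes 20 to 50 seconds from a thinking model, which is fine for a proof and hard to justify for an email label.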
the other half: what your machine can hold
Knowing who wins each axis is useless if the winner doesn't fit. The second factor is the node's real capacity, and it comes down to two numbers. The first is the memory footprint: at Q4 quantization, count roughly 0.6 GB per billion parameters, so an 8B model takes about 5 GB (8 × 0.6 ≈ 4.8) in weights alone. The second is the KV cache, which grows with context: the longer the conversation, the more memory you eat on top of the weights. Most people size for the weights and forget the context.
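Both numbers can be estimated in a few lines. This is a rough sketch: the 0.6 GB per billion parameters comes from the rule of thumb above, the KV-cache formula is the standard one for transformer attention (one key and one value tensor per layer), and the architecture values in the example (32 layers, 8 KV heads, head dimension 128, 16-bit cache) are assumptions for a generic 8B-class model, not the specs of any model named here.

```python
GB = 1e9  # decimal gigabytes, in line with the rough 0.6 GB/B rule


def weights_gb(params_billion: float, gb_per_billion: float = 0.6) -> float:
    """Approximate Q4 weight footprint from the rule of thumb."""
    return params_billion * gb_per_billion


def kv_cache_gb(
    context_tokens: int,
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,  # assumes a 16-bit (fp16/bf16) cache
) -> float:
    """KV-cache size: 2 tensors (K and V) per layer, per token."""
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token_bytes * context_tokens / GB


def total_gb(params_billion: float, context_tokens: int, **arch: int) -> float:
    """Weights plus KV cache; ignores activations and runtime overhead."""
    return weights_gb(params_billion) + kv_cache_gb(context_tokens, **arch)


# Assumed architecture for a generic 8B-class model (illustrative only).
ARCH_8B = {"n_layers": 32, "n_kv_heads": 8, "head_dim": 128}
short_chat = total_gb(8, 8_192, **ARCH_8B)
long_chat = total_gb(8, 32_768, **ARCH_8B)
```

With those assumed values the cache adds about 1.07 GB at 8,192 tokens and about 4.3 GB at 32,768, so a long conversation can nearly double the footprint of the weights.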
From that comes a simple node taxonomy:
| Tier | Memory | Model it holds |
|---|---|---|
| cpu_small | ≤8 GB | up to 3B |
| cpu_large / Apple 16 GB | ~16 GB | 7B |
| gpu_12gb | 12 GB | 8B |
| gpu_24gb+ | 24 GB+ | 32B |
Q4_K_M is the sweet spot: it shrinks the weights with no quality loss you would notice in practice. There is also an operating floor for interactive use: above 15 tok/s. Below that, writing with the model turns into waiting for the model.
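The taxonomy and the speed floor translate directly into two checks. A minimal sketch, assuming the tier names from the table are usable as keys (`cpu_large` stands for the "cpu_large / Apple 16 GB" row); `fits_tier` and `is_interactive` are hypothetical helpers, not hibrid functions.

```python
# Largest model size, in billions of parameters, per tier (from the table).
TIER_MAX_PARAMS_B = {
    "cpu_small": 3,
    "cpu_large": 7,   # also covers the Apple 16 GB row
    "gpu_12gb": 8,
    "gpu_24gb+": 32,
}

# Interactive use needs strictly more than this many tokens per second.
INTERACTIVE_FLOOR_TOK_S = 15.0


def fits_tier(tier: str, params_billion: float) -> bool:
    """True if a model of this size is within the tier's ceiling."""
    return params_billion <= TIER_MAX_PARAMS_B[tier]


def is_interactive(tok_per_s: float) -> bool:
    """True if measured throughput is above the interactive floor."""
    return tok_per_s > INTERACTIVE_FLOOR_TOK_S
```

For example, `fits_tier("gpu_12gb", 14)` is false even though a 14B model might load with a short context, because the table caps that tier at 8B to leave room for the KV cache.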
how hibrid solves it
The router joins the two halves into a single decision, in this order:
- It classifies the task and assigns it an axis.
- It picks the best model on that axis that fits this specific node. Not the biggest: the best on the right axis within the memory budget.
- If the task is hard, it escalates to the paid tier (Opus, GPT) using the user's own subscription via cli:claude, with no API key and no separate bill.
The point is this isn't a black box. The table task_type → (axis, tier ladder, per-tier preference) is the source of truth, and it's served whole. A translation stays local with Aya and never touches the expensive tier; a math proof jumps straight to the strong model. Every one of those rules is readable and auditable.
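The decision described above can be sketched in a few lines. This is an illustration of the idea, not hibrid's router: `LocalModel`, `CATALOG` and `route` are hypothetical names, the scores reuse one benchmark figure per model from the table earlier (comparable only within an axis), the 14B size for Phi-4-reasoning-plus is an assumption, and returning `None` when nothing local fits is a choice made for this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LocalModel:
    name: str
    axis: str
    params_billion: float
    score: float  # one benchmark figure per model; only compared within an axis


# Illustrative catalog built from the winners table.
CATALOG = [
    LocalModel("Qwen2.5-Coder-7B", "code", 7, 84.1),
    LocalModel("Qwen2.5-Coder-14B", "code", 14, 87.2),
    LocalModel("Qwen3-8B", "writing", 8, 85.0),
    LocalModel("Aya-Expanse-8B", "multilingual", 8, 76.6),
    LocalModel("Phi-4-reasoning-plus", "reasoning", 14, 81.0),  # size assumed
]

PAID_TIER = "cli:claude"


def route(axis: str, max_params_billion: float, hard: bool = False) -> Optional[str]:
    """Pick the best-scoring model on the axis that fits the node.

    Hard tasks escalate to the paid tier. Returns None when no local
    model fits; what to do then is left open in this sketch.
    """
    if hard:
        return PAID_TIER
    candidates = [
        m for m in CATALOG
        if m.axis == axis and m.params_billion <= max_params_billion
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda m: m.score).name
```

On a node capped at 8B, `route("code", 8)` picks the 7B coder rather than the 14B one: the best model on the right axis within the budget, not the biggest model overall.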
your turn
Hit GET /v1/policy on your own node. You'll see the five axes, the tier ladder for each task type, and which local model wins on your machine with the memory you have. If something doesn't line up with your hardware, there's the number that explains it — not in a forum, in your endpoint.
```bash
pip install git+https://github.com/vfalbor/hibrid.git
hibrid serve
curl localhost:8095/v1/policy   # the 5 axes and the model that wins on YOUR machine
```
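If you would rather read the policy from a script, the same request works with the Python standard library alone. A small sketch, assuming the endpoint answers with JSON (the response format is not specified here); `policy_url` and `fetch_policy` are hypothetical helper names, and nothing about the shape of the payload is assumed.

```python
import json
import urllib.request


def policy_url(host: str = "localhost", port: int = 8095) -> str:
    """Build the policy URL for a node (defaults match the curl example)."""
    return f"http://{host}:{port}/v1/policy"


def fetch_policy(url: str, timeout: float = 5.0):
    """GET the policy and decode it, assuming the body is JSON."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

With `hibrid serve` running, `print(json.dumps(fetch_policy(policy_url()), indent=2))` pretty-prints whatever the node returns.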
The router that knows your machine. Open source. Yours.