By last November I'd given up on Show HN. Thirty launches a day, half of them "I built a thing with Claude," and cloning each one to decide if it was worth my afternoon had stopped being fun. So I wrote a pipeline. It pulls the HN front page every day at 15:00 UTC, boots a python:3.12-alpine container per tool, installs whatever the repo asks for, runs a QA script an LLM writes on the fly, and scores the result on 11 criteria, 10 points each, for a total out of 110. Output is public at tokenstree.eu. Free. Open source. No login.

This is what today's top LLM review looks like.

worth-watching · 2026-04-20
novelty 8/10
current_relevance 8/10
differentiation 7/10
hn_sentiment 6/10
community 5/10
documentation 5/10
maturity 5/10
system_requirements 5/10
performance 5/10
ease_of_use 4/10
ease_of_integration 3/10
"Low ease_of_integration: no public endpoint, no embedded API — batch use is impossible. The novelty comes from side-by-side comparison across Claude/GPT/Gemini tokenizers, which hadn't been bundled this cleanly before."

Every score is attached to a justification like that one, citing real numbers from the run.
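The "numbers, not adjectives" rule is checkable by machine. A toy version of that check (the record shape and the digit regex are my assumptions, not the repo's actual format):

```python
import re

# Toy lint: a justification should cite at least one concrete number.
# The record shape here is invented for illustration.
review = {
    "criterion": "performance",
    "score": 5,
    "justification": "Install finished in 41 s; 4 of 5 QA checks passed.",
}

def cites_numbers(justification: str) -> bool:
    """True if the justification contains at least one digit."""
    return bool(re.search(r"\d", justification))

print(cites_numbers(review["justification"]))            # True
print(cites_numbers("Fast, polished, well documented"))  # False
```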

how it works

Seven stages, all scripted, all open source:

  1. Scrape the top 30 HN items
  2. Classify each as tool or article (llama-3.1-8b via Groq)
  3. For tools: generate a Python QA script tailored to the repo, run it in a 512MB Alpine container with a 180-second timeout
  4. Enrich with GitHub, PyPI, npm, and HN comment data
  5. Score the 11 criteria with llama-3.3-70b — each score attached to a justification that cites numbers, not adjectives
  6. Rank and tag (must-try / worth-watching / niche / skip)
  7. Email digest and publish to the site
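The seven stages can be sketched as a single driver loop. Every function below is a stub I invented (the real versions call Groq, Docker, and the GitHub/PyPI/npm APIs), and the 11 criteria are reduced to two:

```python
# Runnable sketch of the seven stages; stage functions are stand-ins,
# not the repo's actual API.

def classify(item):                 # stage 2: tool vs. article
    return "tool" if item.get("repo") else "article"

def run_qa(item):                   # stage 3: QA script in a 512MB container
    return {"install_ok": True, "tests_passed": 4}

def enrich(item):                   # stage 4: GitHub/PyPI/npm/HN metadata
    return {"stars": item.get("stars", 0)}

def score(qa, meta):                # stage 5: scored criteria
    return {"maturity": min(10, meta["stars"] // 100),
            "ease_of_use": 8 if qa["install_ok"] else 2}

def pipeline(items):
    reviews = []
    for item in items[:30]:         # stage 1: top 30 front-page items
        if classify(item) != "tool":
            continue
        scores = score(run_qa(item), enrich(item))
        total = sum(scores.values())
        tag = "must-try" if total >= 15 else "worth-watching"   # stage 6
        reviews.append({"title": item["title"], "total": total, "tag": tag})
    # stage 7 (email digest + site publish) omitted from the sketch
    return sorted(reviews, key=lambda r: -r["total"])

items = [{"title": "tokenizer-compare", "repo": "x/y", "stars": 900},
         {"title": "Some launch-day blog post"}]
print(pipeline(items))
```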

The prompts and weights live in /skills/ as plain markdown. Median run: 94 seconds per tool. Monthly Groq bill: about $12.
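Keeping weights in markdown means the scorer can parse them straight out of the skill file. A minimal sketch, assuming a table layout that /skills/scorer.md may or may not actually use:

```python
import re

# Hypothetical skill-file fragment; the real scorer.md layout may differ.
SKILL_MD = """
| criterion        | weight |
|------------------|--------|
| novelty          | 1.5    |
| hn_sentiment     | 0.5    |
| documentation    | 1.0    |
"""

def load_weights(md):
    """Pull (criterion, weight) pairs out of a markdown table."""
    return {name: float(w)
            for name, w in re.findall(r"\|\s*(\w+)\s*\|\s*([\d.]+)\s*\|", md)}

def weighted_total(scores, weights):
    """Weighted sum; unlisted criteria default to weight 1.0."""
    return sum(scores[c] * weights.get(c, 1.0) for c in scores)

weights = load_weights(SKILL_MD)
print(weighted_total({"novelty": 8, "hn_sentiment": 6, "documentation": 5},
                     weights))   # 20.0
```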

where the rubric is wrong

I'll save you the trouble of finding it.

hn_sentiment is the criterion I trust least. It scans the HN comments with an LLM and returns 1-10. In practice it pulls the total score toward the mean — a thread with 30 comments and one grumpy reply gets a 6/10 even when the tool is genuinely good. I keep it because dropping it shuffles the rankings more than I'm comfortable with, but I'm ~60% sure I shouldn't.
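The ranking shuffle is plain arithmetic. With two invented tools, including hn_sentiment hands the top spot to the tool with the friendlier thread; excluding it reverses the order:

```python
# Invented scores showing how hn_sentiment can flip a ranking.
# other_ten_total is the sum of the other ten criteria (out of 100).
tool_a = {"other_ten_total": 84, "hn_sentiment": 6}   # strong tool, one grumpy reply
tool_b = {"other_ten_total": 81, "hn_sentiment": 10}  # weaker tool, adoring thread

def total(tool, include_sentiment=True):
    return tool["other_ten_total"] + (tool["hn_sentiment"] if include_sentiment else 0)

print(total(tool_a), total(tool_b))                # 90 91: B ranks first
print(total(tool_a, False), total(tool_b, False))  # 84 81: A ranks first
```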

Two more weak spots I already know about:

break it

The repo is CC BY 4.0 at github.com/vfalbor/llm-daily-review. Weights live in /skills/scorer.md. If you think hn_sentiment should weigh less, open an issue with the reweighted vector applied to last week's top 5 and I'll merge. If you think the whole criterion is broken, tell me what to replace it with — I want the specific alternative, not "use vibes."
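For the avoidance of doubt, this is the shape of evidence I mean by a reweighted vector, shrunk to two invented tools and three of the 11 criteria:

```python
# Invented example of "apply a reweighted vector": re-rank the same
# scores under proposed per-criterion weights. Not real review data.
CRITERIA = ["novelty", "hn_sentiment", "ease_of_use"]   # 3 of the 11

last_week = {
    "tool-a": {"novelty": 9, "hn_sentiment": 4, "ease_of_use": 7},
    "tool-b": {"novelty": 6, "hn_sentiment": 10, "ease_of_use": 6},
}

def rerank(scores, weights):
    totals = {name: sum(s[c] * weights[c] for c in CRITERIA)
              for name, s in scores.items()}
    return sorted(totals, key=totals.get, reverse=True)

baseline = rerank(last_week, {"novelty": 1, "hn_sentiment": 1, "ease_of_use": 1})
proposed = rerank(last_week, {"novelty": 1, "hn_sentiment": 0.25, "ease_of_use": 1})
print(baseline)   # ['tool-b', 'tool-a']
print(proposed)   # ['tool-a', 'tool-b']
```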

I'd also pay attention to anyone who can propose a cleaner way to handle auth-gated installs that doesn't degenerate into "trust the README."

Today's review is at tokenstree.eu. The weekly top 5 drops every Friday. I read every issue.