By last November I'd given up on Show HN. Thirty launches a day, half of them "I built a thing with Claude," and cloning each one to decide if it was worth my afternoon had stopped being fun. So I wrote a pipeline. It pulls the HN front page every day at 15:00 UTC, boots a `python:3.12-alpine` container per tool, installs whatever the repo asks for, runs a QA script an LLM writes on the fly, and scores the result across 11 criteria for a total out of 110. Output is public at tokenstree.eu. Free. Open source. No login.
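The per-tool sandbox boils down to a single container invocation with hard resource limits. Here's a minimal sketch of what that command assembly could look like; the function name and flag layout are my own guesses, but the image, the 512MB memory cap, and the 180-second timeout come straight from the pipeline described below:

```python
def build_sandbox_cmd(repo_dir: str, qa_script: str) -> list[str]:
    """Assemble a `docker run` invocation for one tool's QA pass.

    Hypothetical sketch: the flag choices are assumptions, but the image,
    memory cap, and timeout match the pipeline described in the post.
    """
    return [
        "docker", "run", "--rm",
        "--memory", "512m",            # hard RAM cap per tool
        "-v", f"{repo_dir}:/repo",     # mount the cloned repo
        "python:3.12-alpine",
        "timeout", "180",              # kill the QA script after 180 seconds
        "python", f"/repo/{qa_script}",
    ]
```

The point of building the argument list in one place is that the limits live next to each other, so changing "512MB, 180 seconds" is a two-token diff.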
This is what today's top LLM review looks like.
"ease_of_integration: no public endpoint, no embedded API — batch use is impossible. The novelty comes from side-by-side comparison across Claude/GPT/Gemini tokenizers, which hadn't been bundled this cleanly before."
Every score is attached to a justification like that one, citing real numbers from the run.
how it works
Seven stages, all scripted, all open source:
- Scrape the top 30 HN items
- Classify each as tool or article (`llama-3.1-8b` via Groq)
- For tools: generate a Python QA script tailored to the repo, run it in a 512MB Alpine container with a 180-second timeout
- Enrich with GitHub, PyPI, npm, and HN comment data
- Score the 11 criteria with `llama-3.3-70b`; each score attached to a justification that cites numbers, not adjectives
- Rank and tag (`must-try` / `worth-watching` / `niche` / `skip`)
- Email digest and publish to the site
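The rank-and-tag step can be sketched as a simple threshold map over the 0-110 total. The four tag names are the real ones; the cutoffs below are illustrative guesses, not the pipeline's actual thresholds:

```python
def tag_for(total: int) -> str:
    """Map a 0-110 total to one of the four digest tags.

    The thresholds here are hypothetical; only the tag names
    come from the pipeline itself.
    """
    if total >= 90:
        return "must-try"
    if total >= 70:
        return "worth-watching"
    if total >= 50:
        return "niche"
    return "skip"
```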
The prompts and weights live in `/skills/` as plain markdown. Median run: 94 seconds per tool. Monthly Groq bill: about $12.
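Because the weights live as plain markdown, loading them can be a few-line parser. This assumes a hypothetical `- criterion: weight` bullet format; the real layout of `scorer.md` may differ:

```python
def parse_weights(markdown: str) -> dict[str, float]:
    """Pull `- name: weight` bullets out of a skills markdown file.

    The line format is an assumption; adjust to the real scorer.md.
    Bullets whose value isn't a number are skipped as prose.
    """
    weights: dict[str, float] = {}
    for line in markdown.splitlines():
        line = line.strip()
        if line.startswith("- ") and ":" in line:
            name, _, value = line[2:].partition(":")
            try:
                weights[name.strip()] = float(value)
            except ValueError:
                continue  # prose bullet, not a weight
    return weights

weights = parse_weights("- hn_sentiment: 0.5\n- ease_of_integration: 1.0\n")
```

Keeping weights in markdown rather than code means a rubric change is a readable diff in the repo, which matters for the "open an issue with a reweighted vector" workflow below.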
where the rubric is wrong
I'll save you the trouble of finding it.
`hn_sentiment` is the criterion I trust least. It scans the HN comments with an LLM and returns a 1-10 score. In practice it pulls the total score toward the mean: a thread with 30 comments and one grumpy reply gets a 6/10 even when the tool is genuinely good. I keep it because dropping it shuffles the rankings more than I'm comfortable with, but I'm ~60% sure I shouldn't.
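The mean-pull is easy to see numerically. Take a hypothetical tool that scores 9 on the other ten criteria but gets the grumpy-thread 6 on sentiment:

```python
# Ten criteria scored 9/10, plus hn_sentiment returning 6 because of
# one grumpy reply in a 30-comment thread. Numbers are hypothetical.
criteria = [9] * 10 + [6]
total = sum(criteria)       # 96 out of 110
fair = sum([9] * 11)        # 99, if sentiment tracked actual quality
penalty = fair - total      # the criterion alone costs 3 points
```

Three points out of 110 sounds small, but near the top of the ranking that's often the gap between `must-try` and `worth-watching`.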
Two more weak spots I already know about:
- Auth-gated tools (anything needing a hosted key) fail the install test by default. The score lies until I override by hand.
- Research repos with a `README.md` pointing to a private cluster get punished on `basic_run_success`. Sometimes the paper is the product.
break it
The repo is CC BY 4.0 at github.com/vfalbor/llm-daily-review. Weights live in `/skills/scorer.md`. If you think `hn_sentiment` should weigh less, open an issue with the reweighted vector applied to last week's top 5 and I'll merge. If you think the whole criterion is broken, tell me what to replace it with; I want the specific alternative, not "use vibes."
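If you want to sanity-check a reweighting before opening that issue, the re-rank is a few lines over per-criterion scores. Everything below is made-up data for two tools and two criteria; the point is the shape of the comparison, not the numbers:

```python
# Hypothetical per-criterion scores (only two criteria shown for brevity).
tools = {
    "tool_a": {"hn_sentiment": 5, "ease_of_integration": 9},
    "tool_b": {"hn_sentiment": 9, "ease_of_integration": 6},
}

def rank(weights: dict[str, float]) -> list[str]:
    """Order tools by weighted total under a given weight vector."""
    totals = {
        name: sum(weights[c] * s for c, s in scores.items())
        for name, scores in tools.items()
    }
    return sorted(totals, key=totals.get, reverse=True)

baseline = rank({"hn_sentiment": 1.0, "ease_of_integration": 1.0})
reweighted = rank({"hn_sentiment": 0.25, "ease_of_integration": 1.0})
```

With these invented scores, cutting the sentiment weight to 0.25 flips the order: the tool with the grumpy thread but the cleaner integration story moves to the top. That's exactly the kind of before/after an issue should show for last week's real top 5.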
I'd also pay attention to anyone who can propose a cleaner way to handle auth-gated installs that doesn't degenerate into "trust the README."
Today's review is at tokenstree.eu. The weekly top 5 drops every Friday. I read every issue.