A few weeks ago I built a pipeline to deal with HN overload: scrape the front page, filter for posts that are real, installable tools, run them in Docker, score them across 11 criteria, and send me a daily digest.

I didn't expect the data to be interesting on its own.


HN points don't predict install success. Tools with 400+ points failed to install more often than tools with 80 points. The crowd votes on ideas, not on whether the thing actually works.
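If you log each run, this relationship is easy to check for yourself. A minimal sketch, assuming a results list with hypothetical `points` and `install_ok` fields (not the pipeline's real schema):

```python
from collections import defaultdict

def success_rate_by_bucket(results, bucket_size=100):
    """Group runs into HN-point buckets and compute install success per bucket."""
    buckets = defaultdict(lambda: [0, 0])  # bucket -> [successes, total]
    for r in results:
        b = (r["points"] // bucket_size) * bucket_size
        buckets[b][0] += r["install_ok"]
        buckets[b][1] += 1
    return {b: ok / total for b, (ok, total) in sorted(buckets.items())}

# Illustrative data only -- four made-up runs, not my actual logs.
runs = [
    {"points": 420, "install_ok": False},
    {"points": 85, "install_ok": True},
    {"points": 460, "install_ok": False},
    {"points": 95, "install_ok": True},
]
rates = success_rate_by_bucket(runs)
```

With a few weeks of real logs behind it, a table like this is enough to see whether the high-point buckets really do fail more.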

The hardest classification problem isn't spam. It's distinguishing "a new tool" from "an article about a tool." A post titled "I built X" often links to a Medium post which links to a GitHub repo which has no README. The LLM gets this wrong more than I expected.
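A cheap heuristic pre-filter can catch the obvious cases before the LLM sees them. This is a sketch, not the pipeline's actual logic, and the host lists are illustrative:

```python
from urllib.parse import urlparse

# Hosts whose links usually point at an article *about* a tool,
# versus hosts that are the tool's repo itself (illustrative lists).
ARTICLE_HOSTS = {"medium.com", "dev.to", "substack.com", "hackernoon.com"}
REPO_HOSTS = {"github.com", "gitlab.com", "codeberg.org"}

def classify_link(url):
    """Cheap pre-filter: 'repo', 'article', or 'unknown' (defer to the LLM)."""
    host = urlparse(url).netloc.lower().removeprefix("www.")
    if host in REPO_HOSTS:
        return "repo"
    if host in ARTICLE_HOSTS or host.endswith(".medium.com"):
        return "article"
    return "unknown"
```

An "article" result still needs the chain followed (the Medium post may link a repo), but at least the ambiguity is flagged before the expensive call.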

LLM-generated test scripts work about 70% of the time. The model reads the repo, decides the install strategy, writes Python, and runs it blind in a container. Compiled languages and tools with system dependencies break it. Python-first tools almost always pass.
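The run-it-blind step can be sketched as a throwaway container with a hard timeout. The image, mount point, and timeout below are assumptions, not the pipeline's real configuration:

```python
import subprocess

def build_docker_cmd(script_path, image="python:3.12-slim"):
    """Assemble the docker invocation; the generated script is mounted read-only."""
    return [
        "docker", "run", "--rm",
        "-v", f"{script_path}:/test.py:ro",
        image, "python", "/test.py",
    ]

def run_generated_test(script_path, image="python:3.12-slim", timeout=300):
    """Return (passed, combined output); a hung install counts as a failure."""
    try:
        proc = subprocess.run(
            build_docker_cmd(script_path, image),
            capture_output=True, text=True, timeout=timeout,
        )
        return proc.returncode == 0, proc.stdout + proc.stderr
    except subprocess.TimeoutExpired:
        return False, "timed out"
```

A plain Python base image is exactly why Python-first tools pass: anything needing a compiler or apt packages dies unless the generated script thinks to install them first.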

Novelty scores high and maturity scores low, which is expected. But the gap widens on days when a specific domain clusters: four similar tools in one week all score "niche" against each other, even though each would score "strong candidate" in isolation.
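The clustering effect falls out of any batch-relative scoring. Here is a toy stand-in for the rubric (not the LLM's actual scoring) that rates novelty as one minus the highest token overlap with any same-day peer:

```python
def novelty(description, batch):
    """Novelty relative to the batch: 1 minus the best Jaccard overlap
    with any peer. A toy illustration of batch-relative scoring."""
    words = set(description.lower().split())
    overlaps = [
        len(words & set(p.lower().split())) / len(words | set(p.lower().split()))
        for p in batch if p != description
    ]
    return 1.0 - max(overlaps, default=0.0)

desc = "terminal markdown viewer"
peers = ["terminal markdown reader", "markdown viewer for terminal"]
```

Alone in its batch, `desc` scores a full 1.0; add two near-duplicate peers and the same description drops sharply, which is exactly the four-similar-tools-in-one-week effect.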


How it works

Every day at 15:00 UTC the pipeline runs its seven steps, from scraping the front page through sending the digest.
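The orchestration itself is thin. A minimal sketch of the daily driver, with stub lambdas standing in for the real steps (which aren't enumerated here):

```python
def run_pipeline(steps, seed=None):
    """Thread each step's output into the next; stop early if a stage
    leaves nothing to process."""
    data = seed
    for name, step in steps:
        data = step(data)
        if not data:
            print(f"stopped after {name}: nothing left to process")
            break
    return data

# Stub stages mirroring the flow described in the intro.
steps = [
    ("scrape", lambda _: ["post-a", "post-b", "post-c"]),
    ("filter", lambda posts: [p for p in posts if p != "post-b"]),
    ("score", lambda tools: {t: 7 for t in tools}),
]
```

The early-exit matters in practice: on a day when nothing on the front page is an installable tool, there's no reason to spin up Docker at all.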

The scoring rubrics and classification prompts live in /skills/ — plain markdown files tunable to your interests. Security tools, infra, ML research — fork it and change the weights.
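One plausible shape for such a file is a bullet list of criterion weights. The parser below assumes that format, which is a guess at what a tunable rubric could look like, not the repo's actual /skills/ schema:

```python
import re

def load_weights(markdown_text):
    """Pull `- name: weight` bullets out of a skill file."""
    weights = {}
    for m in re.finditer(r"^- *([\w -]+?) *: *([\d.]+) *$", markdown_text, re.M):
        weights[m.group(1)] = float(m.group(2))
    return weights

# A hypothetical rubric fragment.
rubric = """\
## Scoring rubric
- novelty: 2.0
- maturity: 0.5
- install-success: 3.0
"""
```

Keeping the weights in plain markdown means a fork only has to edit text, never code, to retarget the digest at security tools or ML research.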

What this is not

A replacement for HN. HN is where I discover everything. This just helps me not miss the good stuff when I can't sit down for an hour to read. The signal is still there; the pipeline just helps me find it faster.

The whole thing is CC BY 4.0. If you want to improve the classification logic, scoring criteria, or add support for other package registries (Crates, Go modules, etc.) — PRs are welcome. The prompts especially need more eyes.

I'm mainly curious whether the "points don't predict install success" finding matches anyone else's intuition, or whether my sample is still too small.