The H4 for AI workflows.
In the 18th century, sailors had no way to fix longitude at sea. Then Harrison built the H4 chronometer, and navigation had truth. We're building the H4 for AI workflows.

Define real tasks. Explore tasks. Submit a task. Run an eval.
Real job descriptions (JDs), real résumés, real SEC filings: never synthetic, never gameable. Trap-street probes are seeded into the held-out set.
Public tasks are open data. Anyone can browse what the world is being measured on. Held-out and Live Mode tasks remain private.
Three submission tiers: Bronze (CLI), Silver (audit-eligible API), Gold (we run it ourselves on held-out + Live Mode).
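A sketch of what a Silver-tier submission could look like over an audit-eligible API. Everything here is a hypothetical illustration: the endpoint, payload fields, and auth header are assumptions, not a documented interface.

```python
# Hypothetical Silver-tier submission. The endpoint, payload shape, and auth
# scheme are illustrative assumptions, not a documented API.
import os

import requests

SUBMIT_URL = "https://api.example.com/v1/submissions"  # hypothetical endpoint

payload = {
    "tier": "silver",                    # Bronze = CLI, Silver = audit-eligible API
    "workflow": "acme-resume-screener",  # hypothetical workflow identifier
    "results": [
        {"task_id": "public-0042", "output": "...", "trace_id": "tr_abc123"},
    ],
}

resp = requests.post(
    SUBMIT_URL,
    json=payload,
    headers={"Authorization": f"Bearer {os.environ['EVAL_API_KEY']}"},  # hypothetical key
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # e.g. a submission id and the tier badge it earned
```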
Pydantic Evals + LLM-as-judge wrapped in Langfuse traces. 200 tasks. 5 minutes. Scores published with full provenance.
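A minimal sketch of the harness shape using the Pydantic Evals primitives (Case, Dataset, LLMJudge). The task function, case contents, and rubric are placeholders, and exporting the resulting spans to Langfuse (via its SDK or OpenTelemetry) is assumed rather than shown.

```python
# Sketch of the eval harness: Pydantic Evals cases scored by an LLM judge.
# Case contents, the rubric, and the task under test are placeholders.
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import LLMJudge


def screen_resume(inputs: str) -> str:
    """Placeholder for the workflow under test (e.g. a résumé-screening agent)."""
    return f"Shortlist for: {inputs}"


dataset = Dataset(
    cases=[
        Case(
            name="jd-0001",
            inputs="Senior data engineer JD, posted today",
            expected_output="A shortlist grounded only in the supplied résumés",
        ),
    ],
    evaluators=[
        LLMJudge(rubric="The output cites only facts present in the source documents."),
    ],
)

# Run every case against the workflow and score it; the emitted spans are what
# gets published as the score's provenance.
report = dataset.evaluate_sync(screen_resume)
report.print()
```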
Every score wears its evidence on its sleeve.
The user's question is "how do I know this score is real?", not "how good is the tool?" The tier badge answers the first question. The score answers the second.
Truth, not theories.
Every score is re-runnable. Anyone can clone the harness, replay our traces, and verify the verdict. That's the H4 standard.
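A sketch of the replay step, assuming the cloned harness ships its public cases as a Pydantic Evals dataset file; the filename, the workflow stub, and the published score you would compare against are all placeholders.

```python
# Replay sketch: load the cloned public cases, re-run the workflow, and compare
# the per-case results and averages against the published score.
from pydantic_evals import Dataset


def workflow_under_test(inputs: str) -> str:
    """Placeholder for the submission whose score is being verified."""
    return f"answer for: {inputs}"


dataset = Dataset.from_file("public_tasks.yaml")  # hypothetical cloned dataset file
report = dataset.evaluate_sync(workflow_under_test)
report.print()  # compare against the published score and its traces
```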
Today's SEC filings. Today's LinkedIn JDs. Tasks that did not exist in any model's training data. A moat that renews itself daily.
We seed verifiable falsehoods inside held-out tasks. Workflows that fabricate trip the trap and land on the public Wall.
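One way a probe like this could work, sketched under assumptions: each held-out task carries planted canary "facts" that appear in no source document, and any output that asserts one is flagged as fabrication. The names, data, and plain substring check below are illustrative; a production check might use the LLM judge instead.

```python
# Trap-street sketch: canaries are planted falsehoods no grounded workflow
# should ever assert. All names and data below are illustrative.
from dataclasses import dataclass


@dataclass
class TrapProbe:
    task_id: str
    canaries: list[str]  # planted falsehoods absent from every source document


def tripped_canaries(probe: TrapProbe, output: str) -> list[str]:
    """Return the planted falsehoods that the submission asserted anyway."""
    text = output.lower()
    return [c for c in probe.canaries if c.lower() in text]


probe = TrapProbe(
    task_id="heldout-0117",
    canaries=["acquired Initech in 2019", "CFO since March 2021"],
)

output = "The filing notes the company acquired Initech in 2019 ..."  # fabricated claim
hits = tripped_canaries(probe, output)
if hits:
    print(f"{probe.task_id}: fabrication detected -> {hits}")  # lands on the public Wall
```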
Our LLM judges are open. Their prompts are versioned. Agreement with human annotators is published. No black boxes.
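For example, the published agreement number could be a statistic such as Cohen's kappa over paired verdicts. A minimal sketch, assuming pass/fail labels on the same outputs and using scikit-learn's cohen_kappa_score; the labels below are made up.

```python
# Sketch: agreement between the LLM judge and human annotators on the same
# verdicts, reported as Cohen's kappa. The labels are illustrative only.
from sklearn.metrics import cohen_kappa_score

# Paired pass/fail verdicts for the same task outputs (1 = pass, 0 = fail).
human_labels = [1, 1, 0, 1, 0, 0, 1, 1]
judge_labels = [1, 1, 0, 1, 1, 0, 1, 0]

kappa = cohen_kappa_score(human_labels, judge_labels)
print(f"Judge vs. human agreement (Cohen's kappa): {kappa:.2f}")
```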
Find the fakes.
Built on Pydantic Evals + Langfuse. Open source. The community edition is free forever. The hosted edition is how we eat.