Trap Street manta
trapstreet.run
H4 for AI workflows
Open source · Apache-2.0 · Prototype build

In the 18th century, navigators at sea had no ground truth for longitude until John Harrison built H4, his marine chronometer. We're building the H4 for AI workflows.

10 tools evaluated · 200 real tasks · 42 fabrications caught
What we do, in four lines
We don't fine-tune models.
We test claims.
We run real tasks.
We expose what works, what fails, and what lies.
The four verbs

Define real tasks. Explore tasks. Submit a task. Run an eval.

01
Define real tasks

Real job descriptions (JDs), real résumés, real SEC filings: never synthetic, never gameable. Trap-street probes (planted, verifiable falsehoods) are seeded into the held-out set.

02
Explore tasks

Public tasks are open data. Anyone can browse what the world is being measured on. Held-out and Live Mode tasks remain private.

03
Submit a task

Three submission tiers: Bronze (CLI), Silver (audit-eligible API), Gold (we run it ourselves on held-out + Live Mode).

04
Run an eval

Pydantic Evals + LLM-as-judge wrapped in Langfuse traces. 200 tasks. 5 minutes. Scores published with full provenance.

Trust tiers

Every score wears its evidence on its sleeve.

The user's question is "how do I know this score is real?", not "how good is the tool?" The tier badge answers the first question. The score answers the second.

BRONZE
Self-reported by the builder
SILVER
Audited — 10–20% re-run on our infra
GOLD
Full eval on our infra + Live Mode
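
A rough sketch of how a Silver-style audit could work, assuming the audit means re-running a random 10 to 20% of tasks and comparing the result with the builder's self-reported scores. The function names, the 15% default, and the 0.05 tolerance are illustrative assumptions, not the project's actual harness.

```python
import random


def pick_audit_sample(task_ids, fraction=0.15, seed=0):
    """Pick a reproducible random subset of tasks to re-run (hypothetical helper)."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)  # seeded so the sample itself can be verified later
    k = max(1, round(len(task_ids) * fraction))
    return sorted(rng.sample(sorted(task_ids), k))


def audit(reported, rerun, tolerance=0.05):
    """Return task IDs whose self-reported score is off by more than `tolerance`."""
    return sorted(
        task_id
        for task_id, rerun_score in rerun.items()
        if abs(reported[task_id] - rerun_score) > tolerance
    )


# 200 tasks at 15% gives 30 tasks to re-run.
task_ids = [f"task-{i:03d}" for i in range(200)]
sample = pick_audit_sample(task_ids, fraction=0.15, seed=42)

# Made-up scores: the builder claimed 0.9 on task "a", the re-run measured 0.6.
mismatches = audit({"a": 0.9, "b": 0.5}, {"a": 0.6, "b": 0.5})
```

Seeding the sampler is a design choice here: it lets a third party confirm which tasks were audited without trusting the auditor's word for it.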
Why H4

Truth, not theories.

Reproducible

Every score is re-runnable. Anyone can clone the harness, replay our traces, and verify the verdict. That's the H4 standard.
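
One way to make "verify the verdict" concrete is a run fingerprint: hash the task set, the judge prompt version, and the scores, then compare the hash of a replay with the published one. This is a minimal sketch under that assumption; the field names and the rounding to four decimals are hypothetical, and it only applies to replays of recorded traces, where scores are expected to be identical.

```python
import hashlib
import json


def run_fingerprint(task_ids, judge_prompt_version, scores):
    """Hash the inputs and scores of a run into a stable hex digest (hypothetical helper)."""
    payload = {
        "tasks": sorted(task_ids),
        "judge_prompt_version": judge_prompt_version,
        # Round so that float formatting noise does not change the digest.
        "scores": {task_id: round(scores[task_id], 4) for task_id in sorted(scores)},
    }
    # Canonical JSON: sorted keys and fixed separators give byte-identical output.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Made-up example data: the same run described in a different order.
published = run_fingerprint(["t2", "t1"], "v3", {"t1": 0.8, "t2": 0.4})
replayed = run_fingerprint(["t1", "t2"], "v3", {"t2": 0.4, "t1": 0.8})
verdict_matches = published == replayed
```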

Live Mode

Today's SEC filings. Today's LinkedIn JDs. Tasks that did not exist in any model's training data. A moat that renews itself daily.

Trap streets

We seed verifiable falsehoods inside held-out tasks. Workflows that fabricate trip the trap and land on the public Wall.
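
The page does not say how a tripped trap is detected. As an illustration only, the sketch below assumes each probe carries marker phrases that should never appear in an honest answer and flags any output that contains one. `TrapProbe`, `tripped_probes`, and the certification name are all invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrapProbe:
    """A planted falsehood tied to one held-out task (hypothetical structure)."""
    task_id: str
    markers: tuple[str, ...]  # phrases an honest answer should never assert


def tripped_probes(outputs, probes):
    """Return the task IDs whose output contains any marker phrase."""
    tripped = []
    for probe in probes:
        text = outputs.get(probe.task_id, "").casefold()
        if any(marker.casefold() in text for marker in probe.markers):
            tripped.append(probe.task_id)
    return tripped


# "Zyphron Certified Architect" is a made-up credential used as the trap.
probes = [
    TrapProbe("resume-017", ("zyphron certified architect",)),
    TrapProbe("resume-018", ("zyphron certified architect",)),
]
outputs = {
    "resume-017": "The candidate holds the Zyphron Certified Architect credential.",
    "resume-018": "No matching certification was found in the résumé.",
}
caught = tripped_probes(outputs, probes)  # ['resume-017']
```

Plain substring matching is deliberately crude: it would also flag an answer that correctly calls the credential fake, so a real check would need to tell asserting a falsehood apart from flagging it.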

Transparent judges

Our LLM judges are open. Their prompts are versioned. Agreement with human annotators is published. No black boxes.
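
The page does not name the agreement metric. A common choice for categorical verdicts is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below assumes pass/fail labels and is not taken from the project's code.

```python
from collections import Counter


def cohens_kappa(judge_labels, human_labels):
    """Cohen's kappa between two raters over the same items."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(judge_labels)
    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    # Chance agreement: probability both pick the same label independently.
    judge_counts = Counter(judge_labels)
    human_counts = Counter(human_labels)
    expected = sum(judge_counts[label] * human_counts[label] for label in judge_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up verdicts for four tasks: the raters disagree on the second one.
judge = ["pass", "pass", "fail", "fail"]
human = ["pass", "fail", "fail", "fail"]
kappa = cohens_kappa(judge, human)  # observed 0.75, chance 0.5, kappa 0.5
```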

Find the fakes.

Built on Pydantic Evals + Langfuse. Open source. The community edition is free forever. The hosted edition is how we eat.