Built with Opus 4.7 · Anthropic Hackathon · 2026

falsify

Lock the threshold before the run.
Then prove it.

Pre-register the claim. Hash the spec. Run clean.
Tamper the spec after the fact — CI exits 3.

The discipline is optional. The hash is not.

◆ Loading Thesis ◆

Thesis

Most agents help you build faster.
This one helps you disprove faster.

You write down the claim and the failure criteria before you look at the data. The engine locks the spec with a hash, runs the declared experiment, and refuses to let you rewrite the threshold after seeing the result.

Theorem · Q.E.D.

Theorem

Givenan AI agent producing a numeric claim about model performance.

Claimthe result is falsifiable iff the threshold was registered before the evaluation.

Proof

[I]

Declare. Lock.

Spec is required to include an executable test plan, a metric, and a pre-registered threshold. Vague specs are rejected at lock time. Once locked, the canonical YAML is hashed — any edit produces a new hash and must be justified as an amendment.

[II]

Run. Frozen.

The engine does not invent experiments. It calls the script you declared, in the environment you declared, and verifies the spec hash has not drifted between lock and run.

[III]

Verdict. Numeric.

A numeric comparison against the locked threshold. No LLM judgment, no natural-language hedging. Exit code 0 for PASS, 10 for FAIL, 3 for tampered — so CI can gate on it.

[IV]

Guard. Blocks.

A git commit-msg hook scans all logged verdicts and blocks commits, README edits, or marketing copy that contradict the recorded result.

∎

The threshold you set is the one you answer to.Q.E.D. · 04fa1689ac55

◆ Initializing System ◆

The System

Init. Lock. Run.
Verdict. Guard.

The claim becomes the spec. The spec becomes the hash. The hash becomes the verdict. The order is the contract.

falsify · JUJU Hypothesis A · v2

Live

Spec Ledger · falsify list4 Locked

claude_surface · dogfoodPASS · 10

cli_startup · dogfoodPASS · 58ms

test_coverage_count · dogfoodPASS · 519

juju · case_studyFAIL · 0.214

honesty −−scope dogfood1.00

Tamper detection · change the valueINTACT

threshold: above

spec_hash: 04fa1689ac55

Speccanonical YAML · SHA-256 hashed before run

Verdictnumeric compare · no LLM judgment

Guardgit commit-msg hook · blocks contradictions

Surface5 skills · 2 subagents · 3 commands · 1 MCP

exit 0PASSverdict satisfies the locked threshold

exit 10FAILverdict against the locked threshold; ship as recorded

exit 3TAMPEREDspec edited after lock · CI refuses to ship

◆ Scanning Evidence ◆

Evidence

The number changed.
The locked spec caught it.

JUJU LAB is my prediction market research system. In March 2026 its operating note recorded Hypothesis A as FAIL at -0.0139. I repositioned JUJU honestly — from “trading system” to “ranking framework.” For the hackathon, I re-ran the same hypothesis through falsify against the current ledger. The number came out materially different. The engine surfaced a disagreement the old notebook could not.

v2 event-level delta — eligible vs demoted cohort

+0.0000

Positive magnitude, small sample. The verdict is PASS against a pre-registered threshold of zero, but the engine refuses to let any “alpha confirmed” claim ship until the minimum sample size is met.

127 resolved events across the locked parent cohort. 16 eligible events, 19 demoted. Two locked specs (v1 March-frozen, v2 April-live), one contradictory historical claim. Reconciliation before marketing, not after.

Locked spec · hash 04fa1689ac55

v1 Mar Delta

+0.4023

Frozen corpus: outcome_ts < 2026-03-28. PASS against > 0 threshold.

v2 Apr Delta

+0.3463

Live corpus: full resolved ledger as of run. PASS, unreconciled with op-note.

Op-Note Mar

−0.0139

JUJU’s original March verdict. Engine cannot reproduce — methodology drift exposed.

◆ Access Protocol ◆

Source · MIT · Public Repo

v0.1.0

Your past self isthe one your futureself answers to.

It does not tell you if you’re right. It tells you if you changed the rules.

$ pip install falsify

single-file CLI · no daemon · no account

View on GitHub → 90s demo — lock, run, watch the guard block a contradiction →

Most agents help you build faster.This one helps you disprove faster.

Init. Lock. Run.Verdict. Guard.

The number changed.The locked spec caught it.

Your past self isthe one your futureself answers to.

Most agents help you build faster.
This one helps you disprove faster.

Init. Lock. Run.
Verdict. Guard.

The number changed.
The locked spec caught it.