FALSIFY · v0.1.0
computing sha-256
 
Lock the claim · Then run

Built with Opus 4.7 · Anthropic Hackathon · 2026

falsify

Lock the threshold before the run.
Then prove it.

Pre-register the claim. Hash the spec. Run clean.
Tamper the spec after the fact — CI exits 3.

The discipline is optional. The hash is not.

Loading Thesis
Thesis

Most agents help you build faster.
This one helps you disprove faster.

You write down the claim and the failure criteria before you look at the data. The engine locks the spec with a hash, runs the declared experiment, and refuses to let you rewrite the threshold after seeing the result.

Theorem · Q.E.D.
Theorem
Givenan AI agent producing a numeric claim about model performance.
Claimthe result is falsifiable iff the threshold was registered before the evaluation.
Proof
[I]
Declare. Lock.

Spec is required to include an executable test plan, a metric, and a pre-registered threshold. Vague specs are rejected at lock time. Once locked, the canonical YAML is hashed — any edit produces a new hash and must be justified as an amendment.

[II]
Run. Frozen.

The engine does not invent experiments. It calls the script you declared, in the environment you declared, and verifies the spec hash has not drifted between lock and run.

[III]
Verdict. Numeric.

A numeric comparison against the locked threshold. No LLM judgment, no natural-language hedging. Exit code 0 for PASS, 10 for FAIL, 3 for tampered — so CI can gate on it.

[IV]
Guard. Blocks.

A git commit-msg hook scans all logged verdicts and blocks commits, README edits, or marketing copy that contradict the recorded result.

The threshold you set is the one you answer to.Q.E.D. · 04fa1689ac55
Initializing System
The System

Init. Lock. Run.
Verdict. Guard.

The claim becomes the spec. The spec becomes the hash. The hash becomes the verdict. The order is the contract.

falsify · JUJU Hypothesis A · v2
Live
Spec Ledger · falsify list4 Locked
claude_surface · dogfoodPASS · 10
cli_startup · dogfoodPASS · 58ms
test_coverage_count · dogfoodPASS · 519
juju · case_studyFAIL · 0.214
honesty −−scope dogfood1.00
Tamper detection · change the valueINTACT
threshold: above
spec_hash: 04fa1689ac55
Speccanonical YAML · SHA-256 hashed before run
Verdictnumeric compare · no LLM judgment
Guardgit commit-msg hook · blocks contradictions
Surface5 skills · 2 subagents · 3 commands · 1 MCP
exit 0PASSverdict satisfies the locked threshold
exit 10FAILverdict against the locked threshold; ship as recorded
exit 3TAMPEREDspec edited after lock · CI refuses to ship
Scanning Evidence
Evidence

The number changed.
The locked spec caught it.

JUJU LAB is my prediction market research system. In March 2026 its operating note recorded Hypothesis A as FAIL at -0.0139. I repositioned JUJU honestly — from “trading system” to “ranking framework.” For the hackathon, I re-ran the same hypothesis through falsify against the current ledger. The number came out materially different. The engine surfaced a disagreement the old notebook could not.

v2 event-level delta — eligible vs demoted cohort
+0.0000

Positive magnitude, small sample. The verdict is PASS against a pre-registered threshold of zero, but the engine refuses to let any “alpha confirmed” claim ship until the minimum sample size is met.

127 resolved events across the locked parent cohort. 16 eligible events, 19 demoted. Two locked specs (v1 March-frozen, v2 April-live), one contradictory historical claim. Reconciliation before marketing, not after.

Locked spec · hash 04fa1689ac55
v1 Mar Delta
+0.4023

Frozen corpus: outcome_ts < 2026-03-28. PASS against > 0 threshold.

v2 Apr Delta
+0.3463

Live corpus: full resolved ledger as of run. PASS, unreconciled with op-note.

Op-Note Mar
−0.0139

JUJU’s original March verdict. Engine cannot reproduce — methodology drift exposed.

Access Protocol
Source · MIT · Public Repo
v0.1.0

Your past self isthe one your futureself answers to.

It does not tell you if you’re right. It tells you if you changed the rules.

$ pip install falsify

single-file CLI · no daemon · no account