We retracted our convergence rate. The original number was 85.7%. The correct number is 75.0%. The difference — 10.7 percentage points — came from double-counted findings in overlapping expert domains. When multiple specialists examine the same subsystem from different angles, their findings can overlap. Our original calculation counted each finding once per expert rather than once per unique issue. After deduplication, the convergence rate dropped. We published the correction within hours of detection.
The error was caught by cross-model verification — a core FDRP mechanism in which three independent LLMs analyse the same data separately. Claude Opus computed 85.7%. GPT-5.4 computed 75.0%. Gemini 3.1 flagged the discrepancy and traced it to the double-counting. Two-out-of-three agreement isolated the error. This is exactly what N≥3 cross-model verification is designed to catch: systematic errors that a single model cannot see in its own output. A single LLM computing a number has no independent check. Two LLMs might agree on the same wrong answer if the error is subtle enough. Three independent computations with disagreement detection create a quorum that forces investigation.
Why does this matter beyond a corrected percentage? Most AI systems optimise for appearing correct. Confidence is rewarded. Hedging is penalised. Users prefer definitive answers, and models learn to provide them. FDRP optimises for being correct, which is a fundamentally different objective. The distinction seems trivial when you are writing blog posts. It stops being trivial when you are building safety-critical plans for a CHF 147M particle physics facility. In that context, the 10.7 percentage point difference represents real engineering decisions: which subsystems need additional review, how many expert rounds to run, where to allocate verification resources. An inflated convergence rate would have led to premature sign-off on subsystems that still needed work.
We think of retractions as an intellectual immune system. A body that cannot mount an immune response looks healthy until it encounters a pathogen, and then it has no defences. A research system that cannot retract looks rigorous until it publishes an error, and then it has no correction mechanism. FDRP's RETRACTION method — one of 104 catalogued methods in the meta-ontology — formalises this: detect error, publish correction, update all downstream artifacts, verify propagation. The retraction is not an exception to the process. It is the process working correctly. A system that has never retracted anything has either never made an error (unlikely) or never caught one (concerning).
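The four RETRACTION steps can be sketched as a checklist that refuses to close until every step has run. This is an illustrative sketch only — the class and step names are hypothetical, not FDRP's real data model.

```python
# The four steps named in the RETRACTION method, in order.
STEPS = ("detect", "publish_correction", "update_downstream", "verify_propagation")

class RetractionRecord:
    """Tracks one retraction through all four RETRACTION steps."""

    def __init__(self, error: str):
        self.error = error
        self.done = {step: False for step in STEPS}

    def complete(self, step: str) -> None:
        if step not in self.done:
            raise ValueError(f"unknown step: {step}")
        self.done[step] = True

    @property
    def closed(self) -> bool:
        # A retraction only closes once every step has run; publishing
        # a correction without updating downstream artifacts leaves
        # stale numbers in circulation.
        return all(self.done.values())
```

The point of the checklist shape is that partial correction is worse than it looks: a published correction with stale downstream artifacts still propagates the error.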
Concretely, what changed? The convergence calculation now deduplicates findings before computing rates. Findings are grouped by the underlying issue they address, not by the expert who reported them. If three radiation shielding specialists all identify the same concrete thickness concern, that counts as one finding with three confirmations, not three separate findings. The fix is three lines of SQL. The lesson — always verify denominators in aggregation queries — is now a permanent rule (BIND-020: Grounding Gate on All Agent Output). Every number published by FDRP must trace back to a measurement source. Numbers without provenance are flagged as ungrounded by default.
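The post says the actual fix was three lines of SQL; the same before-and-after logic can be shown in Python with made-up data. The records and numbers below are illustrative only — they are not the real FDRP findings, and the schema `(expert, issue_id, converged)` is an assumption.

```python
from collections import defaultdict

# Hypothetical findings: three shielding specialists report the same
# concrete-thickness issue; one cryogenics finding has not converged.
findings = [
    ("shielding_a", "concrete_thickness", True),
    ("shielding_b", "concrete_thickness", True),
    ("shielding_c", "concrete_thickness", True),
    ("cryo_a", "vacuum_seal", False),
]

def naive_rate(records):
    # The original bug: one row per (expert, finding) inflates both
    # numerator and denominator wherever expert domains overlap.
    return sum(1 for *_, converged in records if converged) / len(records)

def deduped_rate(records):
    # The fix: group per-expert reports by the underlying issue first,
    # so one issue with three confirmations counts once.
    by_issue = defaultdict(list)
    for _expert, issue, converged in records:
        by_issue[issue].append(converged)
    return sum(1 for flags in by_issue.values() if all(flags)) / len(by_issue)
```

On this toy data the naive rate is 0.75 and the deduplicated rate is 0.50 — the per-expert counting systematically inflates the result, which is the same failure mode, in miniature, as the 85.7% figure.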
The retraction is recorded in git commit eb3dac0, visible on the timeline page, and referenced in the paper's correction notice. We chose to make it prominent rather than buried because the ability to self-correct is more valuable than the appearance of infallibility. If you are evaluating FDRP as a methodology, the retraction tells you more about its reliability than the corrected number does.