`sequant ready` gate — backtest recall & noise (Layer 2)
Issue: #683
Corpus: 27 captured fresh-session QA passes (.entire log study, 2026-05-30)
Method: Replay sequant ready against the pre-fix commit each fresh QA reviewed; measure recall vs. the 12 known would-ship defects and the false-positive / noise rate on clean cases.
Status: ⚠️ Methodology + corpus committed; empirical numbers pending the offline harness run. See “Execution status” below.
Question
Section titled “Question”The in-orchestrator QA under-catches: across 27 captured fresh-session passes (24 of them after an in-orchestrator QA had already passed the same work), 12/27 (44%) caught a real shipping bug or unmet AC. Does the automated sequant ready chain independently flag the same defect class — and at what noise cost? Per-policy numbers matter, because the ac default is meant to contain the noise/scope-creep that a-plus amplifies (the #608 risk).
Ground-truth corpus
Section titled “Ground-truth corpus”The 12 real-bug cases the fresh QA caught (the recall denominator):
| # | Defect class caught (in-run QA had missed it) |
|---|---|
| #421 | Feature non-functional for its primary multi-issue use case, after READY_FOR_MERGE |
| #503 | New programmatic API broken at launch |
| #467 | AC test passed vacuously — zero assertions fired |
| #529 | No implementation existed — empty branch, zero commits, reported success |
| #570 | No implementation existed — empty branch (null-verdict-as-success, #534) |
| #318 | Merge would have deleted the --reflect feature from main |
| #465 | (real-bug subset) |
| #484 | (real-bug subset) |
| #528 | (real-bug subset) |
| #554 | (real-bug subset) |
| #573 | (real-bug subset) |
| #625 | (real-bug subset) |
Clean / noise-control cases (the chain must NOT invent gaps or loop to budget): #677 plus a sample of cleanly-merged PRs that needed no fixes. Judgment-call quality gaps the team would accept-as-is or defer (#451, #559) are the a-plus over-fixing controls.
Metrics
Section titled “Metrics”For each corpus case:
- Recall (per policy). Check out the pre-fix commit the fresh QA reviewed, recreate its worktree, run
sequant ready <issue> --policy <ac|a-plus>, and record whether the chain independently flagged the same defect class.acshould hit high recall on the AC-miss subset (objective): the #529/#570 empty-branch cases must trip the #534NO_IMPLEMENTATIONguard; the #467 vacuous-assertion and #421 broken-use-case cases must surface asAC_NOT_MET.a-plusadds the quality-gap catches.- Target: ≥ 70% recall on the real-bug set before public release. Report
acvs.a-plusseparately so the policy boundary is measured, not assumed.
- False-positive / noise rate (per policy). On the clean cases, the chain must terminate at the threshold without inventing gaps or looping to
maxIterations/ budget. Measureacanda-plusseparately — higha-plusnoise validates keepingacas the default.
Method
Section titled “Method”Replay harness
Section titled “Replay harness”The mechanical parts are automated by scripts/analytics/ready-backtest.ts (#689). It reads the ground-truth corpus, derives each pre-fix SHA, sets up an issue-number-named worktree, runs sequant ready under both policies, captures the JSON, and emits the results table.
npm run build # ready-backtest invokes dist/bin/cli.jsnpx tsx scripts/analytics/ready-backtest.ts # DRY RUN: resolve SHAs + plan, no live passesnpx tsx scripts/analytics/ready-backtest.ts --run # execute the live ready passes (SLOW, token cost)npx tsx scripts/analytics/ready-backtest.ts --run --only 467,318 # subsetnpx tsx scripts/analytics/ready-backtest.ts --cleanup # remove backtest worktrees when donePer-case JSON is written to .sequant/backtest/<issue>-results.json; the run prints a heuristic recall figure. The committed recall number must come from a human reading those JSON files — the heuristic score is a starting point, not the verdict (scoring is non-deterministic LLM judgment, so this stays an offline committed report, not a CI gate — mirrors the #608 / #609 format).
Three methodology decisions (made explicit by the driver)
Section titled “Three methodology decisions (made explicit by the driver)”- Skill version —
--current-skills(default ON). Checking out a pre-fix commit also reverts.claude/skills/, so a naive run would test that commit’s old QA skill. The driver overlays the currentmainskill dirs onto the old product code, so the measurement answers “does today’sreadycatch this old bug.” Use--no-current-skillsto evaluate the historical skill instead. - Pre-fix SHA — auto-derived, confidence-flagged. The SHA is the parent of the squash-merge commit matching
(#<issue>)(<fix>~1). This is the #625-classgit log --grepfalse-positive zone, so the driver marks any non-scoped or multi-match derivation LOW-CONFIDENCE; those cases (in the current corpus: #503, #528, #625, and the #677 control) require a manualshaoverride in theCORPUSmanifest before their results are trusted. - Scoring — human-confirmed. The driver’s HIT/MISS is heuristic (
reason/finalVerdictvs. expected class). Confirm each by reading the captured JSON before filling the results table.
Scoring rubric
Section titled “Scoring rubric”| Outcome | Counts as |
|---|---|
reason: NO_IMPLEMENTATION on #529/#570 | recall hit (AC subset) |
AC_NOT_MET surfaced, remaining names the defect class | recall hit |
| Clean threshold exit on a control case, no invented gaps | true negative |
Looped to maxIterations / budget on a clean case | noise / false-positive |
⛔ Methodology blocker — the “checkout pre-fix, run ready” harness does not work (discovered 2026-06-01)
Section titled “⛔ Methodology blocker — the “checkout pre-fix, run ready” harness does not work (discovered 2026-06-01)”Operationalizing the harness via scripts/analytics/ready-backtest.ts --run surfaced that the “git worktree add <pre-fix-sha> then run ready” approach in the Replay-harness section above is invalid as written. Three concrete blockers, all verified by git, before any live pass:
- Full-weight QA’s stale-branch gate blocks every replay.
sequant readyruns full-weight QA (SEQUANT_FULL_QA=1, #683 AC-2), which runs the stale-branch check (the very check the in-run QA skips). Every corpus pre-fix commit is far behind today’smain— e.g. #467’s pre-fix commit is 171 commits behindorigin/main, vs. the defaultstaleBranchThreshold: 5. So QAexit 1s onSTALE_BRANCHbefore reviewing anything. Every case is an old commit ⇒ every case blocks. - A detached pre-fix checkout has no candidate diff to review.
git diff origin/main...<pre-fix>= 0 files (the pre-fix commit is an ancestor ofmain). The bug-era code is present but the change under review is not — QA reviews<base>...HEADand finds nothing. The fresh QA we’re trying to reproduce reviewed a PR diff, which a detached old-commit checkout does not reconstruct. - The clean control needs the post-fix state. #677 (noise control) must be replayed at its implemented state, not pre-fix — pre-fix is an empty baseline that can’t test “does the gate invent gaps on clean work.”
Deeper root cause: the backtest wants to re-present, to today’s gate, the buggy diff the historical fresh QA reviewed. For squash-merged / deleted PR branches that exact buggy state is often unrecoverable.
Candidate redesigns (a design decision, not just “run the script”)
Section titled “Candidate redesigns (a design decision, not just “run the script”)”| Approach | Mechanism | Trade-offs |
|---|---|---|
| A — base override | Add ready --base <era-sha>; replay the fix’s diff as a branch on the pre-fix base; point QA’s diff + stale checks at <era-sha> (not origin/main). | Faithful to history, but needs a new CLI/QA-base primitive (skill changes ×3) and the historical buggy diff (often unrecoverable for squash-merged branches). |
| B — revert-on-current-main (looks simpler) | Branch off current main, git revert --no-commit <fix> to reintroduce the bug as a candidate diff, run ready with the default base. | No new feature: behind main = 0 (no stale block) and the diff = the reintroduced bug. Tests “does today’s gate catch this defect class today” — arguably more relevant. Caveats: the revert may conflict after drift; it reviews a removal of the fix (which is itself the #318-class regression test); #534 empty-branch cases (#529/#570) still need their own empty-worktree setup. |
Approach B needs only driver changes (no ready feature); A is the more faithful but heavier primitive.
Approach B was tested — it also fails on this corpus (2026-06-01)
Section titled “Approach B was tested — it also fails on this corpus (2026-06-01)”A manual revert-on-main trial on #573 (a focused skill-prompt fix) conflicted immediately: git revert <fix> collided on skills/qa/SKILL.md (U unmerged), because that file has been edited many times since the fix. Surveying the corpus confirms this is systemic, not a one-off:
- 7 of the 10 derivable fixes touch high-churn files (
qa/SKILL.md,run.ts,skills/,batch-executor,phase-executor) — #421, #467, #318, #465, #484, #554, #573. Their reverts conflict against the drift. - The other 3 (#503, #528, #625) are the low-confidence cases whose
(#N)match is a docs commit — there is no code defect to revert in the first place.
So the corpus is squeezed from both sides: high-churn fixes don’t revert cleanly, and the clean-reverting remainder has no real defect. Both A and B are empirically non-viable for an automated run — A is blocked by the stale-gate + empty-diff, B by revert conflicts on actively-developed files. A valid run would require bespoke per-case reconstruction (hand-resolving each revert conflict, or recovering the historical buggy PR diff from .entire), which is archaeology, not a script.
Conclusion & recommendation
Section titled “Conclusion & recommendation”Automated Layer-2 backtest of this historical corpus is not cost-effective. The deterministic guarantees AC-7 actually cares about — the #534 empty-branch / null-verdict guard, the policy thresholds, and the budget/stagnation exits — are already pinned by ready-gate.test.ts in CI (see Falsifiability). The realistic empirical validation is Layer 3 (forward shadow dogfood): run sequant ready alongside the manual fresh /qa on the next ≥10 live issues, where the candidate diff exists by construction and no archaeology is needed. Recommend descoping AC-7’s “measured historical recall” to the Layer-3 forward cohort and treating this document as the record of why the backtest-against-history path was abandoned.
Execution status
Section titled “Execution status”⚠️ Automated historical backtest abandoned as infeasible (see above). Validation shifts to Layer-3 forward shadow dogfood (#689 AC-2). This document is retained as the methodology + corpus + the evidence that both replay approaches (A: stale-gate/empty-diff; B: revert conflicts on drift) fail for an automated run. The results table below will be filled from the Layer-3 forward cohort, not a historical run.
Results (to be filled by the offline run)
Section titled “Results (to be filled by the offline run)”| Policy | Real-bug recall | AC-subset recall | Noise rate (clean cases) |
|---|---|---|---|
ac | pending | pending | pending |
a-plus | pending | pending | pending |
Promotion gate (from the issue rollout plan): ≥ 70% recall on the real-bug set with acceptable noise → graduate ready from experimental to a documented default. If a-plus noise is high, that validates keeping ac as the default rather than blocking release.
Falsifiability
Section titled “Falsifiability”The deterministic plumbing this report depends on is covered in CI:
#534guard (empty branch / null verdict →NO_IMPLEMENTATION) —src/lib/workflow/ready-gate.test.ts.- Policy thresholds, stagnation (
LOOP_NO_DIFF), budget/iteration caps — same suite.
So a regression that silently reports an empty branch as ready (the #529/#570 / #534 class) fails CI independently of this offline backtest.