Chain-Mode Success Rate Analysis
Date: 2026-05-09
Issue: #604
Source data: .sequant/metrics.json (62 runs), .sequant/logs/run-*.json (96 logs)
Headline
Section titled “Headline”| Cohort | Runs | Success | Rate |
|---|---|---|---|
| Single-issue (any phases) | 17 | 13 | 76.5% |
| Single-issue (spec+exec+qa) | 15 | 12 | 80.0% |
--chain (all lengths) | 7 | 2 | 28.6% |
Multi-issue, no --chain | 38 | 20 | 52.6% |
The 29% headline from npx sequant stats reproduces (28.6% in raw data). Gap vs single-issue holds.
Note on the 92% vs 80% gap. The issue body cites
npx sequant statsreporting 92% single-issue success at the time it was filed (56 runs in the local sample). This analysis runs on a larger 62-run snapshot of.sequant/metrics.jsonand the doc’s 80% applies the explicitphases ⊇ {spec, exec, qa}filter.sequant statsitself counts every run withissues.length === 1(any phase set) — seesrc/commands/stats.ts:422. The headline 76.5% / 80% / 28.6% / 52.6% rates here are reproducible from the raw data; the 92% is a snapshot-in-time figure that does not reproduce against the current data and is treated as a directional, not load-bearing, baseline. (Independent recount of the spec+exec+qa cohort yields 11/14 = 78.6% with a strict--chainexclusion; the doc’s 12/15 = 80.0% filter includes one borderline single-issue run that may also have ridden in a chain — within rounding either way.)
Phase A — Signal Confirmation
Section titled “Phase A — Signal Confirmation”Sample size
Section titled “Sample size”7 chain runs over 4 months (2026-02-18 → 2026-04-26). Small but every failure is forensically traceable through .sequant/logs/.
Segmentation by chain length
Section titled “Segmentation by chain length”| Length | Runs | Success | Rate |
|---|---|---|---|
| 2 | 1 | 1 | 100% (anecdotal) |
| 3 | 4 | 1 | 25% |
| 4 | 2 | 0 | 0% |
Length-3+ aggregate: 1/6 = 17%. Failure rate scales with chain length.
“Success” filter. Throughout this doc and the runtime annotation in
assess-collision-detect.ts, “success” meansoutcome === "success"in.sequant/metrics.json. The 2 length-3 chains markedpartial(some issues completed before the chain halted) are counted with the failures here, not with the successes. Under a partial-counts-as-success framing the length-3+ rate would be 3/6 = 50% rather than 1/6 = 17%; we use the strict definition consistently so the headline (28.6%), the segmentation table, and the runtime annotation all align.
Per-issue attribution (counterfactual)
Section titled “Per-issue attribution (counterfactual)”The 28.6% whole-chain rate masks a bookkeeping artifact: when issue 1 in a chain fails, issues 2..N never execute and are counted as “chain failed.”
| View | Numerator / Denominator | Rate |
|---|---|---|
| Whole-chain success | 2 / 7 | 28.6% |
| Per-issue success (counting skipped as failed) | 8 / 22 | 36.4% |
| Per-issue success (only issues actually attempted) | 8 / 13 | 61.5% |
The per-issue success rate when chain mode actually reaches the issue (61.5%) is much closer to the single-issue baseline (80%) than the whole-chain headline suggests. Chain mode is not 3× worse at solving issues — its halt-on-first-failure semantics inflate the failure count.
Wall-clock duration
Section titled “Wall-clock duration”| Cohort | Avg duration |
|---|---|
| Single-issue (success) | 21.9 min |
| Chain (all) | 78.9 min |
| Failed chains | 54-228 min |
Long wall-clock duration is the mechanism behind failure mode #2 below.
Phase B — Failure Forensics
Section titled “Phase B — Failure Forensics”Each failed/partial chain traced through its log file:
| Run (id, date) | Issues | Length | Outcome | Stopped at | Root cause |
|---|---|---|---|---|---|
| f317530c · 2026-03-23 | 388,389,391,394 | 4 | failed | #388 | Claude Code 5h rate limit hit during quality-loop iteration after AC_NOT_MET |
| 058048af · 2026-03-25 | 438,439,440 | 3 | failed | #438 | Claude Code rate limit on first exec → 1800s timeout cascade |
| e7d3212d · 2026-04-06 | 484,483,485,486 | 4 | failed | #484 | Quality-loop saturated (3× AC_NOT_MET on issue 1) |
| b0349697 · 2026-04-26 | 554,555,556 | 3 | partial | #555 | QA phase API idle timeouts (1800s × 2) on second issue |
| c20fcde6 · 2026-03-12 | 254,304,307 | 3 | partial | #307 | AC_MET_BUT_NOT_A_PLUS treated as failure → retry loop → process exit 1 |
Top-3 failure modes
Section titled “Top-3 failure modes”| # | Mode | Count | Chain-specific? |
|---|---|---|---|
| 1 | First-issue retry exhaustion halts entire chain | 3/5 | Amplified by chain semantics — same root causes (AC_NOT_MET saturation, rate limit) would fail one issue in single-issue mode but kill all downstream issues here |
| 2 | Long wall-clock chain run trips Claude 5h rate limit | 2/5 | Structural — chain duration scales with length, raising rate-limit risk |
| 3 | Pre-existing per-issue failure modes (API timeouts, AC_MET_BUT_NOT_A_PLUS retry-loop bug) | 2/5 | No — would fail in single-issue runs too |
Chain-attributable rate
Section titled “Chain-attributable rate”Of 5 failed/partial runs, 0 have a root cause that is uniquely chain-mode-induced. All causes (rate limits, QA saturation, API timeouts, the AC_MET_BUT_NOT_A_PLUS bug) also appear in single-issue runs. Chain mode’s contribution is amplification, not new failure modes:
- A first-issue failure marks all downstream issues as failed (semantic amplification).
- Long chains accumulate wall-clock time, raising Claude rate-limit probability (mechanical amplification).
Phase C — Recommendation
Section titled “Phase C — Recommendation”Verdict: RESTRICT chain to length=2; warn for length≥3.
Rationale:
- Sample (n=7) is too small to deprecate outright.
- Length-2 success (1/1) is consistent with single-issue baseline; length-3+ (1/6 = 17%) is not.
- Per-issue success (61.5%) shows chain mode itself is not structurally broken at solving issues — its accounting and duration scaling are.
- Underlying failure modes (rate limits, AC_MET_BUT_NOT_A_PLUS retry loop) are tracked separately and may be fixed without revisiting chain mode.
/assess collision-detect change (implemented)
Section titled “/assess collision-detect change (implemented)”Existing rule at src/lib/assess-collision-detect.ts:267 (length-≥3 emit guard) is preserved. The chain suggestion now appends a historical-rate annotation:
Chain: npx sequant run A B C --chain --qa-gate -Q # alternative — N issues modify <path> (chain length≥3 historically 1/6 = 17%; see docs/reference/chain-mode-analysis-2026-05.md)The annotation route was chosen over a threshold flip (which would have suppressed the suggestion entirely at length≥3) because n=7 is small enough that hard-restricting based on it would over-fit the sample; surfacing the rate inline lets the user weigh chain mode against the parallel-mode default without removing the option. The three skill mirrors of predicted-collision-detection.md (.claude/skills/, templates/skills/, skills/) carry the same wording in lockstep, and src/lib/__tests__/assess-collision-detect.test.ts asserts both annotation substrings on every length≥3 collision. Re-evaluate (and consider flipping to a hard length === 2 threshold) when n≥20 chain runs accumulate.
Out of scope for this issue
Section titled “Out of scope for this issue”- The AC_MET_BUT_NOT_A_PLUS retry-loop bug is documented in
feedback_qa_break_vs_loop.mdand is a/qaskill defect, not a chain-mode defect. - Claude Code rate-limit handling (resume after limit reset) is a separate orchestrator concern.
docs/features/parallel-execution.md:126previously listed chain rate as50% n=4from the #452 era; refreshed to29% n=7in this PR.
Limitations
Section titled “Limitations”- n=7 chain runs. Verdict is consistent with the data but should be revisited at n≥20.
- Selection bias. Chains were chosen by the user for issues judged collision-prone — they may be inherently harder than the single-issue baseline.
- No A/B counterfactual. We don’t have parallel runs of the same issues in single vs chain mode.