Skip to content

Chain-Mode Success Rate Analysis

Date: 2026-05-09 Issue: #604 Source data: .sequant/metrics.json (62 runs), .sequant/logs/run-*.json (96 logs)

CohortRunsSuccessRate
Single-issue (any phases)171376.5%
Single-issue (spec+exec+qa)151280.0%
--chain (all lengths)7228.6%
Multi-issue, no --chain382052.6%

The 29% headline from npx sequant stats reproduces (28.6% in raw data). Gap vs single-issue holds.

Note on the 92% vs 80% gap. The issue body cites npx sequant stats reporting 92% single-issue success at the time it was filed (56 runs in the local sample). This analysis runs on a larger 62-run snapshot of .sequant/metrics.json and the doc’s 80% applies the explicit phases ⊇ {spec, exec, qa} filter. sequant stats itself counts every run with issues.length === 1 (any phase set) — see src/commands/stats.ts:422. The headline 76.5% / 80% / 28.6% / 52.6% rates here are reproducible from the raw data; the 92% is a snapshot-in-time figure that does not reproduce against the current data and is treated as a directional, not load-bearing, baseline. (Independent recount of the spec+exec+qa cohort yields 11/14 = 78.6% with a strict --chain exclusion; the doc’s 12/15 = 80.0% filter includes one borderline single-issue run that may also have ridden in a chain — within rounding either way.)

7 chain runs over 4 months (2026-02-18 → 2026-04-26). Small but every failure is forensically traceable through .sequant/logs/.

LengthRunsSuccessRate
211100% (anecdotal)
34125%
4200%

Length-3+ aggregate: 1/6 = 17%. Failure rate scales with chain length.

“Success” filter. Throughout this doc and the runtime annotation in assess-collision-detect.ts, “success” means outcome === "success" in .sequant/metrics.json. The 2 length-3 chains marked partial (some issues completed before the chain halted) are counted with the failures here, not with the successes. Under a partial-counts-as-success framing the length-3+ rate would be 3/6 = 50% rather than 1/6 = 17%; we use the strict definition consistently so the headline (28.6%), the segmentation table, and the runtime annotation all align.

The 28.6% whole-chain rate masks a bookkeeping artifact: when issue 1 in a chain fails, issues 2..N never execute and are counted as “chain failed.”

ViewNumerator / DenominatorRate
Whole-chain success2 / 728.6%
Per-issue success (counting skipped as failed)8 / 2236.4%
Per-issue success (only issues actually attempted)8 / 1361.5%

The per-issue success rate when chain mode actually reaches the issue (61.5%) is much closer to the single-issue baseline (80%) than the whole-chain headline suggests. Chain mode is not 3× worse at solving issues — its halt-on-first-failure semantics inflate the failure count.

CohortAvg duration
Single-issue (success)21.9 min
Chain (all)78.9 min
Failed chains54-228 min

Long wall-clock duration is the mechanism behind failure mode #2 below.

Each failed/partial chain traced through its log file:

Run (id, date)IssuesLengthOutcomeStopped atRoot cause
f317530c · 2026-03-23388,389,391,3944failed#388Claude Code 5h rate limit hit during quality-loop iteration after AC_NOT_MET
058048af · 2026-03-25438,439,4403failed#438Claude Code rate limit on first exec → 1800s timeout cascade
e7d3212d · 2026-04-06484,483,485,4864failed#484Quality-loop saturated (3× AC_NOT_MET on issue 1)
b0349697 · 2026-04-26554,555,5563partial#555QA phase API idle timeouts (1800s × 2) on second issue
c20fcde6 · 2026-03-12254,304,3073partial#307AC_MET_BUT_NOT_A_PLUS treated as failure → retry loop → process exit 1
#ModeCountChain-specific?
1First-issue retry exhaustion halts entire chain3/5Amplified by chain semantics — same root causes (AC_NOT_MET saturation, rate limit) would fail one issue in single-issue mode but kill all downstream issues here
2Long wall-clock chain run trips Claude 5h rate limit2/5Structural — chain duration scales with length, raising rate-limit risk
3Pre-existing per-issue failure modes (API timeouts, AC_MET_BUT_NOT_A_PLUS retry-loop bug)2/5No — would fail in single-issue runs too

Of 5 failed/partial runs, 0 have a root cause that is uniquely chain-mode-induced. All causes (rate limits, QA saturation, API timeouts, the AC_MET_BUT_NOT_A_PLUS bug) also appear in single-issue runs. Chain mode’s contribution is amplification, not new failure modes:

  • A first-issue failure marks all downstream issues as failed (semantic amplification).
  • Long chains accumulate wall-clock time, raising Claude rate-limit probability (mechanical amplification).

Verdict: RESTRICT chain to length=2; warn for length≥3.

Rationale:

  • Sample (n=7) is too small to deprecate outright.
  • Length-2 success (1/1) is consistent with single-issue baseline; length-3+ (1/6 = 17%) is not.
  • Per-issue success (61.5%) shows chain mode itself is not structurally broken at solving issues — its accounting and duration scaling are.
  • Underlying failure modes (rate limits, AC_MET_BUT_NOT_A_PLUS retry loop) are tracked separately and may be fixed without revisiting chain mode.

/assess collision-detect change (implemented)

Section titled “/assess collision-detect change (implemented)”

Existing rule at src/lib/assess-collision-detect.ts:267 (length-≥3 emit guard) is preserved. The chain suggestion now appends a historical-rate annotation:

Chain: npx sequant run A B C --chain --qa-gate -Q # alternative — N issues modify <path> (chain length≥3 historically 1/6 = 17%; see docs/reference/chain-mode-analysis-2026-05.md)

The annotation route was chosen over a threshold flip (which would have suppressed the suggestion entirely at length≥3) because n=7 is small enough that hard-restricting based on it would over-fit the sample; surfacing the rate inline lets the user weigh chain mode against the parallel-mode default without removing the option. The three skill mirrors of predicted-collision-detection.md (.claude/skills/, templates/skills/, skills/) carry the same wording in lockstep, and src/lib/__tests__/assess-collision-detect.test.ts asserts both annotation substrings on every length≥3 collision. Re-evaluate (and consider flipping to a hard length === 2 threshold) when n≥20 chain runs accumulate.

  • The AC_MET_BUT_NOT_A_PLUS retry-loop bug is documented in feedback_qa_break_vs_loop.md and is a /qa skill defect, not a chain-mode defect.
  • Claude Code rate-limit handling (resume after limit reset) is a separate orchestrator concern.
  • docs/features/parallel-execution.md:126 previously listed chain rate as 50% n=4 from the #452 era; refreshed to 29% n=7 in this PR.
  • n=7 chain runs. Verdict is consistent with the data but should be revisited at n≥20.
  • Selection bias. Chains were chosen by the user for issues judged collision-prone — they may be inherently harder than the single-issue baseline.
  • No A/B counterfactual. We don’t have parallel runs of the same issues in single vs chain mode.