Citation-recall coverage hardening — design
- Date: 2026-06-07
- Status: Accepted (autonomous session; research-grounded, no approval gate)
- ADR: 0022
Problem
Live workup runs produce excellent analytic notes but exit 1 on the citation
gate. Two real runs on this machine (synthetic/Halberd, enron/vince.kaminski) each
failed recall with uncited claims. Inspecting the actual `citations.json` and
`note.md` artifacts shows two distinct failure modes, not one:
- Genuine coverage gaps (6 of 7 flagged claims). The agent's synthesis / ACH
  sentences (the decisive-finding restatement, the "Favoring H1…" verdict,
  trailing corroboration judgments) land in their own paragraphs or as flat
  trailing assertions without carrying the `[cite:gN]` of their basis. The
  evidence is cited in the supporting bullets just above; the synthesis sentence
  that rests on it is not. The skill already instructs against this (since
  `547bc81`, 2026-06-04), yet the agent still slips on secondary synthesis lines.
  A prompt-only nudge is demonstrably insufficient.
- A gate false-positive (1 of 7). The enron fragment "self-attributed forwards —
  and should not be read as 953 distinct inbound correspondents" is fully cited
  (`[cite:g26]` earlier in the sentence). The recall gate's naive sentence
  splitter (`(?<=[.!?])\s+`) breaks on the period inside "i.e." and orphans the
  tail as a pseudo-sentence after the last citation. Analytic notes are
  abbreviation-rich (`U.S.`, `e.g.`, `cf.`, decimals, titles), so this class
  will recur.
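The false-positive can be reproduced in isolation. The following is a minimal sketch, not the gate's actual code: the regex is the one quoted above, but the sentence is an invented stand-in for the enron line (only its tail paraphrases the real fragment), and "uncited" is simplified to "segment contains no `[cite:gN]` marker".

```python
import re

# The naive splitter quoted above: break on whitespace that follows ., ! or ?
NAIVE_SPLIT_RE = re.compile(r"(?<=[.!?])\s+")
CITE_RE = re.compile(r"\[cite:g\d+\]")

# Illustrative sentence; the wording before "i.e." is made up for this demo.
sentence = (
    "The 953 senders are mostly list traffic [cite:g26], i.e. "
    "self-attributed forwards, and should not be read as 953 distinct "
    "inbound correspondents."
)

# The space after "i.e." follows a period, so the regex splits there.
segments = NAIVE_SPLIT_RE.split(sentence)

# Simplified recall check: a segment with no cite marker counts as uncited.
uncited = [s for s in segments if not CITE_RE.search(s)]

for s in uncited:
    print("flagged:", s)
```

The single cited sentence comes back as two segments, and the tail after "i.e." is flagged even though its citation sits a few words earlier.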
Research (2026-06)
- Generation-Time vs. Post-hoc Citation ([arXiv:2509.21557]): G-Cite (answer +
  citations in one pass) prioritizes precision at the cost of coverage; P-Cite
  (add/verify citations after drafting) achieves high coverage with competitive
  correctness and moderate latency. Recommendation: P-Cite-first for high-stakes
  applications. Ariadne is currently pure G-Cite, and our live runs show the
  exact predicted signature: perfect precision (`dangling: []` in both runs),
  failing coverage. We are sitting on the documented weak spot of the approach
  we chose.
- Multi-Stage Self-Verification ([arXiv:2509.05741]): a refinement stage where,
  "for each key factual statement, the LLM inserts corresponding citation
  sources from verified evidence." This is P-Cite as a refinement pass, which is
  our exact shape.
- Self-Refine ([arXiv:2303.17651]) and the broader self-correction literature:
  "specificity collapses over iterations as models increasingly fail to
  recognize correct answers". Naive LLM self-judgment loops degrade. Mitigation:
  bound the iterations and use a deterministic terminator, not LLM self-judgment.
  Ariadne already has a deterministic terminator (`find_uncited_claims`), so the
  loop is gated by code, not by the model second-guessing itself.
- PySBD ([arXiv:2010.09657]): rule-based, abbreviation-aware sentence boundary
  disambiguation; zero runtime dependencies, offline, deterministic; 97.92% on
  the Golden Rule Set (+25% over the next-best pure-Python tool). Preserves the
  gate's hermetic property. Reinventing it with a curated regex is the
  "expedient patch" the ROADMAP says to avoid.
Design
Two contained changes, shipped together because the live runs revealed both.
A. Abbreviation-robust segmentation (fixes the false-positive class)
Swap the naive `_SENTENCE_SPLIT_RE.split(line)` inside
`citations._iter_claim_segments` for
`pysbd.Segmenter(language="en", clean=False).segment(line)` (one module-level
segmenter, reused). Everything downstream is unchanged: bullet/numbered-marker
stripping still happens first, and the `last_cited` / `is_judgment` / caveat
logic still operates per segment. This fixes both recall
(`find_uncited_claims`) and entailment (`find_unsupported_claims`), which share
the same iterator. `clean=False` preserves the `[cite:gN]` markers and original
spans verbatim.
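One way to picture the swap is an iterator that takes the segmenter as a parameter, so the pysbd-backed version and a test fake are interchangeable. This is a sketch under assumptions: `iter_claim_segments`, `make_pysbd_segmenter`, and the list-marker regex are hypothetical stand-ins, since the real `_iter_claim_segments` body is not shown here. Only the `pysbd.Segmenter(language="en", clean=False).segment(...)` call is taken from the design.

```python
import re
from typing import Callable, Iterator, List

# Hypothetical marker pattern: "-", "*", "+", "1." or "1)" followed by whitespace.
_LIST_MARKER_RE = re.compile(r"^\s*(?:[-*+]|\d+[.)])\s+")

Segmenter = Callable[[str], List[str]]


def make_pysbd_segmenter() -> Segmenter:
    """Build the pysbd-backed segmenter once; callers reuse the returned function."""
    import pysbd  # third-party; imported lazily so the rest of the sketch runs without it

    return pysbd.Segmenter(language="en", clean=False).segment


def iter_claim_segments(line: str, segment: Segmenter) -> Iterator[str]:
    """Strip a leading bullet/numbered marker, then yield non-empty sentences."""
    body = _LIST_MARKER_RE.sub("", line, count=1)
    for sentence in segment(body):
        sentence = sentence.strip()
        if sentence:
            yield sentence
```

In production the caller would pass `make_pysbd_segmenter()` once at module load; in unit tests any function from `str` to `list[str]` can stand in for it.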
B. Gate-driven P-Cite repair loop (fixes the coverage gaps)
After the agent emits its G-Cite draft, run a bounded, deterministic-gate-driven post-hoc citation pass:

    note = <agent draft>                      # G-Cite, precision-first (unchanged)
    report = validate_citations(note, ledger)
    for _ in range(MAX_REPAIR_PASSES):        # bounded (default 2)
        if report.ok or not report.uncited:   # deterministic terminator
            break
        note = await repair_citations(note, ledger, report.uncited, call_llm=…)
        report = validate_citations(note, ledger)
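The loop's two exits (gate satisfied, or pass budget exhausted) can be exercised with toy stand-ins. This is a runnable sketch under assumptions, not Ariadne's code: the `Report` class, the one-line-per-claim `validate_citations`, and both fake repair functions are hypothetical, and the ledger argument is dropped to keep the example short.

```python
import asyncio
import re
from dataclasses import dataclass, field
from typing import Awaitable, Callable, List, Tuple

MAX_REPAIR_PASSES = 2  # default named in the design


@dataclass
class Report:
    uncited: List[str] = field(default_factory=list)

    @property
    def ok(self) -> bool:
        return not self.uncited


def validate_citations(note: str) -> Report:
    """Toy gate: every non-empty line lacking a [cite:gN] marker is uncited."""
    lines = [ln.strip() for ln in note.splitlines() if ln.strip()]
    return Report([ln for ln in lines if not re.search(r"\[cite:g\d+\]", ln)])


RepairFn = Callable[[str, List[str]], Awaitable[str]]


async def gate_repair_loop(note: str, repair: RepairFn) -> Tuple[str, Report, int]:
    report = validate_citations(note)
    passes = 0
    for _ in range(MAX_REPAIR_PASSES):       # bounded
        if report.ok or not report.uncited:  # the gate decides, not the model
            break
        note = await repair(note, report.uncited)
        report = validate_citations(note)
        passes += 1
    return note, report, passes


async def fake_repair(note: str, uncited: List[str]) -> str:
    """Fake that appends a cite to each flagged line."""
    for sentence in uncited:
        note = note.replace(sentence, sentence + " [cite:g1]")
    return note


async def stubborn_repair(note: str, uncited: List[str]) -> str:
    """Fake that never fixes anything."""
    return note


draft = "Evidence bullet [cite:g1].\nFavoring H1 on balance."
fixed_note, fixed_report, fixed_passes = asyncio.run(gate_repair_loop(draft, fake_repair))
stuck_note, stuck_report, stuck_passes = asyncio.run(gate_repair_loop(draft, stubborn_repair))
```

With the cooperative fake the gate passes after one repair call; with the stubborn fake the loop stops after `MAX_REPAIR_PASSES` and the claim stays flagged, which is the graceful exit-1 path.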
- New module `src/ariadne/provenance/repair.py`:
  - `build_repair_prompt(note, ledger_entries, uncited) -> str`: pure,
    unit-testable. Gives the model the full draft (so it sees where each `gN`
    was already cited in the supporting bullets), the ledger as
    `gN → tool_input + response_excerpt`, and the flagged sentences, with one
    rule: for each flagged sentence, append the `[cite:gN]`(s) from the ledger
    whose evidence supports it (typically the cites already on the bullets it
    summarizes); if no ledger entry supports it, soften it to a calibrated
    ICD-203 judgment or drop it. Change nothing else; never invent a `gN`
    absent from the ledger.
  - `async def repair_citations(note, ledger, uncited, *, call_llm) -> str`:
    builds the prompt, calls the injected `call_llm`, returns the revised note.
    `call_llm` is injected so the core is hermetically testable with a fake
    (mirrors the existing `EntailmentVerifier` / predicate injection style).
- `cli.py` wiring:
  - `build_repair_options(model)`: a tool-less `ClaudeAgentOptions` (no MCP
    servers, no `entity-workup` skill, repair-specific system prompt) so the
    pass cannot retrieve more or wander; it only rewrites text.
  - Production `call_llm` wraps the SDK `query(...)` with those options at the
    run's profile `model`; assembles the returned `TextBlock`s.
  - `run_workup` gains a `repair: bool = True` param; the loop sits between
    `validate_citations` and `write_outputs`. The persisted `note.md`, `report`,
    metrics, and manifest reflect the post-repair note.
  - CLI flag `--repair / --no-repair` (default repair on). `--no-repair` is an
    eval lever: it measures raw G-Cite coverage vs. the repaired result, which
    is exactly the citation-recall metric the eval roadmap wants.
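The split between a pure prompt builder and a thin async wrapper with an injected `call_llm` might look like the sketch below. The prompt wording, the section labels, and the choice to pass the ledger as a pre-rendered `gN -> text` mapping are assumptions made for illustration; only the function names, the argument order, and the repair rule come from the design above.

```python
import asyncio
from typing import Awaitable, Callable, List, Mapping, Sequence

CallLLM = Callable[[str], Awaitable[str]]

# Paraphrase of the single repair rule stated in the design.
RULE = (
    "For each flagged sentence, append the [cite:gN] marker(s) from the ledger "
    "whose evidence supports it. If no ledger entry supports it, soften it to a "
    "calibrated ICD-203 judgment or drop it. Change nothing else. "
    "Never invent a gN absent from the ledger."
)


def build_repair_prompt(
    note: str, ledger_entries: Mapping[str, str], uncited: Sequence[str]
) -> str:
    """Pure function: same inputs always render the same prompt."""
    ledger = "\n".join(f"{gid}: {text}" for gid, text in ledger_entries.items())
    flagged = "\n".join(f"- {sentence}" for sentence in uncited)
    return f"{RULE}\n\nDRAFT:\n{note}\n\nLEDGER:\n{ledger}\n\nFLAGGED:\n{flagged}\n"


async def repair_citations(
    note: str,
    ledger_entries: Mapping[str, str],
    uncited: Sequence[str],
    *,
    call_llm: CallLLM,
) -> str:
    """One post-hoc pass: build the prompt, delegate to the injected model call."""
    return await call_llm(build_repair_prompt(note, ledger_entries, uncited))


# Hermetic usage: a fake call_llm records the prompt and returns a canned rewrite.
captured: List[str] = []


async def fake_call_llm(prompt: str) -> str:
    captured.append(prompt)
    return "Favoring H1 on balance [cite:g3]."


revised = asyncio.run(
    repair_citations(
        "Favoring H1 on balance.",
        {"g3": "search(q='H1') -> excerpt"},
        ["Favoring H1 on balance."],
        call_llm=fake_call_llm,
    )
)
```

Because the model call is a parameter, the same `repair_citations` runs against the fake in tests and against the SDK-backed callable in production without modification.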
Why a loop terminated by the deterministic gate (not pure prompt, not LLM self-judge)
- Pure prompt (status quo): tried since 2026-06-04, fails live. G-Cite's coverage ceiling is structural per the research.
- LLM self-judgment loop: degrades over iterations (Self-Refine finding).
- Gate-driven P-Cite loop: the deterministic `find_uncited_claims` decides when
  to stop, so there is no self-judgment degradation; the LLM only attaches cites
  it can ground in the ledger. Bounded passes cap latency/cost; the pass only
  fires when a draft actually has gaps (conditional cost). This is the
  high-stakes-appropriate P-Cite-first recommendation, adapted to the
  deterministic gate we already own.
Components & boundaries

| Unit | Responsibility | Depends on |
|---|---|---|
| `citations._iter_claim_segments` | segment note into claim sentences (now pysbd) | pysbd |
| `repair.build_repair_prompt` | render the P-Cite repair instruction (pure) | none |
| `repair.repair_citations` | one post-hoc pass via injected `call_llm` | ledger, `call_llm` |
| `cli.run_workup` loop | bounded gate→repair→re-gate; persist repaired note | both above |
| `cli.build_repair_options` / `call_llm` | tool-less SDK call at profile model | claude-agent-sdk |
Data flow
agent draft (G-Cite) → validate_citations → if uncited & repair on →
build_repair_prompt → call_llm → revised note → validate_citations → loop ≤2 →
final report → write_outputs + manifest + workup_exit_code.
The provenance ledger is never mutated by repair (no new tool calls); the pass
only attaches existing gN or softens prose, so governance (read-only audit) and
dangling-cite precision are unaffected, and validate_citations re-checks dangling
on every pass.
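The invariant that repair may only reuse existing `gN` identifiers is what the dangling re-check enforces. A minimal sketch of such a check follows, assuming cites have the `[cite:gN]` form used throughout this document; `find_dangling` is a hypothetical name, and the real `validate_citations` may compute this differently.

```python
import re
from typing import Iterable, List

CITE_RE = re.compile(r"\[cite:(g\d+)\]")


def find_dangling(note: str, ledger_ids: Iterable[str]) -> List[str]:
    """Return cited ids that are absent from the ledger, in first-seen order."""
    known = set(ledger_ids)
    dangling: List[str] = []
    for gid in CITE_RE.findall(note):
        if gid not in known and gid not in dangling:
            dangling.append(gid)
    return dangling


ledger_ids = ["g1", "g2"]

# A repair that reuses existing ids leaves precision untouched.
good_repair = "Bullet [cite:g1]. Verdict rests on both [cite:g1][cite:g2]."

# A repair that invents an id is caught on the next gate pass.
bad_repair = "Bullet [cite:g1]. Verdict [cite:g9]."

good_dangling = find_dangling(good_repair, ledger_ids)
bad_dangling = find_dangling(bad_repair, ledger_ids)
```

Running the check after every pass means an invented identifier surfaces as a gate failure rather than as a silently persisted citation.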
Testing (TDD, hermetic)
- Segmentation (red→green): new `test_citations` cases. A fully-cited sentence
  containing `i.e.` / `e.g.` / `U.S.` / a decimal is not flagged uncited
  (currently red under the regex; green after pysbd). Keep all existing
  recall/entailment tests green.
- `build_repair_prompt` (pure): asserts the prompt contains the flagged
  sentences, the ledger `gN` + excerpts, and the "only existing `gN` / soften
  otherwise" rule.
- Repair loop (hermetic): a fake `call_llm` that appends the correct
  `[cite:gN]` → `validate_citations` becomes `ok` within ≤2 passes; a fake that
  cannot fix → loop exits after `MAX_REPAIR_PASSES` with the claim still flagged
  (graceful exit 1, no infinite loop). The `--no-repair` path leaves the draft
  untouched.
- Whole-repo `make lint` + full unit/smoke suite green before any live run.
Scope / YAGNI
- In: segmentation robustness; the P-Cite repair loop; the `--repair` eval lever.
- Out (YAGNI): constrained/structured decoding; a claim-assertion tool protocol;
  Proof-Carrying-Numbers tokens for numeric claims ([arXiv:2509.06902]; the
  general repair already covers numeric facts); repairing
  `dangling`/`unsupported` (different failure modes, empirically absent live;
  repair targets `uncited` only).
- YAGNI exception justification: the repair loop is not speculative. It is the
  research-backed fix for a demonstrated, twice-reproduced failure that the
  cheaper prompt-only fix already failed to resolve.
Sources
- Generation-Time vs. Post-hoc Citation — https://arxiv.org/abs/2509.21557
- Multi-Stage Self-Verification (citation refinement) — https://arxiv.org/pdf/2509.05741
- Self-Refine: Iterative Refinement with Self-Feedback — https://arxiv.org/abs/2303.17651
- PySBD: Pragmatic Sentence Boundary Disambiguation — https://arxiv.org/abs/2010.09657
- Proof-Carrying Numbers (considered, rejected) — https://arxiv.org/pdf/2509.06902
- ALCE citation precision/recall (existing basis) — https://arxiv.org/abs/2305.14627