Analytic Rigor & Evaluation: How Do You Know It Works?
Provenance. Synthesized 2026-06-02 from a deep-research pass on the brief's core challenge, specification & validation ("how do you know what works?") plus governance (quality, security, data integrity). Primary-source-backed; the harness's adversarial-verify stage was unreliable this run, so load-bearing claims were re-checked against their primary sources by hand. Decisions carry # research(2026-06): notes; unverified leads are flagged as such.
At a glance
The brief's core challenge is specification and validation: proving the analytic product is right when there is no single correct answer. Ariadne answers it on three threads:
- Citation groundedness: every claim is cited (recall) and entailed by its source (precision), checked by an NLI model.
- Tradecraft compliance: ICD-203/206 estimative language and structured analysis, enforced by a lint.
- Eval harness: planted-needle scoring plus an LLM rubric, so a weak model fails cleanly and a strong one can be certified.
Why this exists
Phase 1's citation gate proved only that the agent did not fabricate a source
(every [cite:gN] resolved to a ledger entry). It did not detect uncited
claims, so the SKILL.md guarantee that "a note with an uncited claim fails
validation" was unenforced. Citations are also recorded per query, not per
fact. This pass grounds the fix and the Phase 4 eval harness.
Thread 1, Citation groundedness (the gate)
The metric vocabulary is settled by ALCE (Gao et al., EMNLP 2023,
arXiv:2305.14627): # research(2026-06):
- Citation recall: is every claim supported by a cited source? → catches the uncited-claim hole.
- Citation precision: does each cited source actually entail the claim (not merely resolve as an id)? → catches the entailment hole.
Both are computed by an NLI/entailment model between claim and cited passage.
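The two metrics can be sketched in a few lines. This is a simplified illustration, not Ariadne's implementation or ALCE's reference scorer: the `Claim` type, the `entails` callable, and the toy substring check are assumptions made for the example. Precision here uses the strict per-citation reading; ALCE's own definition is more lenient toward citations that only support a claim in combination with others.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence

# Hypothetical signature: (premise, hypothesis) -> True if the premise entails it.
Entails = Callable[[str, str], bool]


@dataclass
class Claim:
    text: str
    passages: List[str] = field(default_factory=list)  # cited passage texts


def citation_recall(claims: Sequence[Claim], entails: Entails) -> float:
    """Share of claims entailed by the concatenation of their cited passages."""
    if not claims:
        return 1.0
    supported = sum(
        1 for c in claims if c.passages and entails(" ".join(c.passages), c.text)
    )
    return supported / len(claims)


def citation_precision(claims: Sequence[Claim], entails: Entails) -> float:
    """Share of individual citations whose passage alone entails its claim."""
    pairs = [(p, c.text) for c in claims for p in c.passages]
    if not pairs:
        return 1.0
    return sum(1 for p, h in pairs if entails(p, h)) / len(pairs)


def toy_entails(premise: str, hypothesis: str) -> bool:
    # Stand-in for an NLI model: case-insensitive substring match only.
    return hypothesis.lower() in premise.lower()


claims = [
    Claim("halberd is in wren", ["Report: Halberd is in Wren."]),
    Claim("x funds y"),  # uncited, so it lowers recall
    Claim("a met b", ["a met b yesterday", "unrelated"]),  # one bad citation
]
recall = citation_recall(claims, toy_entails)
precision = citation_precision(claims, toy_entails)
```

The uncited claim lowers recall without touching precision, and the irrelevant second citation lowers precision without touching recall, which is why the gate needs both numbers.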
Stage 1, coverage / recall ✅ SHIPPED (2026-06-02)
provenance/citations.find_uncited_claims + CitationReport.uncited; a note now
fails validation if any asserted claim is uncited. Hermetic, no new dependency.
Section-aware (Gaps & caveats and Provenance are exempt per the note template);
segment-granular (a trailing citation covers its bullet/paragraph, so only prose
after a segment's last citation is flagged; this matches how the agent
actually cites and avoids false-positiving the real Halberd bridge bullet).
Known limit: structural recall only. It does not yet check entailment, and a
sentence sandwiched before a trailing citation is assumed covered. Both gaps
are left to Stage 2.
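The segment rule is easier to see in code. The following is a minimal sketch of the idea, not the shipped `find_uncited_claims`: the function name `find_uncited_sketch`, the `#`-prefixed headings, and the `- `/`* ` bullet markers are assumptions about the note format.

```python
import re
from typing import List

CITE = re.compile(r"\[cite:g\d+\]")
# Sections the note template exempts from the coverage check.
EXEMPT = {"gaps & caveats", "provenance"}


def find_uncited_sketch(note: str) -> List[str]:
    """Return prose that follows the last citation in each segment.

    A segment is one bullet or one paragraph. A segment with no citation
    at all is returned whole.
    """
    uncited: List[str] = []
    current: List[str] = []
    section = ""

    def flush() -> None:
        text = " ".join(current).strip()
        current.clear()
        if not text or section in EXEMPT:
            return
        matches = list(CITE.finditer(text))
        tail = text[matches[-1].end():] if matches else text
        tail = tail.strip(" .;,")
        if tail:
            uncited.append(tail)

    for line in note.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):  # assumed heading syntax
            flush()
            section = stripped.lstrip("#").strip().lower()
        elif not stripped:  # a blank line ends a paragraph
            flush()
        elif stripped.startswith(("- ", "* ")):  # a bullet starts a new segment
            flush()
            current.append(stripped[2:])
        else:
            current.append(stripped)
    flush()
    return uncited


example_note = """## Findings

- Halberd is a member of the cell [cite:g11].
- Halberd co-locates with Wren [cite:g12]. This implies coordination.
- Wren funds the cell.

## Gaps & caveats

No financial records were available.
"""
flagged = find_uncited_sketch(example_note)
```

In the example, the first bullet passes, the second is flagged only for its trailing sentence, the third is flagged whole, and the exempt section is skipped.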
Stage 2, entailment / precision ✅ FRAMEWORK SHIPPED (2026-06-02)
EntailmentVerifier protocol + find_unsupported_claims + CitationReport.unsupported,
injected into validate_citations(note, ledger, verifier=...), optional, so the
default path stays hermetic (unit-tested via a fake verifier). The real
HHEMVerifier (Vectara HHEM-2.1-Open, lazy-imported) lives behind the optional
eval extra with a gated integration test. Only the cited portion of a segment
is entailment-checked (trailing uncited prose is Stage 1's job). Remaining:
validate HHEM on a hedged-claim set and, optionally, wire a CLI --entail flag.
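The injection pattern described above might look like the sketch below. Only the name `EntailmentVerifier` comes from this document; the `score` method, its return range, the `threshold` default, `FakeVerifier`, and `unsupported_claims` are hypothetical stand-ins chosen for illustration.

```python
from typing import Dict, Iterable, List, Optional, Protocol, Tuple


class EntailmentVerifier(Protocol):
    """Anything that can score how strongly a premise supports a hypothesis."""

    def score(self, premise: str, hypothesis: str) -> float:
        """Return a support score in [0, 1] (assumed interface)."""
        ...


class FakeVerifier:
    """Deterministic test double: canned scores, no model download."""

    def __init__(
        self, scores: Dict[Tuple[str, str], float], default: float = 0.0
    ) -> None:
        self._scores = scores
        self._default = default

    def score(self, premise: str, hypothesis: str) -> float:
        return self._scores.get((premise, hypothesis), self._default)


def unsupported_claims(
    pairs: Iterable[Tuple[str, str]],
    verifier: Optional[EntailmentVerifier] = None,
    threshold: float = 0.5,  # hypothetical cut-off, to be tuned
) -> List[str]:
    """Return claims whose cited passage scores below the threshold.

    With no verifier the check is skipped, so the default path stays hermetic.
    """
    if verifier is None:
        return []
    return [
        claim for passage, claim in pairs if verifier.score(passage, claim) < threshold
    ]


pairs = [
    ("Halberd was seen at the depot.", "Halberd was at the depot."),
    ("Wren owns a van.", "Wren funds the cell."),
]
fake = FakeVerifier({pairs[0]: 0.9})
unsupported = unsupported_claims(pairs, fake)
```

Because the protocol is structural, a real model-backed verifier only has to expose the same method; the unit tests never import it.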
Candidate entailment models, with a runnable-as-a-CI-gate verdict:
| Model | Base / size | License | Gate verdict |
|---|---|---|---|
| HHEM-2.1-Open (via RAGAS FaithfulnessWithHHEM) | small T5 | open | ✅ hermetic, CPU, cheapest local path |
| MiniCheck-Flan-T5-Large | 770M | Apache-2.0 | ✅ GPU; "best <1B, reaches GPT-4" on LLM-AggreFact; sentence-level (decompose first) |
| AlignScore-large | 355M RoBERTa | open | ✅ fastest (~0.18s/ex) |
| LIM-RA | ~350M DeBERTa | open | ✅ best accuracy/size (SummaC bal-acc 78.5 vs AlignScore 74.0) |
| Bespoke-MiniCheck-7B | ~8B | commercial-gated | ⚠️ skip for a permissive repo |
| RAGAS-default / FActScore / ALCE-TRUE-11B | frontier judge | (varies) | ⚠️ integration-tier (key-gated) only |
# research(2026-06): ALCE precision/recall + MiniCheck/HHEM entailment.
Lead: HHEM-2.1-Open for the hermetic gate, with the frontier-judge variant
key-gated like the existing live test. The gap none of the surveyed tools close
is uncited-claim detection; Stage 1 above is that missing piece (decompose → coverage), now built.
Load-bearing caveat. These NLI models are trained on factual entailment; analytic notes use estimative/hedged language ("likely", "assessed with moderate confidence"). Off-the-shelf entailment may misjudge hedged claims, so validate against a small hedged-claim set before trusting Stage 2. This bridges to Thread 2.
Thread 2, Tradecraft compliance
- ICD-203 (IC analytic standards; fas.org PDF) defines the WEP (words of estimative probability) bands and mandates a split Ariadne does not yet make: likelihood of an event ≠ confidence in the basis for the judgment. A compliant note needs both axes. # research(2026-06):
- LLMs are measurably miscalibrated on estimative language (arXiv:2405.15185: GPT-3.5/4 WEP distributions diverge from humans on 11-12 of 12 standard terms), so enforce an explicit WEP→numeric-band mapping as a lint; don't assume "likely" means what ICD-203 says.
- LLM-RUBRIC (arXiv:2501.00274) is a calibrated multidimensional automated-eval method, and the path to scoring ICD-203 as a rubric rather than a vibe check. ✅ SHIPPED 2026-06-04 (the deployable, pointwise subset; see ADR-0011 and evaluation/rubric.py). It scores the four ICD-203 standards the mechanical gates cannot see (alternatives / argumentation / relevance / accuracy), criterion-separated, anchored 1-5, judge-bias-mitigated. The full method's human-calibration network is deferred until an annotated set exists.
- AgentCDM (arXiv:2508.11995) operationalizes Analysis of Competing Hypotheses (ACH) as multi-agent scaffolding with evidence matrices. Caveat: this is cognitive-science ACH, not the IC/ICD-203 variant, so treat it as direction, not method. Bigger idea: ACH could reshape entity-workup from "synthesize a note" to "enumerate competing hypotheses → marshal cited evidence for/against each."
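A WEP→band lint of the kind described in this thread could start from the sketch below. The band table follows ICD-203's expressions-of-likelihood scale; everything else (the function name `lint_wep_sketch`, the `NONSTANDARD` hedge list, the confidence pattern, and the result shape) is an assumption for illustration and not the shipped lint.

```python
import re
from typing import Dict, Tuple

# ICD-203 likelihood bands in percent; synonyms share a band.
WEP_BANDS: Dict[str, Tuple[int, int]] = {
    "almost no chance": (1, 5), "remote": (1, 5),
    "very unlikely": (5, 20), "highly improbable": (5, 20),
    "unlikely": (20, 45), "improbable": (20, 45),
    "roughly even chance": (45, 55), "roughly even odds": (45, 55),
    "likely": (55, 80), "probable": (55, 80), "probably": (55, 80),
    "very likely": (80, 95), "highly probable": (80, 95),
    "almost certain": (95, 99), "almost certainly": (95, 99),
    "nearly certain": (95, 99),
}

# Hypothetical list of hedges to flag; tune it against real notes.
NONSTANDARD = ("might", "could", "may", "possibly", "perhaps")

# Longest terms first so "very likely" wins over "likely"; the word
# boundaries stop "likely" from matching inside "unlikely".
_TERMS = sorted(WEP_BANDS, key=len, reverse=True)
WEP_RE = re.compile(r"\b(" + "|".join(map(re.escape, _TERMS)) + r")\b")
HEDGE_RE = re.compile(r"\b(" + "|".join(NONSTANDARD) + r")\b")
CONFIDENCE_RE = re.compile(r"\b(low|moderate|high) confidence\b")


def lint_wep_sketch(text: str) -> dict:
    """Map WEP terms to bands, list non-standard hedges, detect confidence.

    Purely lexical: it will also match ordinary uses such as "remote server".
    """
    lowered = text.lower()
    return {
        "bands": {
            m.group(1): WEP_BANDS[m.group(1)] for m in WEP_RE.finditer(lowered)
        },
        "nonstandard": sorted({m.group(1) for m in HEDGE_RE.finditer(lowered)}),
        "has_confidence": bool(CONFIDENCE_RE.search(lowered)),
    }


first = lint_wep_sketch(
    "Halberd is very likely co-located with Wren; it might also fund the cell."
)
second = lint_wep_sketch(
    "We assess with moderate confidence that the transfer is unlikely."
)
```

Keeping likelihood (`bands`) and confidence (`has_confidence`) as separate fields mirrors the two-axis split ICD-203 requires.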
Thread 3, Eval harness for the four success criteria
The planted Compound-Alpha bridge in the seed graph is latent ground truth;
the literature supplies the metrics. # research(2026-06):
- Multi-hop answer + supporting-evidence F1 is the standard: MuSiQue (arXiv:2108.00573) (2-4 hop, engineered against shortcuts) and HotpotQA supporting-fact F1 (directly transferable to "did the note cite the right edges").
- Trajectory eval ("did it traverse vs. guess"): AgenticRAGTracer (arXiv:2602.19127) grades the reasoning trajectory, not just the answer. provenance.jsonl already records the path (g11/g12/g13 hit the bridge), so a note that names the bridge with no ledger entry traversing it is a guess and should fail.
- GraphRAG-specific: GraphRAG-Bench (arXiv:2506.02404, ICLR'26). Beyond RAG for CTI is the same HRAG/AGRAG paper already cited in best-practice-architecture.md; retrieval and eval research converge on it.
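Supporting-fact F1 transfers to graph edges directly: treat the edges a note cites as the predicted set and the planted edges as gold. The sketch below shows the arithmetic; the edge tuples and the intermediate node names (`cell-7`, `site-a`, `site-b`) are invented for the example and are not the seed graph's contents.

```python
from typing import Dict, Set, Tuple

Edge = Tuple[str, str, str]  # (source, relationship, target)


def supporting_fact_f1(predicted: Set[Edge], gold: Set[Edge]) -> Dict[str, float]:
    """Set-based precision, recall and F1 over cited edges."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical gold edges for a three-hop bridge.
gold_edges = {
    ("Halberd", "MEMBER_OF", "cell-7"),
    ("cell-7", "CO_LOCATED", "site-a"),
    ("Wren", "CO_LOCATED", "site-a"),
}
# The note cited two correct edges and one irrelevant one.
cited_edges = {
    ("Halberd", "MEMBER_OF", "cell-7"),
    ("Wren", "CO_LOCATED", "site-a"),
    ("Wren", "CO_LOCATED", "site-b"),
}
scores = supporting_fact_f1(cited_edges, gold_edges)
```

With two of three gold edges found and one spurious citation, precision, recall, and F1 all come out to 2/3.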
Planted-needle design against the fixture: encode the bridge as
{answer: "Halberd↔Wren co-location", required_path: [MEMBER_OF, CO_LOCATED,
CO_LOCATED], required_cites: [...]}, then score each run on (1) recall (surfaced
it?), (2) trajectory (provenance actually traversed the path?), (3) precision
(bridge facts cited to the right queries?), (4) pivot-burden proxy (hop/query
count to reach it).
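The four scores can be sketched against a toy fixture as follows. This is an illustration of the design, not the harness itself: the ledger shape (`id`, `edge_types`), the `answer_terms` field, the query ids `q1`-`q3`, and the function name `score_run` are all hypothetical. Precision is read here as the share of the note's bridge citations that point at required queries, and pivot burden is simplified to the total query count.

```python
from typing import Dict, Iterable, List, Sequence

# Toy fixture; ids and answer_terms are placeholders, not the real fixture.
NEEDLE = {
    "answer_terms": ["Halberd", "Wren"],
    "required_path": ["MEMBER_OF", "CO_LOCATED", "CO_LOCATED"],
    "required_cites": ["q1", "q2", "q3"],
}


def is_subsequence(wanted: Sequence[str], seen: Iterable[str]) -> bool:
    """True if `wanted` occurs in `seen` in order, gaps allowed."""
    remaining = iter(seen)
    return all(any(step == item for item in remaining) for step in wanted)


def score_run(
    note: str, cited_ids: List[str], ledger: List[Dict], needle: Dict
) -> Dict:
    """Score one run on recall, trajectory, precision and pivot burden."""
    lowered = note.lower()
    recall = all(term.lower() in lowered for term in needle["answer_terms"])
    # Trajectory: the ledger must show the required edge types, in order.
    traversed = [edge for entry in ledger for edge in entry["edge_types"]]
    trajectory = is_subsequence(needle["required_path"], traversed)
    cited, required = set(cited_ids), set(needle["required_cites"])
    precision = len(cited & required) / len(cited) if cited else 0.0
    return {
        "recall": recall,
        "trajectory": trajectory,
        "grounded": recall and trajectory,  # surfaced AND actually traversed
        "precision": precision,
        "pivot_burden": len(ledger),
    }


ledger = [
    {"id": "q1", "edge_types": ["MEMBER_OF"]},
    {"id": "q2", "edge_types": ["CO_LOCATED"]},
    {"id": "q3", "edge_types": ["CO_LOCATED"]},
]
note = "Halberd and Wren are co-located [cite:q2][cite:q3]."
traversed_run = score_run(note, ["q2", "q3"], ledger, NEEDLE)
guessed_run = score_run(note, ["q2", "q3"], [], NEEDLE)
```

The second call models a guess: the note names the bridge, but the empty ledger shows no traversal, so `grounded` is false even though recall is true.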
Decisions this pass produces
- Citation gate v2 Stage 1: uncited-claim detection (recall). Shipped 2026-06-02. # research(2026-06): ALCE citation recall.
- Citation gate v2 Stage 2: entailment (precision) via HHEM-2.1-Open; framework shipped 2026-06-02, surfaced on the CLI as workup --entail on 2026-06-04 (estimative claims are routed to the calibration lint, not HHEM).
- Tradecraft lint ✅ SHIPPED (2026-06-02), provenance/tradecraft.py lint_estimative_language: flags non-standard estimative hedges, maps used WEP terms to their ICD-203 band, detects the analytic-confidence axis. Advisory tradecraft.json artifact. Remaining: LLM-RUBRIC scoring; have the entity-workup skill prompt the agent to use WEP terms + state confidence (the real note currently uses neither).
- Phase 4 eval harness ✅ SHIPPED (2026-06-02), evaluation/needle.py score_workup + HALBERD_FIXTURE + ariadne eval <dir>: scores recall, trajectory (traversed vs guessed), grounded (both), and pivot-burden against the planted Compound-Alpha needle. The real Phase-1 Halberd workup scores grounded=True (recall 1.0, trajectory 1.0, 14 queries). Extended with per-edge supporting-fact F1 + a cross-store needle (wren-tie), and with cross-store reconciliation scoring (2026-06-04, evaluation/reconcile.py, ariadne eval --reconcile): grades whether a note corroborated cross-store agreements and flagged conflicts (fact surfaced + reconciliation language + both stores queried). A live two-store Halberd workup scored reconciliation=1.00. Remaining: more fixtures; governance checks.
- ACH-structured entity-workup (2026-06-04): the note template now has an Alternatives-considered (analysis of competing hypotheses) section and the skill directs a brief ACH on the decisive finding; measured to lift the ICD-203 rubric alternatives score. (AgentCDM-style multi-agent ACH remains a bigger, optional bet.)
Key sources
- ALCE, arXiv:2305.14627 (citation precision/recall)
- MiniCheck, github/Liyan06/MiniCheck · AlignScore, ACL 2023 · LIM-RA, arXiv:2404.06579 · RAGAS faithfulness, docs · FActScore, EMNLP 2023
- ICD-203, fas.org PDF · estimative-uncertainty calibration, arXiv:2405.15185 · LLM-RUBRIC, arXiv:2501.00274 · AgentCDM, arXiv:2508.11995
- MuSiQue, arXiv:2108.00573 · AgenticRAGTracer, arXiv:2602.19127 · GraphRAG-Bench, arXiv:2506.02404