0011, LLM-rubric scoring of analytic standards (ICD-203)¶

Status: Accepted (2026-06-04)
Deciders: Ariadne maintainers
Relates to: ADR-0010 (rubric scores are a governance signal the observability layer can surface); the citation gate, the WEP tradecraft lint, and the planted-needle eval (the mechanical rigor checks this complements)

Context¶

The brief's central challenge is specification & validation: "how do you know what works?" Ariadne already answers part of it mechanically: the citation gate scores sourcing (recall + entailment), the tradecraft lint scores expression of uncertainty (ICD-203 WEP bands), and the planted-needle harness scores whether the analysis traversed the ground-truth path rather than guessing. Those are deterministic and hermetic.

But several ICD-203 analytic tradecraft standards are not mechanically checkable, they require a reader's judgment: did the note weigh alternatives, argue logically from its evidence, stay relevant to the analytic question, and keep its judgments proportionate to the evidence? A regex cannot score these. Without a measure of them, "does this analytic product meet tradecraft standards?" stays a vibe check, and note quality cannot be tracked across runs.

Decision drivers¶

Cover the standards the mechanical gates cannot: complement, do not duplicate, the citation/WEP/needle checks.
Calibrated and debuggable, not a single opaque score: a reviewer must see why a note scored low on a given standard.
Hermetic core: the scoring engine must run in CI with no model and no network; only the real judge needs an API key, like the live-agent e2e.
Longitudinal: usable as an overnight quality signal and an optional CI gate, so regressions in analytic quality surface.

Considered options¶

A. LLM-rubric, pointwise, criterion-separated (chosen)¶

A manually authored rubric, one dimension per uncovered ICD-203 standard, each with anchored 1-5 descriptors, scored pointwise by an LLM judge, one criterion at a time with structured (forced-tool) output, aggregated to an overall mean. The deployable subset of LLM-Rubric (Hashemi & Eisner, ACL 2024).

Pros:
Pointwise analytic rubrics are the documented best tool for "debugging and longitudinal monitoring" (vs pairwise, which suits model selection).
Criterion-by-criterion scoring is debuggable, each dimension's score + rationale shows exactly where a note fell short.
Anchored 1-5 levels are the per-level calibration the literature calls for; a narrow scale reduces score drift.
Injected-judge design keeps the engine pure: a fake judge makes the whole scoring/aggregation path hermetic; the real Claude judge sits behind the optional rubric extra.
Bias mitigations are explicit: the judge prompt scores quality not length (verbosity bias), and asks for rationale before score with a forced schema (format/anchoring bias).
Cons:
LLM judges are imperfectly calibrated to humans; this is a directional quality signal, not ground truth.
The full LLM-Rubric method trains a small calibration network on human annotations to predict the overall human score. Ariadne has no annotated set yet, so we ship the rubric-scoring subset (mean of dimensions) and treat the calibration network as a documented extension for when annotations exist.

B. Pairwise LLM-as-judge¶

Pros: more reliable per-comparison; the standard for ranking candidates.
Cons: needs both-orderings and quadratic calls; suits model selection, not scoring a single note against an absolute standard, which is the job here. Position bias is a first-order concern.

C. Single holistic "rate this note 1-10" prompt¶

Pros: cheapest; one call.
Cons: an opaque vibe score with no per-standard breakdown, exactly the failure mode this ADR exists to avoid. Maximally exposed to length/format bias.

D. No LLM eval, mechanical checks only¶

Pros: fully deterministic, no API.
Cons: leaves the judgment-requiring standards (alternatives, argumentation, relevance, accuracy) entirely unmeasured. Half the tradecraft picture.

Decision¶

Adopt A. evaluation/rubric.py holds the pure engine, ICD203_RUBRIC (four dimensions the mechanical gates do not cover, anchored 1-5), an AnalyticJudge Protocol, and score_note / score_note_dir. evaluation/judge.py holds ClaudeAnalyticJudge, behind the optional rubric extra (lazy anthropic import), scoring one dimension per forced submit_score tool call. ariadne rubric <dir> runs it (API-gated), informational by default, a pass/fail CI gate with --min.

The four scored dimensions, chosen because the mechanical layer cannot see them:

Dimension	ICD-203 standard	Why not mechanical
`alternatives`	#4 analysis of alternatives	Requires reading whether competing explanations were weighed
`argumentation`	#6 clear, logical argumentation	Requires judging whether reasoning follows from evidence
`relevance`	#5 customer relevance + implications	Requires judging the "so what"
`accuracy`	#8 accuracy of judgments	Requires judging whether claims overreach the evidence

(Sourcing #2 → citation gate; uncertainty #3 → WEP lint; fact-vs-judgment → tradecraft.is_analytic_judgment. Those stay mechanical.)

Consequences¶

The judgment-requiring ICD-203 standards become a per-dimension, debuggable, trackable score, closing the half of "how do you know it works?" the mechanical gates cannot reach.
The score is a CI-gateable governance signal (--min) and a candidate observability metric (ADR-0010).
The signal is directional: LLM-judge calibration is imperfect, and the human-calibration network is deferred until an annotated set exists. Treat rubric scores as relative/longitudinal, not absolute ground truth.
This also motivates a future entity-workup skill-prompt improvement: telling the agent it will be scored on these four standards should raise note quality (the real notes currently weigh few alternatives and state few implications).