0032, Deepening a skill from new experience: distil --into, bounded and compare-gated
- Status: Accepted (2026-06-07)
- Deciders: Ariadne maintainers
- Refines: ADR-0029 (which deferred "deepening an existing skill") · the held-out gate is ADR-0031 (ariadne compare) · an ADR-0030 reflection can motivate a deepen
Context
B2's distil creates a skill from one good run. But a skill should improve from experience,
not be authored once and frozen: "when an agent repeatedly exhibits a failure pattern, engineers
update the corresponding skill so one observed failure becomes a clarified procedure or an added
constraint." Trace2Skill's second mode is exactly this — deepening an existing skill, not only
creating new ones — and ADR-0029 deferred it. The contestable questions: how to integrate a new
run's lessons without the documented failure mode (overfitting to one trajectory's specifics →
fragmented, bloated skills), and how to know the deepened skill is actually better. Hence this ADR.
Decision drivers
- Procedural memory compounds. A create-only distil can't absorb a second run's lessons; the value of a skill library is that skills get better across uses (the procedural-memory thesis).
- Bounded, conflict-aware integration, not a rewrite. Naive deepening "sequentially overfits to non-generalizable trajectory-local lessons"; the fix is bounded edits with conflict detection and format validation (Trace2Skill / SkillOpt's bounded add/delete/replace), integrating the generalizable lesson while preserving the existing skill's accumulated knowledge.
- Trace-conditioned revision. The operation is existing skill + a new trace → a revised skill (SkillRevise), a different operation from creating from scratch.
- The held-out gate is compare. SkillOpt accepts an edit only if it improves validation; Ariadne already owns that gate: ADR-0031's net-effect compare. Deepen proposes; compare measures the revised skill against the original; a human ratifies. The propose → ratify → freeze spine and the same certified-source gate as B2 both hold.
- LLM-only, honestly. Integrating procedural prose needs the model. A deterministic deepen could only append the new run's moves, which is the bloat-and-fragmentation Trace2Skill warns against, so there is no deterministic deepen (the honest capability line, the inverse of create-mode, where the deterministic path records faithfully).
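The "bounded add/delete/replace with conflict detection and format validation" driver can be illustrated with a minimal sketch. Everything here is hypothetical and not taken from Ariadne: the `Edit` type, the `apply_bounded_edits` name, the section-keyed dict model of a skill, and the default bound of three edits are assumptions made only to show the shape of the idea.

```python
from dataclasses import dataclass


class EditRejected(ValueError):
    """Raised when a proposed edit set breaks the bound or conflicts."""


@dataclass(frozen=True)
class Edit:
    op: str         # "add", "delete", or "replace" (hypothetical encoding)
    section: str    # heading of the skill section the edit targets
    text: str = ""  # new body for add/replace; ignored for delete


def apply_bounded_edits(sections, edits, max_edits=3):
    """Return a revised copy of `sections`; the input dict is never mutated."""
    # Bound: a deepen may touch only a few sections, never rewrite the skill.
    if len(edits) > max_edits:
        raise EditRejected(f"{len(edits)} edits exceed the bound of {max_edits}")

    revised = dict(sections)
    touched = set()
    for edit in edits:
        # Conflict detection: two edits may not target the same section.
        if edit.section in touched:
            raise EditRejected(f"conflicting edits on section {edit.section!r}")
        touched.add(edit.section)

        exists = edit.section in sections
        if edit.op == "add":
            if exists:
                raise EditRejected(f"section {edit.section!r} already exists")
        elif edit.op in ("replace", "delete"):
            if not exists:
                raise EditRejected(f"section {edit.section!r} does not exist")
        else:
            raise EditRejected(f"unknown op {edit.op!r}")

        if edit.op == "delete":
            del revised[edit.section]
        else:
            # Format validation: added or replaced sections need a body.
            if not edit.text.strip():
                raise EditRejected(f"empty body for section {edit.section!r}")
            revised[edit.section] = edit.text
    return revised
```

Under these assumptions, a single added "Pitfalls" section is accepted and leaves the other sections intact, while a replace and a delete aimed at the same section are rejected as a conflict rather than applied in sequence.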
Considered options
- Re-distill from scratch, ignoring the existing skill. Rejected. Discards the existing skill's accumulated (and possibly human-authored) knowledge; it is creation, not deepening.
- Deterministically append the new run's observed moves to the skill. Rejected. Bloat and fragmentation — the exact failure mode — with no conflict resolution or generalization.
- Auto-overwrite the existing skill in place when a new good run arrives. Rejected. An unratified edit enters the active skill (breaches the spine), and you cannot tell it helped without measuring — the very thing ADR-0031 exists for.
- distil --into <skill>: an LLM-driven, bounded, conflict-aware revision of the existing skill from a new certified run, proposed for ratification and validated by compare. Chosen.
Decision
Adopt option 4, extending distil in learning/distil.py with a deepen mode.
- Trace-conditioned revision. distil <run> --into <skill-dir> reads the existing SKILL.md (frontmatter + body) and the new run, and the model proposes a revised skill that integrates the run's generalizable lesson. The prompt instructs a bounded integration that preserves the existing structure and does not hard-code this run's entities (guarding against the overfitting failure mode). It reuses the propose_skill forced tool (name / description / body).
- Certified-source gate. Deepen integrates lessons only from a run the eval harness certified grounded, the same gate as create-mode (B2). You deepen from successes, not hallucinations.
- LLM-only. --into requires --llm; without it, a clear error ("deepening requires --llm; the deterministic distiller can only create, not integrate").
- Propose, never overwrite. The revised skill is written to skills-proposed/<name>/ (it never overwrites the original); distilled_by is stamped llm:<model>:deepen and source records the deepening run. Ratification is measured: run ariadne compare (original vs revised on the same fixture) and ratify only on a net gain. This is SkillOpt's held-out edit gate, with a human in the loop.
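The four rules above can be sketched as a single guarded function. This is an illustration under stated assumptions, not the code in learning/distil.py: the `propose_deepen` name, the run record shape (`id`, `verdict`), the `revise` callable standing in for the model call, and the exact frontmatter layout are all hypothetical. Only the error message, the `distilled_by` stamp format, and the `source` field come from the decision text.

```python
import json
from pathlib import Path


class DeepenError(Exception):
    """Raised when a deepen request fails one of the gates."""


def propose_deepen(run, skill_dir, *, use_llm, model, revise, proposed_root):
    # LLM-only: there is no deterministic deepen.
    if not use_llm:
        raise DeepenError(
            "deepening requires --llm; the deterministic distiller "
            "can only create, not integrate"
        )
    # Certified-source gate (the "verdict" field name is an assumption).
    if run.get("verdict") != "grounded":
        raise DeepenError("deepen only integrates runs certified grounded")

    original = Path(skill_dir) / "SKILL.md"
    existing = original.read_text(encoding="utf-8")

    # `revise` stands in for the model call; it is assumed to return a dict
    # with the three propose_skill fields: name, description, body.
    proposal = revise(existing, run)

    # Propose, never overwrite: write under the proposals root only.
    target = Path(proposed_root) / proposal["name"] / "SKILL.md"
    if target.resolve() == original.resolve():
        raise DeepenError("a proposal must never overwrite the original skill")

    stamp = f"llm:{model}:deepen"
    # json.dumps yields double-quoted scalars, which are also valid YAML.
    frontmatter = "\n".join([
        "---",
        f"name: {json.dumps(proposal['name'])}",
        f"description: {json.dumps(proposal['description'])}",
        f"distilled_by: {json.dumps(stamp)}",
        f"source: {json.dumps(run['id'])}",
        "---",
    ])
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(f"{frontmatter}\n{proposal['body']}\n", encoding="utf-8")
    return target
```

Passing the model call in as `revise` keeps the gates testable without a live model; whether the real distiller is structured this way is not stated in this ADR.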
Consequences
- Skills now compound from experience: a second (third, …) good run can be folded into an existing skill rather than spawning a redundant new one — procedural memory, not a flat cache.
- Deepen + compare is SkillOpt's controlled edit loop made explicit and human-ratified: a bounded revision, accepted only on measured net gain. And B3 → deepen → compare is the full failure-pattern → clarified-procedure → validated path the consumer story describes.
- The eval gate is untouched and the active skill is never silently edited; only a ratified, measured revision freezes.
- Honest scoping (YAGNI): this slice deepens from one run into one skill. Multi-run batch consolidation (Trace2Skill's parallel fleet), a rejected-edit buffer for negative feedback (SkillOpt), union of the existing skill's structured prerequisites, and auto-applying a compare-ratified revision are deferred — named, not built.
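The "ratify only on a net gain" rule can be made concrete with a small sketch. The output format of ariadne compare is not described in this ADR, so the per-case score dicts, the `net_effect` and `recommendation` names, and the improved-minus-regressed definition of net gain are assumptions for illustration only. The function only recommends; consistent with the deferral above, nothing is auto-applied and a human still ratifies.

```python
def net_effect(original, revised):
    """Compare per-case scores for the original and revised skill.

    Both arguments map fixture case ids to scores (higher is better).
    """
    if original.keys() != revised.keys():
        raise ValueError("compare needs the same fixture cases on both sides")
    improved = sorted(c for c in original if revised[c] > original[c])
    regressed = sorted(c for c in original if revised[c] < original[c])
    return {
        "improved": improved,
        "regressed": regressed,
        "net": len(improved) - len(regressed),
    }


def recommendation(effect):
    # A strictly positive net effect is required; a tie keeps the original.
    if effect["net"] > 0:
        return "eligible for human ratification"
    return "keep original"
```

For example, under these assumptions a revision that improves two cases and regresses none has a net effect of 2 and becomes eligible for ratification, while one that improves one case and regresses another nets to 0 and the original skill stays active.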
Sources: Trace2Skill — deepening existing skills, the overfitting failure mode, conflict detection (arXiv 2603.25158); SkillOpt — bounded edits + a held-out gate that accepts only validation-improving edits (arXiv 2605.23904); SkillRevise — trace-conditioned skill revision (arXiv 2606.01139); skills as version-controlled files updated when a failure pattern recurs (Externalization in LLM Agents, arXiv 2604.08224).