Ariadne design: dataset-agnostic sensemaking pipeline

Status: design, awaiting review. Authored 2026-06-03. Driver: demo feedback (prove the harness handles sensemaking on real corpora beyond the synthetic Neo4j graph, and make adding a dataset easy for future devs). Target corpora: Enron (public) and Avocado / LDC2015T03 (license- and PII-gated). Source research (June 2026), recorded inline below: canonical-data-model integration pattern; pgvector ≤10M-vector guidance; hybrid (BM25 + vector) retrieval; deterministic email-header graphing.

Goal

One pipeline, many datasets. A future dev adds a corpus by writing one adapter that maps raw data to a small canonical schema; a dataset-agnostic indexer fans those records out into the standard stores; the agent, entity-workup skill, connectors, and eval harness do not change. Litmus test: write an adapter, register it, run `ariadne workup <entity> --dataset <name>`.

Decisions (research-grounded)

D1: Canonical schema + adapter contract + registry

# research(2026-06): The canonical-data-model pattern replaces N×M point-to-point translation with two adapters per source (to and from a shared model). It is justified here (3+ datasets, explicit extensibility goal); the documented anti-case is one-off prototypes, which this is not. Guard against the two named pitfalls: the "god model" (keep the core minimal; dataset-specific fields live in the attributes dicts and never widen the core records) and ownership/versioning (the schema is owned here; bump schema_version on every change). Sources: Canonical Data Model guide (2026), Enterprise Integration Patterns.

D2: Hybrid retrieval, full-text first; pgvector for the semantic leg

# research(2026-06): Hybrid retrieval (BM25-style full-text + dense vectors, fused with RRF) beats either leg alone, and full-text is essential for the exact identifiers that dominate email entity lookup (addresses, names, codenames). Full-text is the must-have first leg: it needs no embedding step, so PII content never leaves the box to be embedded (an Avocado/air-gap win). pgvector adds the semantic leg: the 2026 consensus is "pgvector if you already run Postgres and are under 10M vectors", and Enron at ≈500K emails sits an order of magnitude inside that limit (assuming roughly one vector per message; chunking bodies would raise the count). Both legs live in the one access-controlled Postgres store already built. Escape hatch if scale grows: pgvectorscale (StreamingDiskANN) or a dedicated store (Qdrant). This reconciles with Anthropic's "default to agentic search, add semantic only when needed." → ADR-0007. Sources: hybrid BM25+vector (2026), full-text for RAG, vector DB benchmarks 2026.
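To make the fusion step in D2 concrete, here is a minimal sketch of Reciprocal Rank Fusion over two ranked id lists. The function name `rrf_fuse`, the message ids, and the constant `k=60` (a commonly used default) are assumptions for illustration; the real connector would get its two rankings from Postgres full-text and pgvector queries.

```python
from collections import defaultdict
from typing import Iterable, Sequence


def rrf_fuse(rankings: Iterable[Sequence[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of document ids with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from the two legs for one query.
full_text_hits = ["msg-17", "msg-04", "msg-92"]  # full-text leg
vector_hits = ["msg-04", "msg-55", "msg-17"]     # semantic (vector) leg

fused = rrf_fuse([full_text_hits, vector_hits])
print(fused)
```

Because RRF uses only ranks, the two legs never need comparable scores, which is why it suits mixing a lexical ranking with a vector-distance ranking.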

D3: Email → graph from headers, deterministically; LLM body extraction deferred

# research(2026-06): The communication graph (who→whom, when) comes from email headers, which are deterministic, cheap, and reliable. LLM entity extraction is "biased toward frequent entities" and adds cost and error for what headers give for free; it is a later semantic-enrichment pass over bodies, not the backbone. Sources: LLM-powered enterprise KGs.

D4: Governance lives at the canonical layer

An adapter declares access ∈ {public, restricted}. A restricted adapter (Avocado) reads only from a gitignored local path, never commits or pushes, and is gated behind an explicit authorized-access flag. Whether PII content may reach a cloud model is flagged for the cloud-vs-air-gap fork, not solved here. Putting governance at the canonical choke point means every new dataset inherits it. → a future governance ADR (Phase C; number assigned when written).

Canonical schema (the contract; keep minimal)

| Record | Fields | Lands in |
| --- | --- | --- |
| `Entity` | `id`, `type`, `name`, `aliases`, `attributes: dict` | graph node |
| `Relationship` | `src`, `dst`, `type`, `attributes: dict` | graph edge |
| `Document` | `id`, `text`, `source_entity_ids`, `metadata: dict`, `modality` | full-text + vector (text); relational (metadata) |
| `Attribute` | `entity_id`, `key`, `value` | relational row |

Entity.type is open (person, org, unit, topic, …): person is the primary entity type in v1, but topics and events can be added later with no schema change.
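As a rough illustration of the schema table, the four records could be written as frozen dataclasses. Field names follow the table; the default values, the tuple types, the `SCHEMA_VERSION` constant name and its starting value, and the `Canonical` union alias are assumptions, not the final contents of `canonical.py`.

```python
from dataclasses import dataclass, field
from typing import Any, Union

SCHEMA_VERSION = 1  # assumed starting value; bump on any change to the core records


@dataclass(frozen=True)
class Entity:
    id: str
    type: str  # open vocabulary: "person", "org", "unit", "topic", ...
    name: str
    aliases: tuple[str, ...] = ()
    attributes: dict[str, Any] = field(default_factory=dict)  # dataset-specific fields


@dataclass(frozen=True)
class Relationship:
    src: str
    dst: str
    type: str
    attributes: dict[str, Any] = field(default_factory=dict)


@dataclass(frozen=True)
class Document:
    id: str
    text: str
    source_entity_ids: tuple[str, ...] = ()
    metadata: dict[str, Any] = field(default_factory=dict)
    modality: str = "text"  # assumed default


@dataclass(frozen=True)
class Attribute:
    entity_id: str
    key: str
    value: Any


Canonical = Union[Entity, Relationship, Document, Attribute]

# A topic entity needs no schema change, only a different `type` value.
topic = Entity(id="topic:example", type="topic", name="Example topic")
```

Keeping dataset-specific fields inside `attributes` and `metadata` is what lets a new corpus carry extra data without widening the four core records.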

Adapter contract

from __future__ import annotations  # keeps the forward references below unevaluated

from typing import Iterable, Literal, Protocol

class DatasetAdapter(Protocol):
    name: str                                  # registry key / --dataset value
    entity_type: str                           # primary entity, e.g. "person"
    access: Literal["public", "restricted"]    # governance tier (D4)
    def load(self) -> Iterable[Canonical]: ...          # Entity | Relationship | Document | Attribute
    def eval_fixtures(self) -> list[NeedleFixture]: ...  # known-answer needles

Adapters are registered in `DATASETS = {...}` and selected by `--dataset` (same idiom as the existing `FIXTURES` registry).
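The registry idiom might look roughly like the sketch below. `ToyAdapter` and `get_adapter` are hypothetical names introduced only for illustration, and storing adapter classes (rather than instances) in `DATASETS` is an assumption made so that importing the registry never touches any data.

```python
from typing import Iterable


class ToyAdapter:
    """Hypothetical adapter that satisfies the contract structurally."""

    name = "toy"
    entity_type = "person"
    access = "public"

    def load(self) -> Iterable[object]:
        return iter(())  # a real adapter would yield canonical records

    def eval_fixtures(self) -> list:
        return []


# Maps the --dataset value to an adapter class.
DATASETS: dict[str, type] = {
    ToyAdapter.name: ToyAdapter,
}


def get_adapter(name: str):
    """Look up and construct an adapter, listing valid names on failure."""
    try:
        adapter_cls = DATASETS[name]
    except KeyError:
        known = ", ".join(sorted(DATASETS)) or "(none)"
        raise KeyError(f"unknown dataset {name!r}; registered: {known}") from None
    return adapter_cls()


adapter = get_adapter("toy")
print(adapter.name, adapter.access)
```

Under this sketch, adding a dataset is one new class plus one new entry in `DATASETS`, which matches the litmus test in the Goal section.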

Architecture

ariadne workup <entity> --dataset <name>
   │
   ├─ adapter.load()  ──►  canonical records (Entity/Relationship/Document/Attribute)
   │                         │  (restricted adapters: local-only, access-gated — D4)
   ├─ indexer  ──►  Neo4j (Entity+Relationship)
   │                Postgres: full-text + pgvector (Document.text), metadata+Attribute
   │
   └─ agent loop (UNCHANGED): gather → act → verify → synthesize → cited note + ledger + eval

Components

| Unit | Path | Responsibility | Depends on |
| --- | --- | --- | --- |
| canonical schema | `src/ariadne/datasets/canonical.py` | the four record dataclasses + `schema_version` | none (pure) |
| adapter contract + registry | `src/ariadne/datasets/base.py` | `DatasetAdapter` Protocol + `DATASETS` registry + lookup | canonical |
| indexer | `src/ariadne/datasets/indexer.py` | fan canonical records into the stores; idempotent; per-dataset namespace | store clients |
| synthetic adapter | `src/ariadne/datasets/synthetic.py` | wrap the existing seed graph as the first adapter (proves the seam) | canonical |
| enron adapter | `src/ariadne/datasets/enron.py` | HF `corbt/enron-emails` → canonical (headers→graph, body→Document) | canonical, HF datasets |
| avocado adapter | `src/ariadne/datasets/avocado.py` | local PST/export → canonical; `access="restricted"` | canonical, local data |
| hybrid retrieval | `src/ariadne/unstructured/` | full-text + pgvector search exposed to the agent (via Postgres MCP) | Postgres |
| eval fixtures | per-adapter `eval_fixtures()` | dataset-specific known-answer needles | needle harness |
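The indexer row above asks for three properties: fan-out by record type, idempotence, and a per-dataset namespace. A minimal sketch with an in-memory fake store shows one way to get all three by upserting on a key that starts with the dataset name. The record classes here are trimmed stand-ins, and `FakeStore`, `index`, and the key layout are assumptions; `Attribute` rows would route to a relational store in the same way.

```python
from dataclasses import dataclass
from typing import Any, Iterable


@dataclass(frozen=True)
class Entity:
    id: str
    type: str
    name: str


@dataclass(frozen=True)
class Relationship:
    src: str
    dst: str
    type: str


@dataclass(frozen=True)
class Document:
    id: str
    text: str


class FakeStore:
    """In-memory stand-in for a store client; upsert overwrites by key."""

    def __init__(self) -> None:
        self.rows: dict[tuple[str, ...], Any] = {}

    def upsert(self, key: tuple[str, ...], value: Any) -> None:
        self.rows[key] = value


def index(dataset: str, records: Iterable[Any], graph: FakeStore, text: FakeStore) -> None:
    """Route each record to its store under a per-dataset namespace."""
    for rec in records:
        if isinstance(rec, Entity):
            graph.upsert((dataset, "node", rec.id), rec)
        elif isinstance(rec, Relationship):
            graph.upsert((dataset, "edge", rec.src, rec.type, rec.dst), rec)
        elif isinstance(rec, Document):
            text.upsert((dataset, "doc", rec.id), rec)
        else:
            raise TypeError(f"not a canonical record: {type(rec).__name__}")


records = [
    Entity("p1", "person", "A. Example"),
    Entity("p2", "person", "B. Example"),
    Relationship("p1", "p2", "EMAILED"),
    Document("d1", "hello"),
]
graph, text = FakeStore(), FakeStore()
index("toy", records, graph, text)
index("toy", records, graph, text)  # re-running overwrites the same keys, adds nothing
```

The same fake store is the kind of double the hermetic unit tests could use, since it needs neither Neo4j nor Postgres.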

Email → canonical mapping (Enron adapter)

Each message → Entity(person) for the sender and each recipient, plus one Relationship(EMAILED, src=sender, dst=recipient, attributes={count, first_seen, last_seen}) per sender→recipient pair, aggregated across messages.
Body → Document(text=body, metadata={subject, ts, folder}, source_entity_ids=[…]).
Account facts (display name, folder owner) → Attribute rows.
No LLM in the v1 path (D3).
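The header-only mapping above can be sketched with the standard library alone. This assumes each message is available as raw RFC 822 text with a parseable Date header (the Hugging Face dataset may instead expose pre-parsed fields), and it lower-cases addresses to use as entity ids. The sample messages, `RAW_MESSAGES`, and `build_edges` are made up for illustration.

```python
from datetime import datetime
from email import message_from_string
from email.utils import getaddresses, parsedate_to_datetime

# Two made-up messages standing in for recorded raw fixtures.
RAW_MESSAGES = [
    "From: alice@example.com\n"
    "To: bob@example.com, carol@example.com\n"
    "Date: Mon, 01 Jan 2001 09:00:00 +0000\n"
    "Subject: plan\n"
    "\n"
    "First draft.",
    "From: alice@example.com\n"
    "To: Bob@example.com\n"
    "Date: Tue, 02 Jan 2001 10:30:00 +0000\n"
    "Subject: re: plan\n"
    "\n"
    "Updated.",
]


def build_edges(raw_messages: list[str]) -> dict[tuple[str, str], dict]:
    """Aggregate one EMAILED edge per (sender, recipient) pair from headers only."""
    edges: dict[tuple[str, str], dict] = {}
    for raw in raw_messages:
        msg = message_from_string(raw)
        senders = getaddresses(msg.get_all("From", []))
        if not senders or not senders[0][1]:
            continue  # no usable sender: skip rather than guess
        sender = senders[0][1].lower()
        ts: datetime = parsedate_to_datetime(msg["Date"])
        recipients = getaddresses(msg.get_all("To", []) + msg.get_all("Cc", []))
        for _display_name, address in recipients:
            if not address:
                continue
            key = (sender, address.lower())
            edge = edges.setdefault(key, {"count": 0, "first_seen": ts, "last_seen": ts})
            edge["count"] += 1
            edge["first_seen"] = min(edge["first_seen"], ts)
            edge["last_seen"] = max(edge["last_seen"], ts)
    return edges


edges = build_edges(RAW_MESSAGES)
```

Each key in `edges` would become one `Relationship(EMAILED, ...)`, and each distinct address one `Entity(person)`; the same input always yields the same graph, which is the determinism D3 relies on.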

Governance (Avocado, D4)

Raw Avocado data lives under a gitignored path (reuses the existing data/ ignore); never committed, never pushed.
The restricted adapter refuses to run unless an explicit authorized flag/env is set (e.g. ARIADNE_ALLOW_RESTRICTED=1).
Malware caveat (loveletter in ~27 msgs): the adapter ingests text/headers only; attachments are not executed.
PII-to-cloud-model handling is recorded as an open fork question, not resolved.
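The access gate described above could be as small as the following sketch. The environment variable name comes from the example in this section; the function names, the exception class, and the rule that only the exact value "1" authorizes access are assumptions. Passing the environment in as a mapping keeps the gate testable without any Avocado data.

```python
import os
from typing import Mapping, Optional


class RestrictedAccessError(PermissionError):
    """Raised when a restricted adapter is used without explicit authorization."""


def is_authorized(access: str, env: Mapping[str, str]) -> bool:
    """Public datasets always pass; restricted ones need the explicit flag."""
    if access == "public":
        return True
    if access == "restricted":
        return env.get("ARIADNE_ALLOW_RESTRICTED") == "1"
    raise ValueError(f"unknown access level: {access!r}")


def require_access(access: str, env: Optional[Mapping[str, str]] = None) -> None:
    """Call before adapter.load(); refuses restricted data by default."""
    env = os.environ if env is None else env
    if not is_authorized(access, env):
        raise RestrictedAccessError(
            "restricted dataset: set ARIADNE_ALLOW_RESTRICTED=1 to authorize"
        )


# The gate fails closed: an empty environment refuses restricted access.
print(is_authorized("restricted", {}))
print(is_authorized("restricted", {"ARIADNE_ALLOW_RESTRICTED": "1"}))
```

Calling `require_access(adapter.access)` at the single point where adapters are loaded is one way to make every future dataset inherit the gate, in line with D4.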

Testing strategy

Unit (hermetic): canonical record validation; registry lookup; indexer fan with a fake store; Enron mapping over a few recorded raw messages → expected canonical records; restricted-adapter access gate (refuses without the flag); hybrid-retrieval ranking over a fixture corpus (full-text + vector legs, fusion).
Integration (-m integration): Enron sample loaded into testcontainers Neo4j + Postgres(+pgvector); a live workup --dataset enron behind a key check surfaces a known Enron tie with citations; eval --fixture scores it grounded.

Build order (each independently shippable)

A — Abstraction: canonical schema + contract + registry + indexer; refactor the existing synthetic graph into the first adapter (no new data, proves the seam).
B — Enron: HF adapter + hybrid retrieval connector (full-text first, then pgvector) → generalization + tri-modal sensemaking on real data.
C — Avocado: restricted adapter with access-control governance; built now, populated when licensed data is provided.

Out of scope (later)

LLM body entity-extraction; cross-dataset entity resolution; the cloud-vs-air-gap PII fork; subagent fan-out (deferred, ADR-0005).

Success criteria (done = all true)

ariadne workup <entity> --dataset synthetic reproduces today's behaviour through the new adapter path (no regression).
--dataset enron loads a real email slice and produces a cited note surfacing a known communication tie; eval scores it grounded.
The Avocado adapter exists, is restricted, and refuses to run without the authorized flag (no data required to test the gate).
Adding a dataset touches only a new adapter file + its eval fixtures.
make lint + make test-unit green; integration green with a key.
ADR-0007 (hybrid retrieval) written; dataset-governance ADR follows in Phase C.