0007, Hybrid retrieval, full-text first (in-Postgres)¶
- Status: Accepted (2026-06-03)
- Deciders: Ariadne maintainers
- Supersedes / superseded by: none
Context¶
The canonical pipeline (ADR-0006) includes a Document leg for unstructured
evidence, email bodies, attachments, free-text records. That leg needs retrieval.
The full design spec
(docs/superpowers/specs/2026-06-03-multi-dataset-pipeline-design.md)
calls for hybrid lexical + semantic retrieval: a full-text (lexical) pass
and a vector-similarity (semantic) pass whose results are fused. The question
is sequencing, what to build first, and at what dependency cost.
Decision drivers¶
- Exact-identifier lookup dominates. Entity searches in the email corpora (Enron, Avocado) centre on names, addresses, codenames, and short phrases. Lexical recall on these terms is high; semantic search adds cost without proportional gain at Phase B1.
- PII must not require an embedding round-trip. Sending document text to a cloud embedding model exposes PII content to a third party before access control has been applied. The lexical leg requires no embedding step.
- Keep dependencies lean. Every new extension or service is a new failure mode for a pre-release project.
- Stay in the one access-controlled store (per ADR-0004). All evidence must
be auditable and traceable through the existing
postgres-mcpexecute_sqlinterface. A separate retrieval service would split the audit trail.
Considered options¶
A. Postgres built-in full-text search (chosen)¶
Generated tsvector column + GIN index + websearch_to_tsquery, queried via
the existing postgres-mcp execute_sql tool. Semantic leg (pgvector +
local embedding model) added later and fused via Reciprocal Rank Fusion (RRF).
- Pro: zero new dependency; no embedding step (PII content never leaves the
box at Phase B1); exact-term precision on names and addresses; the full-text
index is queryable through the same
execute_sqltool already in use. - Con: Postgres's built-in
ts_ranklacks BM25's IDF weighting, so ranking quality is lower than a BM25 extension for longer documents. Acceptable at Phase B1; addressable later via a named upgrade path.
B. Adopt a BM25 extension now (pg_textsearch or ParadeDB pg_search)¶
- Pro: better term-frequency / inverse-document-frequency ranking out of the box.
- Con: introduces a new Postgres extension before ranking quality is a
measured bottleneck; adds complexity earlier than the work justifies.
pg_textsearch(2026 C-native BM25) is the preferred future path when ranking quality matters, faster than ParadeDB and not Neon-deprecated, but it belongs in a named upgrade, not Phase B1.
C. Dedicated vector database¶
Rejected per ADR-0004: consolidate evidence into the one auditable Postgres store. A separate vector DB splits access control and breaks the provenance model the analytic product depends on.
Decision¶
Adopt option A. Ship a generated tsvector column, a GIN index, and
websearch_to_tsquery queries via the existing postgres-mcp execute_sql
tool. Named upgrade paths:
pg_textsearch: C-native BM25, faster than ParadeDB, not Neon-deprecated, when ranking quality on longer documents becomes a measured need.- pgvector + local embedding model: EmbeddingGemma-300M (primary; small footprint, no cloud call); BGE-M3 fallback, for the semantic leg, fused with the lexical leg by Reciprocal Rank Fusion. The semantic leg is an isolated follow-on that does not change the lexical schema.
Consequences¶
- Phase B1 ships the lexical leg with no embedding dependency and no new Postgres extension. PII document content does not leave the box.
- The semantic leg is an isolated, additive follow-on: add a
vectorcolumn, populate it with a local model, and join the two ranked lists via RRF. - All retrieval, both legs, stays inside Postgres and is queryable through
the same
execute_sqlinterface, preserving the single-store audit trail. - The embedding model is injectable via an
Embedderprotocol; the default is the ungatedbge-small-en-v1.5(384-dim, Apache-2.0), chosen over EmbeddingGemma-300m as the default because Gemma is license-gated (HF token friction for dev/CI); EmbeddingGemma-300m (768-dim) remains the documented higher-quality swap.