Skip to content

0018, Multimodal connector slate — text / audio / relational shipped, video deferred

  • Status: Accepted (2026-06-05)
  • Deciders: Ariadne maintainers

Context

To demonstrate Ariadne across heterogeneous data of different kinds, shapes, and sizes — and to stress the dataset-agnostic adapter seam (ADR-0006) — we add HuggingFace-backed dataset connectors beyond the synthetic seed. The goal is breadth that proves the sensemaking thesis (entities, cross-source reconciliation, provenance), not breadth for its own sake.

Decision drivers

  • Entity-rich, on-purpose data. A connector earns its place by exercising entities / relationships / cross-store reconciliation, not by checking a modality box.
  • Agentic-to-text (ADR-0008). Every modality is reasoned over as text, so a dataset that ships text (transcript / caption / description) proves a modality without a live ASR/VQA model.
  • Lean ingestion + auditable. Prefer HF streaming (parquet) or cache-aware download; deterministic mapping (no LLM), like the enron transform.
  • Quality over checkbox. A forced, mismatched dataset lowers project quality.

Decision

Ship a four-kind slate, three connectors now:

Kind Connector Source Access
Documents (text) enron corbt/enron-emails HF stream
Speech (audio) worldspeech disco-eth/WorldSpeech HF stream (transcript; ADR-0008)
Relational (structured) lahman NeuML/baseballdata HF cache-aware CSV download

Video is deferred, not dropped. Surveying the most-downloaded HF video datasets, the field is dominated by robotics manipulation (LeRobot / DROID / bridge / fractal / behavior / RoboTwin / EBench), sign-language gesture sets, VLM training corpora (LLaVA-OneVision), test/junk repos, and QA benchmarks (MSR-VTT, Video-MME) — none entity-rich for intelligence sensemaking. MSR-VTT (generic captioned clips) was explicitly rejected. Per ADR-0008, video would be a fourth modality but the same mechanism WorldSpeech already proves (non-text sensory → text → reason), so forcing a mismatched video set would lower quality for no analytic gain.

Landing criteria (any video connector must meet all, so this is not a forever-defer): (1) entity-rich — named people / orgs / events; (2) HF-streamable or cache-aware-downloadable; (3) ships text annotations (transcript / caption / description) so agentic-to-text needs no live model; (4) acceptable license. The likely source is broadcast-news / hearing transcripts found via full-text search, not the download charts.

Consequences

  • Three real connectors (text / audio / relational) exercise documents, speech, and a clean shared-key relational schema — covering distinct shapes and sizes and the entity-resolution path (ADR-0016).
  • The adapter seam makes the video connector a drop-in when a dataset meets the criteria; no architecture change is owed.
  • Honest framing: Ariadne is demonstrated multimodal (text + audio + relational) today; "video" is a documented, criteria-gated next step rather than a shipped box-check.

Sources

  • Enron / WorldSpeech / Lahman dataset cards on the HF Hub (linked from the adapters in src/ariadne/datasets/).
  • HF "most downloads" video listing (2026-06): robotics / gesture / training / benchmark dominated — surveyed during selection.