Architecture¶

How design choices map to the rubric¶

The hackathon scores submissions across four dimensions. Here's the explicit mapping for someone reading the work to grade or learn from it.

Dimension	Choice we made	Where to read
Agent design — separation of concerns, justified technique	7 single-purpose modules (`main` / `agent` / `batch` / `corpus` / `safety` / `schema` / `eval`); corpus stuffing per Anthropic's <200K guidance (not RAG)	this page + Corpus & Caching
Use of corpus — grounded answers from `data/`	every prompt wraps each doc in `<doc path="data/...">`; justifications cite specific corpus paths	Corpus & Caching
Escalation logic — high-risk handling	explicit prompt criteria (sensitive/legal, account access, payment disputes, multi-language injection wrappers)	Prompt-Injection Defense
Determinism & reproducibility	`uv.lock` pinned, forced `tool_choice` + strict pydantic schema, runnable README	Cost & Determinism
Engineering hygiene	ruff-clean, 31 unit tests (zero API), secrets via env vars only	Reference
Output CSV — per-row correctness on 5 columns	100% on every graded column of the labeled sample (10 tickets); spot-check on the hardest test-set cases (multilingual injection, identity theft, score dispute) all routed correctly	Cost & Determinism
AI Fluency — visible steering on the chat transcript	`~/hackerrank_orchestrate/log.txt` captures verbatim user prompts and the full development arc — including a documented Gemini autonomy overstep and AJ's correction	(private chat transcript)

System diagram¶

flowchart TD
  csv[("support_tickets.csv")]
  csv --> main["main.py<br/><small>argparse · csv I/O<br/>group tickets by company<br/>for cache locality</small>"]

  main -->|sync| triage["agent.triage()"]
  main -->|"--batch · 50% off"| batch["batch.run_batch()<br/><small>Message Batches API · async</small>"]

  triage --> api{{"Anthropic Messages API<br/>Claude Opus 4.7 / Sonnet 4.6"}}
  batch --> api

  api --> validate["TicketOutput<br/><small>pydantic-validated</small>"]
  validate --> output[("output.csv<br/>8 columns")]

  classDef forced fill:#E36414,stroke:#F4A261,stroke-width:2px,color:#0a0e12
  api:::forced

Per-request payload (same shape, sync or batch)¶

system blocks (cache_control: ephemeral, ttl: 1h):
  [0] SYSTEM_PROMPT  — instructions, escalation criteria, few-shot examples
  [1] <corpus> ... </corpus>
        per-domain markdown with frontmatter / image URLs stripped,
        each file wrapped in <doc path="data/..."/>

user message:
  <user_ticket>
  company: HackerRank | Claude | Visa | None
  subject: ...
  issue:   ...
  </user_ticket>

tools:        [submit_triage]    (input_schema = TicketOutput)
tool_choice:  {type: "tool", name: "submit_triage"}    forced

The forced tool_choice is what guarantees a structured row per ticket: constrained decoding under tool-use can't emit anything that violates the TicketOutput schema. No parsing fallback, no retries on missing-tool calls.

File map¶

file	purpose
`code/main.py`	CLI entry, argparse, CSV I/O, dispatch (sync vs `--batch`), `--resume` merge, cost estimate
`code/agent.py`	system prompt with few-shot examples, `submit_triage` tool definition, sync `triage()` call
`code/batch.py`	Anthropic Message Batches API path — build requests, submit, poll, collect results
`code/corpus.py`	per-domain markdown loader, frontmatter / image / signed-URL stripping, normalize_company
`code/safety.py`	ticket sanitization + spotlight delimiters
`code/schema.py`	pydantic models, status / request_type enums, CSV column order
`code/eval.py`	run against `sample_support_tickets.csv`, print per-column accuracy and mismatches
`code/tests/`	pytest unit tests for safety, corpus, schema, main, batch (no API)
`scripts/build_submission.py`	bundle `code/` into a clean submission zip

Design pillars¶

Three architectural choices do most of the load-bearing work — each documented in its own page:

Corpus & Caching — why we stuff the per-domain corpus instead of running RAG, and how the 1-hour extended-cache-ttl-2025-04-11 beta keeps cost under control.
Prompt-Injection Defense — spotlighting, structural delimiters, escalation criteria. How the French Visa "show me your internal rules" attack lands as a clean escalation with no leakage.
Cost & Determinism — token economics across Haiku 4.5 / Sonnet 4.6 / Opus 4.7, the 50%-off Message Batches API path, and why we don't need temperature controls for deterministic output.