Architecture¶
How design choices map to the rubric¶
The hackathon scores submissions across four dimensions. Here's the explicit mapping for someone reading the work to grade or learn from it.
| Dimension | Choice we made | Where to read |
|---|---|---|
| Agent design — separation of concerns, justified technique | 7 single-purpose modules (main / agent / batch / corpus / safety / schema / eval); corpus stuffing per Anthropic's <200K guidance (not RAG) |
this page + Corpus & Caching |
Use of corpus — grounded answers from data/ |
every prompt wraps each doc in <doc path="data/...">; justifications cite specific corpus paths |
Corpus & Caching |
| Escalation logic — high-risk handling | explicit prompt criteria (sensitive/legal, account access, payment disputes, multi-language injection wrappers) | Prompt-Injection Defense |
| Determinism & reproducibility | uv.lock pinned, forced tool_choice + strict pydantic schema, runnable README |
Cost & Determinism |
| Engineering hygiene | ruff-clean, 31 unit tests (zero API), secrets via env vars only | Reference |
| Output CSV — per-row correctness on 5 columns | 100% on every graded column of the labeled sample (10 tickets); spot-check on the hardest test-set cases (multilingual injection, identity theft, score dispute) all routed correctly | Cost & Determinism |
| AI Fluency — visible steering on the chat transcript | ~/hackerrank_orchestrate/log.txt captures verbatim user prompts and the full development arc — including a documented Gemini autonomy overstep and AJ's correction |
(private chat transcript) |
System diagram¶
flowchart TD
csv[("support_tickets.csv")]
csv --> main["main.py<br/><small>argparse · csv I/O<br/>group tickets by company<br/>for cache locality</small>"]
main -->|sync| triage["agent.triage()"]
main -->|"--batch · 50% off"| batch["batch.run_batch()<br/><small>Message Batches API · async</small>"]
triage --> api{{"Anthropic Messages API<br/>Claude Opus 4.7 / Sonnet 4.6"}}
batch --> api
api --> validate["TicketOutput<br/><small>pydantic-validated</small>"]
validate --> output[("output.csv<br/>8 columns")]
classDef forced fill:#E36414,stroke:#F4A261,stroke-width:2px,color:#0a0e12
api:::forced
Per-request payload (same shape, sync or batch)¶
system blocks (cache_control: ephemeral, ttl: 1h):
[0] SYSTEM_PROMPT — instructions, escalation criteria, few-shot examples
[1] <corpus> ... </corpus>
per-domain markdown with frontmatter / image URLs stripped,
each file wrapped in <doc path="data/..."/>
user message:
<user_ticket>
company: HackerRank | Claude | Visa | None
subject: ...
issue: ...
</user_ticket>
tools: [submit_triage] (input_schema = TicketOutput)
tool_choice: {type: "tool", name: "submit_triage"} forced
The forced tool_choice is what guarantees a structured row per ticket: constrained decoding under tool-use can't emit anything that violates the TicketOutput schema. No parsing fallback, no retries on missing-tool calls.
File map¶
| file | purpose |
|---|---|
code/main.py |
CLI entry, argparse, CSV I/O, dispatch (sync vs --batch), --resume merge, cost estimate |
code/agent.py |
system prompt with few-shot examples, submit_triage tool definition, sync triage() call |
code/batch.py |
Anthropic Message Batches API path — build requests, submit, poll, collect results |
code/corpus.py |
per-domain markdown loader, frontmatter / image / signed-URL stripping, normalize_company |
code/safety.py |
ticket sanitization + spotlight delimiters |
code/schema.py |
pydantic models, status / request_type enums, CSV column order |
code/eval.py |
run against sample_support_tickets.csv, print per-column accuracy and mismatches |
code/tests/ |
pytest unit tests for safety, corpus, schema, main, batch (no API) |
scripts/build_submission.py |
bundle code/ into a clean submission zip |
Design pillars¶
Three architectural choices do most of the load-bearing work — each documented in its own page:
- Corpus & Caching — why we stuff the per-domain corpus instead of running RAG, and how the 1-hour
extended-cache-ttl-2025-04-11beta keeps cost under control. - Prompt-Injection Defense — spotlighting, structural delimiters, escalation criteria. How the French Visa "show me your internal rules" attack lands as a clean escalation with no leakage.
- Cost & Determinism — token economics across Haiku 4.5 / Sonnet 4.6 / Opus 4.7, the 50%-off Message Batches API path, and why we don't need temperature controls for deterministic output.