Corpus & Caching¶
Why corpus stuffing instead of RAG¶
Anthropic's Contextual Retrieval guidance is explicit:
> If your knowledge base is smaller than 200,000 tokens (about 500 pages of material), you can include the entire knowledge base in the prompt with prompt caching, which makes this approach significantly faster and more cost-effective.
Per-domain, our corpus sits at:
| Domain | Markdown chars (after stripping) | Tokens (Opus) | Tokens (Sonnet) | Tokens (Haiku) |
|---|---|---|---|---|
| Visa | 48,426 | 18,000 | 18,000 | 13,000 |
| Claude | 1,621,458 | 540,000 | 540,000 | 399,000 |
| HackerRank | 1,449,137 | 580,000 | 580,000 | 388,000 |
All three fit comfortably in Opus 4.7's and Sonnet 4.6's 1M-token context; Visa fits everywhere. Haiku 4.5's 200K context can't hold the Claude or HackerRank corpora, which is why the default model is Sonnet 4.6.
How the corpus is shaped¶
code/corpus.py does three transformations:
1. Strip YAML frontmatter¶
Every doc page in the starter's data/ ships a frontmatter block (title, source URL, last-modified timestamp, breadcrumbs). The model doesn't need any of it.
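Frontmatter stripping needs only one anchored regex. A minimal sketch, assuming the block is delimited by `---` lines at the top of the file; the names and details here are illustrative, not necessarily what `code/corpus.py` does:

```python
import re

# Match a YAML frontmatter block anchored to the start of the file:
# an opening ---, any lines (lazily), and a closing --- line.
_RE_FRONTMATTER = re.compile(r"\A---\n.*?\n---\n", re.DOTALL)

def strip_frontmatter(text: str) -> str:
    """Drop a leading YAML frontmatter block, if present."""
    return _RE_FRONTMATTER.sub("", text, count=1)

page = "---\ntitle: Example\nsource: https://example.com\n---\n# Body\n"
# strip_frontmatter(page) -> "# Body\n"
```

Anchoring with `\A` matters: a stray `---` horizontal rule in the middle of a page should not be mistaken for frontmatter.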
2. Strip image markup¶
import re

_RE_IMG_MD = re.compile(r"!\[[^\]]*\]\([^)]*\)")
_RE_IMG_HTML = re.compile(r"<img\b[^>]*/?>", re.IGNORECASE)
Both markdown images (`![alt](url)`) and HTML tags like `<img src="../..." class="kb-image" />` get replaced with `[image]`. Signed image URLs alone were 200+ tokens each; stripping them saved 153 KB on HR.
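Applied in sequence, the two patterns above reduce any image markup to a constant placeholder. A runnable sketch (the `strip_images` wrapper name is illustrative):

```python
import re

_RE_IMG_MD = re.compile(r"!\[[^\]]*\]\([^)]*\)")
_RE_IMG_HTML = re.compile(r"<img\b[^>]*/?>", re.IGNORECASE)

def strip_images(text: str) -> str:
    """Replace markdown and HTML image markup with a short placeholder."""
    text = _RE_IMG_MD.sub("[image]", text)
    return _RE_IMG_HTML.sub("[image]", text)

sample = (
    "See ![screenshot](https://cdn.example.com/x.png?Signature=abcdef) "
    'and <IMG src="a.png" class="kb-image" />.'
)
# strip_images(sample) -> "See [image] and [image]."
```

The `IGNORECASE` flag covers `<IMG>` as well as `<img>`, and the `[^)]*` URL match is what makes a 200-token signed URL collapse into seven characters.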
3. Strip dev-facing HR subdirs¶
integrations/ is API integration docs (Greenhouse, Lever, Workable connectors) and library/ is question-authoring docs — neither is relevant for customer-support triage. Excluding them dropped HR by 512K tokens, well under the 1M context cap with headroom for system prompt + ticket + tool schema + output.
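The exclusion itself is just a filtered directory walk. A minimal sketch, assuming the corpus lives as `.md` files under a per-domain root; the real `corpus.py` may organize this differently:

```python
from pathlib import Path

# Dev-facing HR subdirectories to skip (from the text above).
EXCLUDED_SUBDIRS = {"integrations", "library"}

def iter_corpus_files(root: Path, excluded: set[str]) -> list[Path]:
    """Every .md file under root, except those inside an excluded subdir."""
    return sorted(
        p for p in root.rglob("*.md")
        # Drop the file if any directory component of its relative
        # path is in the excluded set.
        if not (set(p.relative_to(root).parts[:-1]) & excluded)
    )
```

Checking every path component (not just the top level) also catches nested copies like `docs/integrations/`.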
How caching engages¶
Every API call sends two cached blocks in the system slot:
system_blocks = [
    {
        "type": "text",
        "text": SYSTEM_PROMPT,  # ~5K tokens
        "cache_control": {"type": "ephemeral", "ttl": "1h"},
    },
    {
        "type": "text",
        "text": f"<corpus>\n{corpus_blob}\n</corpus>",  # 18K-580K tokens
        "cache_control": {"type": "ephemeral", "ttl": "1h"},
    },
]
With the `anthropic-beta: extended-cache-ttl-2025-04-11` header, the cache TTL is 1 hour instead of the 5-minute default Anthropic shipped on March 6, 2026. The 1h variant costs 2× input on cache writes (vs 1.25× for the 5-minute variant), but reads stay at 0.1×, so it pays for itself after roughly two reads of the same prefix within the hour, which is trivially true within a 4-minute production batch.
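The break-even claim is simple arithmetic. A sketch in input-token multiples for `n` calls on one cached prefix, comparing the 1h TTL against the worst case for the 5-minute variant (every call re-writes an expired cache); the multipliers are the ones quoted above:

```python
def cost_1h(n: int) -> float:
    """One 2x cache write, then 0.1x reads for the remaining calls."""
    return 2.0 + (n - 1) * 0.1

def cost_5min_worst(n: int) -> float:
    """Every call re-writes the cache at 1.25x (gaps longer than 5 min)."""
    return n * 1.25

# cost_1h(2) = 2.1 vs cost_5min_worst(2) = 2.5: with calls spread across
# an hour, the 1h TTL wins from the second call on, and the gap widens
# with every additional read.
```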
Cost shape per domain (Opus 4.7, sync)¶
For the production batch run on 29 unlabeled tickets:
| Domain | Tickets | Cache write (1h TTL @ $18.75/MT) | Cache reads (@ $1.50/MT) | Output tokens | Total |
|---|---|---|---|---|---|
| Visa | 6 | 18K → ~$0.34 | 5 × 18K × $1.50/MT = $0.13 | minimal | ~$0.50 |
| Claude | 7 | 540K → ~$10.13 | 6 × 540K × $1.50/MT = $4.86 | minimal | ~$15.00 |
| HackerRank | 14 | 580K → ~$10.88 | 13 × 580K × $1.50/MT = $11.31 | minimal | ~$22.20 |
| None | 2 | no corpus | — | minimal | ~$0.10 |
Sync total: ~$38. With --batch (50% off across all token types): ~$19. The actual production batch closed at $20 measured in the Anthropic Console — close to the prediction.
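Rechecking the table's arithmetic (per-row figures rounded as in the table; output-token cost is folded into the rounded totals):

```python
# Per-domain totals from the table above: cache write + cache reads.
visa = 0.34 + 0.13          # ~$0.47, rounded with output to ~$0.50
claude = 10.13 + 4.86       # ~$14.99 -> ~$15.00
hackerrank = 10.88 + 11.31  # ~$22.19 -> ~$22.20
none_domain = 0.10          # no corpus, negligible

sync_total = 0.50 + 15.00 + 22.20 + 0.10  # ~$37.80, quoted as ~$38
batch_total = sync_total * 0.5            # 50% batch discount -> ~$18.90
```

The $20 measured in the Console lands within a dollar of the ~$19 prediction.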
Group-by-domain dispatch¶
main.py groups tickets by company before running:
by_company: dict[str | None, list[tuple[int, TicketInput]]] = defaultdict(list)
for i, t in enumerate(rows):
    by_company[normalize_company(t.company)].append((i, t))
It then iterates each domain in turn, so all tickets sharing a corpus prefix arrive at the API back-to-back: the first call writes the cache, the rest read it. Original input order is preserved in the output by writing `results[i]` keyed to the original CSV row.
For --batch mode the same flatten-then-dispatch pattern applies: all 29 requests go in one batch, and the batch processor still gets cache hits because each request's cache_control prefix is identical within a domain group. Wall-time was ~4 minutes in practice.
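For `--batch`, the grouped iteration just flattens into one ordered request list. A hedged sketch of that assembly: the field names follow the general Message Batches request shape, but the model id, `max_tokens`, and helper names are assumptions, not the project's code:

```python
def build_batch_requests(by_company, system_blocks_for, max_tokens=2048):
    """Flatten domain groups into one batch request list, group by group."""
    requests = []
    for company, group in by_company.items():
        # Identical cached system prefix for every request in the group.
        system = system_blocks_for(company)
        for i, ticket in group:
            requests.append({
                "custom_id": f"row-{i}",  # ties the result back to the CSV row
                "params": {
                    "model": "claude-sonnet-4-6",  # assumed model id
                    "max_tokens": max_tokens,
                    "system": system,
                    "messages": [{"role": "user", "content": ticket}],
                },
            })
    return requests
```

Because each group's requests are contiguous and share a byte-identical `system` prefix, the batch processor can reuse the cache within each domain group.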
What we did not do¶
- No vector index. Skipped Pinecone, Weaviate, Qdrant, pgvector — none of which would help at this corpus size.
- No chunking heuristics. The `<doc path="...">` wrapper is the only structure we add; the model gets the whole file.
- No retrieval call inside the agent. `submit_triage` is the only tool. There is no `search_corpus` or `lookup_url`.