
Corpus & Caching

Why corpus stuffing instead of RAG

Anthropic's Contextual Retrieval guidance is explicit:

If your knowledge base is smaller than 200,000 tokens (about 500 pages of material), you can include the entire knowledge base in the prompt with prompt caching, which makes this approach significantly faster and more cost-effective.

Per domain, our corpus sits at:

| Domain | Markdown chars (after stripping) | Tokens (Opus) | Tokens (Sonnet) | Tokens (Haiku) |
| --- | --- | --- | --- | --- |
| Visa | 48,426 | 18,000 | 18,000 | 13,000 |
| Claude | 1,621,458 | 540,000 | 540,000 | 399,000 |
| HackerRank | 1,449,137 | 580,000 | 580,000 | 388,000 |

All three corpora fit comfortably in Opus 4.7's and Sonnet 4.6's 1M-token context. Visa fits everywhere, but Haiku 4.5's 200K context can't hold the Claude or HackerRank corpus, which is why the default model is Sonnet 4.6.
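
The per-model counts above can be reproduced with the token-counting endpoint. A minimal sketch, assuming the stripped corpus blob is already in hand (corpus_tokens is an illustrative helper, not project code):

import anthropic

client = anthropic.Anthropic()

def corpus_tokens(corpus_blob: str, model: str) -> int:
    # Count the tokens the corpus occupies under a given model's tokenizer.
    # The count includes a few tokens for the placeholder user message.
    result = client.messages.count_tokens(
        model=model,
        system=f"<corpus>\n{corpus_blob}\n</corpus>",
        messages=[{"role": "user", "content": "placeholder"}],
    )
    return result.input_tokens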

How the corpus is shaped

code/corpus.py does three transformations:

1. Strip YAML frontmatter

Every doc page in the starter's data/ ships a frontmatter block (title, source URL, last-modified timestamp, breadcrumbs). The model doesn't need any of it.
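
Stripping the block can be a simple prefix check; a minimal sketch, assuming the frontmatter is delimited by --- lines at the top of each file (strip_frontmatter is illustrative, not the actual corpus.py code):

def strip_frontmatter(text: str) -> str:
    # Drop a leading YAML block delimited by "---" lines, if present.
    if text.startswith("---"):
        end = text.find("\n---", 3)
        if end != -1:
            # Resume just past the closing delimiter line.
            return text[end + len("\n---"):].lstrip("\n")
    return text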

2. Strip image markup

_RE_IMG_MD   = re.compile(r"!\[[^\]]*\]\([^)]*\)")
_RE_IMG_HTML = re.compile(r"<img\b[^>]*/?>", re.IGNORECASE)

Both markdown ![alt](https://signed-url-with-200-tokens-of-params) and HTML <img src="../..." class="kb-image" /> get replaced with [image]. Signed image URLs alone were 200+ tokens each. Stripping them saved 153 KB on HR.
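
Applying both patterns is then a pair of substitutions; a sketch using the two regexes above (strip_images is an illustrative name):

def strip_images(text: str) -> str:
    # Replace markdown and HTML image markup with a short placeholder.
    text = _RE_IMG_MD.sub("[image]", text)
    text = _RE_IMG_HTML.sub("[image]", text)
    return text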

3. Strip dev-facing HR subdirs

HR_EXCLUDE_SUBDIRS = {"integrations", "library"}

integrations/ is API integration docs (Greenhouse, Lever, Workable connectors) and library/ is question-authoring docs; neither is relevant to customer-support triage. Excluding them dropped HR by 512K tokens, bringing it well under the 1M context cap with headroom for the system prompt, ticket, tool schema, and output.
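
The exclusion itself is a path check during corpus assembly. A sketch under assumptions: docs live under data/hackerrank/, and the <doc path="..."> wrapper matches the convention described later on this page (build_hr_corpus is illustrative):

from pathlib import Path

HR_EXCLUDE_SUBDIRS = {"integrations", "library"}

def build_hr_corpus(root: Path) -> str:
    docs = []
    for path in sorted(root.rglob("*.md")):
        rel = path.relative_to(root)
        # Skip dev-facing subtrees that don't help support triage.
        if rel.parts[0] in HR_EXCLUDE_SUBDIRS:
            continue
        docs.append(f'<doc path="{rel}">\n{path.read_text()}\n</doc>')
    return "\n".join(docs)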

How caching engages

Every API call sends two cached blocks in the system slot:

system_blocks = [
    {
        "type": "text",
        "text": SYSTEM_PROMPT,                                # ~5K tokens
        "cache_control": {"type": "ephemeral", "ttl": "1h"},
    },
    {
        "type": "text",
        "text": f"<corpus>\n{corpus_blob}\n</corpus>",        # 18K-580K tokens
        "cache_control": {"type": "ephemeral", "ttl": "1h"},
    },
]

With the anthropic-beta: extended-cache-ttl-2025-04-11 header, the cache TTL is 1 hour instead of the default 5 minutes. The 1h variant costs 2× input on cache writes (vs 1.25× on the 5-min variant), but reads stay at 0.1×, so relative to uncached input it pays for itself after about two reads of the same prefix within the hour. A 4-minute production batch clears that bar trivially.
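
Through the Python SDK the header rides along as an extra header; a sketch assuming the system_blocks above and an example ticket (the model ID is illustrative):

import anthropic

client = anthropic.Anthropic()
ticket_text = "Customer can't log in to the assessment portal."  # example ticket

response = client.messages.create(
    model="claude-sonnet-4-6",  # illustrative model ID
    max_tokens=1024,
    system=system_blocks,  # both blocks carry cache_control with ttl="1h"
    messages=[{"role": "user", "content": ticket_text}],
    # Opt in to the 1-hour cache TTL beta.
    extra_headers={"anthropic-beta": "extended-cache-ttl-2025-04-11"},
)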

Cost shape per domain (Opus 4.7, sync)

For the production batch run on 29 unlabeled tickets:

| Domain | Tickets | Cache write ($18.75/MT) | Cache reads ($1.50/MT) | Output tokens | Total |
| --- | --- | --- | --- | --- | --- |
| Visa | 6 | 18K → ~$0.34 | 5 × 18K → $0.13 | minimal | ~$0.50 |
| Claude | 7 | 540K → ~$10.13 | 6 × 540K → $4.86 | minimal | ~$15.00 |
| HackerRank | 14 | 580K → ~$10.88 | 13 × 580K → $11.31 | minimal | ~$22.20 |
| None | 2 | no corpus | n/a | minimal | ~$0.10 |

Sync total: ~$38. With --batch (50% off across all token types): ~$19. The actual production batch closed at $20 as measured in the Anthropic Console, close to the prediction.
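
The per-domain arithmetic is one write plus (tickets - 1) reads; a sketch with the table's rates hard-coded (domain_cost is an illustrative helper):

CACHE_WRITE_PER_MT = 18.75  # $/MT cache-write rate used in the table
CACHE_READ_PER_MT = 1.50    # $/MT, 0.1x of base input

def domain_cost(corpus_tokens: int, tickets: int) -> float:
    # One cache write for the first ticket, cache reads for the rest.
    write = corpus_tokens * CACHE_WRITE_PER_MT / 1_000_000
    reads = (tickets - 1) * corpus_tokens * CACHE_READ_PER_MT / 1_000_000
    return write + reads

print(round(domain_cost(580_000, 14), 2))  # HackerRank: ~22.19 before output tokens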

Group-by-domain dispatch

main.py groups tickets by company before running:

from collections import defaultdict

# Pair each ticket with its original row index so output order can be restored.
by_company: dict[str | None, list[tuple[int, TicketInput]]] = defaultdict(list)
for i, t in enumerate(rows):
    by_company[normalize_company(t.company)].append((i, t))

It then iterates each domain in turn, so all tickets sharing a corpus prefix arrive at the API back-to-back: the first call writes the cache and the rest read it. Original input order is preserved in the output by writing results[i], indexed by the original CSV row, as sketched below.
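
A sketch of the dispatch loop, assuming load_corpus and triage_ticket helpers that wrap corpus loading and the API call (both names are illustrative):

results: list[dict | None] = [None] * len(rows)
for company, group in by_company.items():
    corpus = load_corpus(company)  # None for tickets with no recognized domain
    for i, ticket in group:
        # The first request per domain writes the cache; the rest read it.
        results[i] = triage_ticket(ticket, corpus)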

For --batch mode the same flatten-then-dispatch pattern applies: all 29 requests go into one batch, and the batch processor still gets cache hits because each request's cache_control prefix is identical within a domain group. Wall time was ~4 minutes in practice.
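
A sketch of the batch submission, assuming a build_params(ticket, company) helper that returns per-request params carrying the shared system_blocks prefix (names are illustrative):

batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"ticket-{i}",
            "params": build_params(ticket, company),
        }
        # Walk domain groups so same-prefix requests stay adjacent.
        for company, group in by_company.items()
        for i, ticket in group
    ],
    # The TTL beta header applies to batch requests too.
    extra_headers={"anthropic-beta": "extended-cache-ttl-2025-04-11"},
)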

What we did not do

  • No vector index. Skipped Pinecone, Weaviate, Qdrant, pgvector — none of which would help at this corpus size.
  • No chunking heuristics. The <doc path="..."> wrapper is the only structure we add; the model gets the whole file.
  • No retrieval call inside the agent. submit_triage is the only tool. There is no search_corpus or lookup_url.