Intermediate

Claude Prompt Caching

A quick reference for adding cache control breakpoints to reduce latency and cost on repeated Claude prompts.

TL;DR
  1. Add cache_control breakpoints to large, reusable prompt prefixes.
  2. Cached tokens cost 10% of normal input price on re-use.
  3. Check usage.cache_read_input_tokens to confirm cache hits.

How Caching Works

  • Prompt caching stores a computed prefix on Anthropic's servers for reuse.

  • The first request with a cache_control breakpoint writes the cache; the written tokens are billed at 1.25× the normal input price.

  • Subsequent requests that match the same prefix read from cache — they cost 10% of normal input price.

  • Cached prefixes expire after 5 minutes of inactivity — reset by any request that hits them.

  • Caches are scoped to a model and to your organization, not to an individual API key: different models or organizations never share a cache.

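  • Minimum cacheable size: 1,024 tokens (Haiku 4.5 requires 2,048). Shorter prefixes are processed without caching and return no error, so confirm the minimum for your model.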
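The pricing rules above can be turned into a quick estimate of when caching pays off. The sketch below is illustrative only: the function name prefix_cost is hypothetical, the prices are the Opus 4.8 figures from the price table in this reference, and it assumes every request after the first arrives before the cache expires.

```python
# Prices in USD per million input tokens, copied from the price table
# in this reference (Opus 4.8 column).
BASE_INPUT = 5.00
CACHE_WRITE = 6.25   # 1.25x the normal input price
CACHE_READ = 0.50    # 0.10x the normal input price


def prefix_cost(prefix_tokens: int, requests: int, cached: bool) -> float:
    """Estimated dollar cost of sending the same prefix `requests` times.

    Hypothetical helper. Assumes one cache write on the first request and
    a cache read on every later request (no expiry in between).
    """
    millions = prefix_tokens / 1_000_000
    if not cached or requests == 0:
        return millions * BASE_INPUT * requests
    return millions * (CACHE_WRITE + CACHE_READ * (requests - 1))
```

Under these assumptions a 10,000-token prefix costs about $0.0625 on the first cached request versus $0.05 uncached, but about $0.1075 across ten requests versus $0.50 uncached, so the write premium is recovered by the second request.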

Add Cache Breakpoints

  • Add {"type": "ephemeral"} to any content block's cache_control field.

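    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    message = client.messages.create(
        model="claude-opus-4-8",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": "<large system prompt or document here>",
                # Breakpoint: everything up to and including this block is cached
                "cache_control": {"type": "ephemeral"}
            }
        ],
        messages=[{"role": "user", "content": "Summarise the key points."}]
    )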
    
  • In TypeScript, the structure is identical.

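    import Anthropic from "@anthropic-ai/sdk";

    const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

    const message = await client.messages.create({
        model: "claude-opus-4-8",
        max_tokens: 1024,
        system: [{
            type: "text",
            text: largeDocument, // your large, stable document string
            cache_control: { type: "ephemeral" },
        }],
        messages: [{ role: "user", content: "What are the key risks?" }],
    });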
    
  • You can add breakpoints at multiple positions — up to 4 cache breakpoints per request.

    # Example: cache system prompt AND a large tool schema separately
    system = [{"type": "text", "text": system_prompt, "cache_control": {"type": "ephemeral"}}]
    tools = [
        {"name": "search", "description": "...", "input_schema": {...}, "cache_control": {"type": "ephemeral"}},
    ]
    
  • Place breakpoints after the longest stable prefix — everything up to the mark is cached.

  • Do not add cache_control to content that changes between requests — it will never hit.
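Because a request allows at most 4 breakpoints, it can help to count them before sending. The following is a minimal local check, not part of the SDK: count_breakpoints and check_breakpoints are hypothetical names, and the sketch assumes the request parts are plain Python dicts and lists shaped like the examples above.

```python
MAX_BREAKPOINTS = 4  # per-request limit stated in this reference


def _marked(blocks) -> int:
    """Count dict blocks in `blocks` that carry a cache_control key."""
    if not isinstance(blocks, list):  # e.g. a plain-string system prompt
        return 0
    return sum(1 for b in blocks if isinstance(b, dict) and "cache_control" in b)


def count_breakpoints(system=None, tools=None, messages=None) -> int:
    """Total breakpoints across system blocks, tools, and message content."""
    total = _marked(system) + _marked(tools)
    for message in messages or []:
        # String content cannot carry a breakpoint, so _marked returns 0 for it
        total += _marked(message.get("content"))
    return total


def check_breakpoints(system=None, tools=None, messages=None) -> int:
    """Return the breakpoint count, or raise if it exceeds the limit."""
    n = count_breakpoints(system, tools, messages)
    if n > MAX_BREAKPOINTS:
        raise ValueError(f"{n} cache breakpoints found; the limit is {MAX_BREAKPOINTS}")
    return n
```

Calling check_breakpoints(system=system, tools=tools, messages=messages) just before client.messages.create surfaces the problem locally instead of as an API error.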

Verify Cache Hits

  • Check usage in the response to see cache write and read token counts.

    print(message.usage.cache_creation_input_tokens)  # tokens written to cache
    print(message.usage.cache_read_input_tokens)       # tokens read from cache
    print(message.usage.input_tokens)                 # uncached input tokens
    
  • A cache miss on the first request: cache_creation_input_tokens > 0, cache_read_input_tokens == 0.

  • A cache hit on later requests: cache_read_input_tokens > 0, cache_creation_input_tokens == 0.

  • If cache_read_input_tokens stays at 0, the prefix likely changed or fell below the minimum size.

  • Price breakdown for prompt caching (per million tokens):

    Token type      Opus 4.8 price
    Normal input    $5.00
    Cache write     $6.25 (1.25×)
    Cache read      $0.50 (0.10×)
    Output          $25.00
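The hit and miss rules in this section can be folded into a small logging helper. This is a sketch under assumptions: classify_usage and cached_fraction are hypothetical names, and the usage data is assumed to have been converted to a plain dict with the three field names shown above.

```python
def classify_usage(usage: dict) -> str:
    """Label one response's cache behaviour from its usage counters."""
    created = usage.get("cache_creation_input_tokens") or 0
    read = usage.get("cache_read_input_tokens") or 0
    if read > 0 and created > 0:
        return "partial"  # an existing prefix was read and a longer one written
    if read > 0:
        return "hit"
    if created > 0:
        return "write"
    return "none"  # nothing cached: prefix changed, too short, or no breakpoint


def cached_fraction(usage: dict) -> float:
    """Share of all input tokens that were served from cache."""
    created = usage.get("cache_creation_input_tokens") or 0
    read = usage.get("cache_read_input_tokens") or 0
    uncached = usage.get("input_tokens") or 0
    total = created + read + uncached
    return read / total if total else 0.0
```

Logging both values per request makes it easy to spot a prefix that keeps producing "write" or "none" when "hit" was expected.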

Caching Strategies

  • Cache large, stable documents (legal text, codebases, long system prompts) once, then reuse the cached prefix for every query.

    # Load a 10,000-token document once; each subsequent question reuses the cache
    system = [{"type": "text", "text": long_doc, "cache_control": {"type": "ephemeral"}}]
    for question in user_questions:
        msg = client.messages.create(model="claude-opus-4-8", max_tokens=512,
                                     system=system, messages=[{"role": "user", "content": question}])
    
  • Cache tool definitions for apps with large, fixed tool sets — tool schemas count toward the prefix.

  • In multi-turn conversations, cache the growing conversation history to reduce re-processing costs.

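    # cache_control belongs on a content block, not on the message itself.
    # Mark the final block of the last message to cache everything up to it
    # (assumes that message's content is a list of blocks, not a plain string).
    messages[-1]["content"][-1]["cache_control"] = {"type": "ephemeral"}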
    
  • Keep the dynamic part (user query) after the last breakpoint — it is never cached.

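  • To keep a cache warm through idle periods, re-send the same prefix about every 4 minutes so the 5-minute TTL never lapses; each keep-alive request is still billed at the cache-read rate plus any output tokens.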
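For the multi-turn strategy above, one possible approach is to move a single breakpoint to the end of the history on every turn. The helper below is a sketch, not SDK functionality: with_moving_breakpoint is a hypothetical name, and it assumes messages are plain dicts whose content is either a string or a list of block dicts.

```python
import copy


def with_moving_breakpoint(messages: list) -> list:
    """Return a copy of `messages` with one breakpoint on the final block.

    Hypothetical helper: clears older message-level breakpoints so the
    request stays under the breakpoint limit, then marks the last block.
    """
    out = copy.deepcopy(messages)  # leave the caller's history untouched
    for message in out:
        if isinstance(message["content"], str):
            # A plain string cannot carry cache_control, so wrap it in a text block
            message["content"] = [{"type": "text", "text": message["content"]}]
        for block in message["content"]:
            block.pop("cache_control", None)
    if out and out[-1]["content"]:
        out[-1]["content"][-1]["cache_control"] = {"type": "ephemeral"}
    return out
```

Passing messages=with_moving_breakpoint(history) on each turn uses one breakpoint for the conversation and leaves the others free for the system prompt and tools. Apply the helper on every request so the history is always serialised the same way.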

Tip: Measure cache_read_input_tokens in your logs to confirm caching is active before optimising further.

Warning: If you modify the system prompt between requests — even by one token — the cache misses and you pay the write cost again.
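One way to catch accidental prefix changes is to log a fingerprint of the cached portion with each request. This is a local diagnostic sketch only: prefix_fingerprint is a hypothetical name, and the hash has no relation to how Anthropic identifies cached prefixes.

```python
import hashlib
import json


def prefix_fingerprint(system, tools=None) -> str:
    """Short, stable digest of the parts of a request meant to be cached.

    Hypothetical helper. Keys are deliberately not sorted, so any change
    in content or ordering produces a different fingerprint.
    """
    payload = json.dumps({"system": system, "tools": tools or []}, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
```

If the fingerprint differs between two requests that were expected to share a cache, the prefix changed on the client side, for example through a timestamp or a reordered tool list.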
