Claude Prompt Caching

How Caching Works

Prompt caching stores a computed prefix on Anthropic's servers for reuse.
The first request with a cache_control breakpoint writes the cache — it costs slightly more.
Subsequent requests that match the same prefix read from cache — they cost 10% of normal input price.
Cached prefixes expire after 5 minutes of inactivity — reset by any request that hits them.
Cache is per-model and per-API-key — different models or keys never share a cache.
Minimum cacheable size: 1,024 tokens (Haiku 4.5 requires 2,048).

Add Cache Breakpoints

Add {"type": "ephemeral"} to any content block's cache_control field.

message = client.messages.create(
    model="claude-opus-4-8",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "<large system prompt or document here>",
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[{"role": "user", "content": "Summarise the key points."}]
)

In TypeScript, the structure is identical.

const message = await client.messages.create({
    model: "claude-opus-4-8",
    max_tokens: 1024,
    system: [{
        type: "text",
        text: largeDocument,
        cache_control: { type: "ephemeral" },
    }],
    messages: [{ role: "user", content: "What are the key risks?" }],
});

You can add breakpoints at multiple positions — up to 4 cache breakpoints per request.

# Example: cache system prompt AND a large tool schema separately
system = [{"type": "text", "text": system_prompt, "cache_control": {"type": "ephemeral"}}]
tools = [
    {"name": "search", "description": "...", "input_schema": {...}, "cache_control": {"type": "ephemeral"}},
]

Place breakpoints after the longest stable prefix — everything up to the mark is cached.
Do not add cache_control to content that changes between requests — it will never hit.

Verify Cache Hits

Check usage in the response to see cache write and read token counts.

print(message.usage.cache_creation_input_tokens)  # tokens written to cache
print(message.usage.cache_read_input_tokens)       # tokens read from cache
print(message.usage.input_tokens)                 # uncached input tokens

A cache miss on the first request: cache_creation_input_tokens > 0, cache_read_input_tokens == 0.
A cache hit on later requests: cache_read_input_tokens > 0, cache_creation_input_tokens == 0.
If cache_read_input_tokens stays at 0, the prefix likely changed or fell below the minimum size.
Price breakdown for prompt caching (per million tokens):

Token type Opus 4.8 price

Normal input $5.00

Cache write $6.25 (1.25×)

Cache read $0.50 (0.10×)

Output $25.00

Token type	Opus 4.8 price
Normal input	$5.00
Cache write	$6.25 (1.25×)
Cache read	$0.50 (0.10×)
Output	$25.00

Caching Strategies

Cache large, stable documents — legal text, codebases, long system prompts — before each query.

# Load a 10,000-token document once; each subsequent question reuses the cache
system = [{"type": "text", "text": long_doc, "cache_control": {"type": "ephemeral"}}]
for question in user_questions:
    msg = client.messages.create(model="claude-opus-4-8", max_tokens=512,
                                 system=system, messages=[{"role": "user", "content": question}])

Cache tool definitions for apps with large, fixed tool sets — tool schemas count toward the prefix.

In multi-turn conversations, cache the growing conversation history to reduce re-processing costs.

# Mark the last assistant turn with cache_control to cache everything above it
messages[-1]["cache_control"] = {"type": "ephemeral"}

Keep the dynamic part (user query) after the last breakpoint — it is never cached.
Re-issue the same request every 4 minutes to refresh the 5-minute TTL before cache expiry.

Tip: Measure cache_read_input_tokens in your logs to confirm caching is active before optimising further.

Warning: If you modify the system prompt between requests — even by one token — the cache misses and you pay the write cost again.