Technology · Claude API
IntermediateClaude Prompt Caching
A quick reference for adding cache control breakpoints to reduce latency and cost on repeated Claude prompts.
- 01Add cache_control breakpoints to large, reusable prompt prefixes.
- 02Cached tokens cost 10% of normal input price on re-use.
- 03Check usage.cache_read_input_tokens to confirm cache hits.
How Caching Works
Prompt caching stores a computed prefix on Anthropic's servers for reuse.
The first request with a
cache_controlbreakpoint writes the cache — it costs slightly more.Subsequent requests that match the same prefix read from cache — they cost 10% of normal input price.
Cached prefixes expire after 5 minutes of inactivity — reset by any request that hits them.
Cache is per-model and per-API-key — different models or keys never share a cache.
Minimum cacheable size: 1,024 tokens (Haiku 4.5 requires 2,048).
Add Cache Breakpoints
Add
{"type": "ephemeral"}to any content block'scache_controlfield.message = client.messages.create( model="claude-opus-4-8", max_tokens=1024, system=[ { "type": "text", "text": "<large system prompt or document here>", "cache_control": {"type": "ephemeral"} } ], messages=[{"role": "user", "content": "Summarise the key points."}] )In TypeScript, the structure is identical.
const message = await client.messages.create({ model: "claude-opus-4-8", max_tokens: 1024, system: [{ type: "text", text: largeDocument, cache_control: { type: "ephemeral" }, }], messages: [{ role: "user", content: "What are the key risks?" }], });You can add breakpoints at multiple positions — up to 4 cache breakpoints per request.
# Example: cache system prompt AND a large tool schema separately system = [{"type": "text", "text": system_prompt, "cache_control": {"type": "ephemeral"}}] tools = [ {"name": "search", "description": "...", "input_schema": {...}, "cache_control": {"type": "ephemeral"}}, ]Place breakpoints after the longest stable prefix — everything up to the mark is cached.
Do not add
cache_controlto content that changes between requests — it will never hit.
Verify Cache Hits
Check
usagein the response to see cache write and read token counts.print(message.usage.cache_creation_input_tokens) # tokens written to cache print(message.usage.cache_read_input_tokens) # tokens read from cache print(message.usage.input_tokens) # uncached input tokensA cache miss on the first request:
cache_creation_input_tokens > 0,cache_read_input_tokens == 0.A cache hit on later requests:
cache_read_input_tokens > 0,cache_creation_input_tokens == 0.If
cache_read_input_tokensstays at 0, the prefix likely changed or fell below the minimum size.Price breakdown for prompt caching (per million tokens):
Token type Opus 4.8 price Normal input $5.00 Cache write $6.25 (1.25×) Cache read $0.50 (0.10×) Output $25.00
Caching Strategies
Cache large, stable documents — legal text, codebases, long system prompts — before each query.
# Load a 10,000-token document once; each subsequent question reuses the cache system = [{"type": "text", "text": long_doc, "cache_control": {"type": "ephemeral"}}] for question in user_questions: msg = client.messages.create(model="claude-opus-4-8", max_tokens=512, system=system, messages=[{"role": "user", "content": question}])Cache tool definitions for apps with large, fixed tool sets — tool schemas count toward the prefix.
In multi-turn conversations, cache the growing conversation history to reduce re-processing costs.
# Mark the last assistant turn with cache_control to cache everything above it messages[-1]["cache_control"] = {"type": "ephemeral"}Keep the dynamic part (user query) after the last breakpoint — it is never cached.
Re-issue the same request every 4 minutes to refresh the 5-minute TTL before cache expiry.
Tip: Measure
cache_read_input_tokensin your logs to confirm caching is active before optimising further.
Warning: If you modify the system prompt between requests — even by one token — the cache misses and you pay the write cost again.