The pricing page tells you cost per million tokens. Your actual bill is determined by retry rate, cache hit rate, prefill amortization, and how often a large model handles work a small one could have handled. Provider pricing has dropped 80% in two years, but customers' bills haven't dropped 80% — because the gap between advertised and effective cost is where the engineering lives.
TL;DR — The Five Multipliers
| Multiplier | Typical impact |
|---|---|
| Tier mismatch (large model handling trivial work) | 2-5× |
| Retry overhead (transient failures, parse errors) | 1.1-1.3× |
| Prefill amortization (system prompt on every call) | 1.2-2× |
| Multi-tool round-trips (agent loops) | 1.5-3× |
| Missed cache hits | 1.2-1.5× |
A naive deployment stacks all five. A well-engineered one neutralizes most. The same workload costs 5-10× as much under bad defaults as under good ones — at identical quality.
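A quick sanity check on that range, multiplying illustrative mid-range values for each row of the table (the individual factors here are assumptions, not measurements):

```python
# Illustrative mid-range values for each multiplier; adjust to your own traffic.
multipliers = {
    "tier_mismatch": 2.5,
    "retry_overhead": 1.15,
    "prefill": 1.4,
    "round_trips": 1.6,
    "missed_cache": 1.25,
}

stacked = 1.0
for factor in multipliers.values():
    stacked *= factor

print(f"Effective cost multiplier: {stacked:.1f}x")  # ~8x, inside the 5-10x band
```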
Provider Prices in Mid-2026
The headline rates for the most-deployed production models, normalized to USD per million tokens:
| Model | Input | Output | Cached input |
|---|---|---|---|
| Claude Haiku 4.5 | $0.80 | $4.00 | $0.08 |
| Claude Sonnet 4.6 | $3.00 | $15.00 | $0.30 |
| Claude Opus 4.7 | $15.00 | $75.00 | $1.50 |
| OpenAI GPT-4.1-mini | $0.40 | $1.60 | $0.10 |
| OpenAI GPT-4.1 | $2.50 | $10.00 | $0.625 |
| OpenAI GPT-5 | $12.00 | $48.00 | $3.00 |
| Google Gemini 2.5 Flash | $0.30 | $1.20 | $0.075 |
| Google Gemini 2.5 Pro | $2.00 | $8.00 | $0.50 |
| Mistral Large 2 | $2.00 | $6.00 | $0.50 |
| Llama 3.3 70B (Groq) | $0.59 | $0.79 | n/a |
Cached input pricing is the most consequential line item that customers ignore. A workload with a 4k-token system prompt and a 500-token user message pays roughly half the headline input rate once the stable prefix is served from cache on about two-thirds of requests.
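A rough model of the blended input rate under prompt caching makes that concrete; the prices are the Sonnet 4.6 rows from the table, and the prompt sizes and hit rate are illustrative assumptions:

```python
def effective_input_price(
    headline: float,      # $/M input tokens (Sonnet 4.6: 3.00)
    cached: float,        # $/M cached input tokens (Sonnet 4.6: 0.30)
    prefix_tokens: int,   # stable system prompt
    dynamic_tokens: int,  # user message and other uncached input
    hit_rate: float,      # fraction of requests where the prefix is already cached
) -> float:
    """Blended $/M input tokens across cache hits and misses."""
    total = prefix_tokens + dynamic_tokens
    hit_cost = (prefix_tokens * cached + dynamic_tokens * headline) / 1e6
    miss_cost = total * headline / 1e6
    per_request = hit_rate * hit_cost + (1 - hit_rate) * miss_cost
    return per_request / total * 1e6


# 4k system prompt, 500-token message, prefix cached on ~2/3 of requests
print(effective_input_price(3.00, 0.30, 4_000, 500, hit_rate=0.65))  # ~1.44 $/M vs 3.00 headline
```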
Multiplier 1 — Tier Mismatch
The most common waste pattern: a single large model handles every request. Half of those requests are routing decisions, classification, or simple Q&A that a $0.40/M model would have answered indistinguishably.
```python
# Anti-pattern: every request hits the expensive model
response = await opus.complete(user_message)
```
```python
# Pattern: classify cheaply, escalate only when needed
intent = await haiku.classify(user_message)
if intent.tier == "trivial":
    return await haiku.complete(user_message)
if intent.tier == "factual":
    return await sonnet.complete_with_rag(user_message)
return await opus.complete(user_message)  # only ~15% of traffic
```
The classifier itself runs on the cheapest tier against a short message, so it adds a fraction of a cent per request. The savings on the 85% of traffic that no longer hits the expensive model dominate that overhead by orders of magnitude.
In production, well-tuned tier routing typically takes 40-60% off total spend at unchanged quality.
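Where that range comes from is straightforward arithmetic. A sketch with illustrative numbers (the traffic split, the blended prices, and the 10x ratio are assumptions):

```python
# Assumption: fallback model at ~$10/M blended (input + output), small model
# ~10x cheaper, and the classifier keeps 60% of traffic on the small tier.
FALLBACK_PRICE = 10.0   # $/M tokens, blended
SMALL_PRICE = 1.0       # $/M tokens
SMALL_SHARE = 0.60

baseline = FALLBACK_PRICE
routed = SMALL_SHARE * SMALL_PRICE + (1 - SMALL_SHARE) * FALLBACK_PRICE
print(f"${baseline:.2f}/M -> ${routed:.2f}/M ({1 - routed / baseline:.0%} saved)")
# -> $10.00/M -> $4.60/M (54% saved), inside the 40-60% band
```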
Multiplier 2 — Retry Overhead
Retries are silent. They show up in the bill, not in the metrics.
Sources of retries in a typical agent stack:
- JSON parse failures (model returned malformed structure).
- Schema validation failures (Pydantic / Zod rejection).
- Provider 429 / 5xx (retried by the SDK by default).
- Guardrail-triggered regeneration.
- Eval-failed regeneration in strict mode.
The fix is not "retry less" — it is making retries unnecessary:
- Structured output mode (provider-native JSON mode, function calling, or guided generation) drives parse-failure retries to near zero.
- Idempotency keys on tool calls prevent double-billing from network retries.
- Failover routing between providers absorbs 5xx errors without re-prompting.
A workload with naive retries spends 10-30% more than the same workload with structured output and idempotent tools.
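Whatever structured-output mechanism you use, it is worth instrumenting the residual retry path so the waste shows up in metrics and not only on the invoice. A minimal sketch; client, the response shape, and the metrics hook are placeholders for your own stack:

```python
from pydantic import BaseModel, ValidationError


class Ticket(BaseModel):
    category: str
    priority: int


class StructuredOutputError(RuntimeError):
    pass


def record_retry_waste(tokens: int) -> None:
    ...  # wire to your metrics backend (Prometheus, StatsD, ...)


async def complete_validated(client, prompt: str, max_attempts: int = 3) -> Ticket:
    """Retry on schema failures, but count every wasted attempt's tokens."""
    wasted_tokens = 0
    for _ in range(max_attempts):
        response = await client.complete(prompt)          # placeholder SDK call
        try:
            ticket = Ticket.model_validate_json(response.text)
            record_retry_waste(wasted_tokens)
            return ticket
        except ValidationError:
            wasted_tokens += response.usage.total_tokens  # assumed usage shape
    raise StructuredOutputError(f"schema still failing after {max_attempts} attempts")
```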
Multiplier 3 — Prefill Amortization
Every request pays prefill on its system prompt. With prompt caching, the cached portion costs 4-10× less (see the cached-input column in the table above). Without prompt caching, you pay it in full on every call.
Imagine a chat product with a 6,000-token system prompt (rules, examples, retrieved memory) and a 500-token user message. Each request without caching costs 6,500 input tokens. With the 6,000-token system prompt cached at a 10× discount, each request costs roughly 6,000 × 0.1 + 500 = 1,100 full-price-equivalent input tokens, which is 5.9× cheaper on the input side.
What this means in practice: structure your prompts so the stable parts come first and dynamic parts come last. Cached prefix matching is order-sensitive on every major provider.
- System prompt: [STABLE: rules, policies, examples] ← cached
- Conversation history: [STABLE for the session] ← cached
- User message: [DYNAMIC] ← not cached
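On Anthropic's API, for example, the stable blocks carry cache_control markers; a minimal sketch, with a placeholder model id and simplified history handling:

```python
import anthropic

client = anthropic.AsyncAnthropic()


async def answer(system_prompt: str, history: list[dict], user_message: str) -> str:
    # Stable prefix first and marked cacheable; the dynamic user message comes last.
    response = await client.messages.create(
        model="claude-sonnet-4-6",  # placeholder model id
        max_tokens=1024,
        system=[{
            "type": "text",
            "text": system_prompt,  # rules, policies, examples
            "cache_control": {"type": "ephemeral"},
        }],
        messages=[*history, {"role": "user", "content": user_message}],
    )
    return response.content[0].text
```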
Multiplier 4 — Multi-Tool Round-Trips
Agents that loop through tool calls pay for the entire conversation on every turn. A five-tool-call agent doesn't just pay 5× the single-turn cost — each turn carries the cumulative history.
Token usage in a 5-step agent loop with a 4k-token base context, where each tool result adds roughly 400 tokens:
| Turn | Input tokens | Cumulative input tokens |
|---|---|---|
| 1 | 4,000 | 4,000 |
| 2 | 4,400 | 8,400 |
| 3 | 4,800 | 13,200 |
| 4 | 5,200 | 18,400 |
| 5 | 5,600 | 24,000 |
The fifth turn alone costs 40% more than the first, and the last two turns account for 45% of the total. Levers that reduce this:
- Aggressive context projection. After each tool call, the orchestrator extracts the minimum the next step needs and parks the rest in working memory the agent can reach by tool call, rather than carrying it in context (see the sketch after this list).
- Step budget. Hard cap on max steps. Refuse cleanly above it.
- Sub-agent delegation. A "scratch agent" performs an expensive sub-task in its own context window and returns a 200-token summary instead of 3,000 tokens of working notes.
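A minimal sketch of context projection; the summarize call and the in-memory store are placeholders (in production the store would be Redis, a database, or a vector index):

```python
import uuid

WORKING_MEMORY: dict[str, str] = {}  # placeholder; swap for Redis or a database


async def project_tool_result(raw_result: str, summarize) -> str:
    """Keep the full tool output retrievable, but feed the loop only a projection."""
    key = str(uuid.uuid4())
    WORKING_MEMORY[key] = raw_result                       # e.g. 3,000 tokens of working notes
    summary = await summarize(raw_result, max_tokens=200)  # placeholder small-model call
    # The agent sees the short summary plus a handle it can dereference via a tool.
    return f"[memory:{key}] {summary}"


def read_memory(key: str) -> str:
    """Tool exposed to the agent for the rare step that needs the full output."""
    return WORKING_MEMORY.get(key, "")
```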
Multiplier 5 — Missed Cache Hits
Semantic caching at the edge (see the TTFT post) is as much a cost lever as a latency lever.
A 25% cache hit rate cuts inference cost by roughly 25%; the cache lookup itself is sub-millisecond and costs effectively nothing. The hit rate compounds with tier routing: a cache hit never reaches the small model, let alone the large one.
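A minimal sketch of the lookup order; the embed and generate calls and the similarity threshold are placeholders:

```python
import numpy as np

CACHE: list[tuple[np.ndarray, str]] = []  # (normalized query embedding, cached answer)
THRESHOLD = 0.92                          # cosine similarity cut-off; tune per workload


async def answer_with_cache(query: str, embed, generate) -> str:
    """Check the semantic cache before any model, small or large, is called."""
    q = await embed(query)                     # placeholder embedding call
    q = q / np.linalg.norm(q)
    for cached_vec, cached_answer in CACHE:
        if float(q @ cached_vec) >= THRESHOLD:
            return cached_answer               # zero inference cost
    answer = await generate(query)             # falls through to tier routing
    CACHE.append((q, answer))
    return answer
```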
Common hit-rate ranges by workload:
| Workload | Hit rate range |
|---|---|
| Support FAQ chatbot | 40-70% |
| General-purpose chat | 10-25% |
| Code assistant | 5-15% |
| Internal knowledge search | 25-50% |
Self-Hosting Math, Honestly
The "should we self-host?" question is mostly a volume question.
A single H100 replica running Llama 3.3 70B in 2026 sustains roughly 70-120 output tokens/second under tight quantization (FP8 or AWQ), or 6-10M output tokens/day at high utilisation.
| API model | Cost per 1M output tokens | Break-even daily output for self-hosted H100 (~$3.5k/mo) |
|---|---|---|
| Claude Sonnet 4.6 ($15/M) | $15 | ~8M tokens/day |
| GPT-4.1 ($10/M) | $10 | ~12M tokens/day |
| Gemini 2.5 Pro ($8/M) | $8 | ~15M tokens/day |
| Claude Opus 4.7 ($75/M) | $75 | ~1.6M tokens/day |
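The break-even column is just the GPU's daily cost divided by the API's output price; a sketch that reproduces the table's arithmetic (output tokens only, ignoring input cost and ops headcount):

```python
def breakeven_tokens_per_day(gpu_cost_per_month: float, api_price_per_m: float) -> float:
    """Daily output volume at which a dedicated replica matches the API bill."""
    daily_gpu_cost = gpu_cost_per_month / 30
    return daily_gpu_cost / api_price_per_m * 1_000_000


# ~$3.5k/mo H100 vs Sonnet 4.6 output at $15/M
print(f"{breakeven_tokens_per_day(3_500, 15.00) / 1e6:.1f}M tokens/day")  # ~7.8M
```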
Two caveats:
- Self-hosting costs include operations engineering. Budget at least 0.3-0.5 FTE to run the fleet.
- Self-hosted quality at 70B parameters trails frontier APIs on hard reasoning. For routine retrieval-augmented Q&A it is indistinguishable; for multi-step reasoning it lags.
A common production pattern: self-host the bulk tier (cheap, high-volume retrieval Q&A) on Llama 70B; API the top tier (complex reasoning) on Claude Opus or GPT-5. Bills frequently end up 60% lower than the all-API equivalent at unchanged quality.
Per-Tenant Budgeting
The single most-skipped cost control: enforced per-tenant budgets.
A buggy agent loop in a free-tier customer's account, with no per-tenant cap, can burn a month's margin in a weekend. We have seen $4,000 spent in 28 hours from a single misconfigured user.
The control is simple and non-optional:
```python
from uuid import UUID


async def enforce_budget(tenant_id: UUID, tokens_about_to_spend: int) -> None:
    key = f"budget:{tenant_id}:{today()}"
    used = int(await redis.get(key) or 0)       # counter may not exist yet today
    cap = await db.get_tenant_cap(tenant_id)
    if used + tokens_about_to_spend > cap:
        raise BudgetExceeded(used=used, cap=cap)
    await redis.incrby(key, tokens_about_to_spend)
    await redis.expire(key, 86400 * 2)          # keep the counter around for two days
```
Caps are tier-based (free / pro / enterprise) and updated by sales when a customer's deal includes a higher allowance. The cap is enforced before the inference call, not after.
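Order of operations, as a sketch; estimate_tokens, MAX_OUTPUT_TOKENS, model, and record_usage are placeholders for your own stack:

```python
async def handle_request(tenant_id: UUID, prompt: str) -> str:
    estimate = estimate_tokens(prompt) + MAX_OUTPUT_TOKENS  # conservative pre-call estimate
    await enforce_budget(tenant_id, estimate)               # raises BudgetExceeded before any spend
    response = await model.complete(prompt)                 # placeholder inference call
    # Reconcile the counter with what was actually billed once the call returns.
    await record_usage(tenant_id, response.usage.total_tokens - estimate)
    return response.text
```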
A Realistic Cost-Reduction Roadmap
If you are looking at an uncomfortable LLM bill, the order of operations that gets the most savings the fastest:
- Prompt caching — one engineering day, 30-50% input cost cut.
- Tier routing — one engineering week, 40-60% total cost cut.
- Per-tenant budgets — two engineering days, prevents future surprises.
- Structured outputs to kill parse retries — two days, 10-15% cost cut.
- Semantic caching at the edge — one week, 15-30% additional cost cut.
- Context projection in agent loops — one to two weeks, 20-40% cost cut on agent workloads.
- Self-hosting for the bulk tier — a one-quarter project, justified above ~8M output tokens/day.
Stack the first six and most teams see 70-85% total bill reduction without measurable quality changes.
Frequently Asked Questions
Why is my actual LLM bill 3x what the pricing page implied?
The pricing page shows cost per million tokens. Production cost adds retry overhead (10-30%), cache misses, system prompt prefill on every request, multi-tool round-trips, and tier mismatch (a large model handling work a small one could). Each multiplies the base cost.
What's the best single lever to reduce LLM cost?
Multi-tier routing with a small model first. Sending 60% of traffic to a model 10x cheaper than your fallback typically cuts total spend 40-60% with no measurable quality loss when the classifier is well-tuned.
Is prompt caching worth the engineering effort?
Yes for any production workload with a stable system prompt over ~1,000 tokens. OpenAI, Anthropic, and Google all charge 50-90% less for cached prefill tokens. Payback is typically under one week.
When does self-hosting break even versus API providers?
For a single Llama 3.3 70B replica on an H100, break-even versus Claude Sonnet 4.6 lands around 8M output tokens/day or roughly 2,500 sustained DAU on a chat workload. Below that, APIs win; above it, self-hosting wins decisively.
How accurate are these numbers six months from now?
Provider prices fall roughly 30-50% per year. The ratios between models stay broadly stable; the absolute numbers will be lower. Re-run the math quarterly.
Key Takeaways
- Effective cost is 2-5x the pricing-page number once retries, cache misses, prefill, and tier mismatch are included.
- Multi-tier routing is the single biggest cost lever; prompt caching is a close second.
- Self-hosting breaks even around 8M output tokens/day per replica; below that, APIs are cheaper.
- Per-tenant budgeting is non-negotiable. Without it, one runaway loop burns a month's margin in a weekend.
- Stack the top six levers and you typically see 70-85% bill reduction at unchanged quality.
