The Pattern We See Every Week
A team adopted GPT-4 or Claude for a specific business use case 14 months ago. They approved a $2,000/month budget based on initial usage estimates. The application worked well. Usage grew. Nobody rebuilt the architecture.
This quarter, the AI bill is $18,000/month. Finance wants to know why. Engineering wants to know what to cut. Everyone is frustrated because nobody can point to the moment the cost broke from the plan.
The honest answer: AI costs compound along three axes at once. Request volume grows. Each request uses more tokens as context accumulates. And new use cases get layered in without cost accounting. Together, those three typically produce a 5–10× bill increase within 18 months of production deployment. It is predictable, and almost always preventable with a quarterly cost-architecture review.
The Four Levers, Ranked by Leverage
We rank these by savings-per-engineering-week because that's the metric that matters. Every lever works; the question is which one to reach for first.
Lever 1: Prompt Caching (Highest Leverage)
Every production LLM application has stable context that repeats across requests: the system prompt, tool definitions, few-shot examples, reference documentation. If your application sends 3,000 tokens of that context on every request and processes 40,000 requests per day, you are paying for 120 million input tokens of identical content every day.
Both Anthropic and OpenAI now support prompt caching. Anthropic's implementation (explicit cache-control markers) prices cache reads at roughly 10% of the standard input rate once the prefix has been written to the cache. OpenAI's automatic caching works slightly differently but is comparable in economic effect.
What Makes Prompt Caching Work
- Stable prefix: cached content must be at the start of the prompt
- High hit rate: the same cached content must be reused frequently enough that the first-write premium is amortized
- Enough traffic within the TTL: caches expire (typically after 5–10 minutes, depending on the provider), so low-volume use cases don't benefit
For applications sending >500 requests/hour with stable context >1,000 tokens, prompt caching reduces the cached-portion token cost by 80–90%. On a typical production AI agent, this translates to a 40–60% total bill reduction with a one-day engineering investment.
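As a rough sketch of the mechanics with the Anthropic Python SDK (the model alias, file path, and prompt content below are placeholders; check the current docs for exact parameters):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# The stable context (system prompt, tool definitions, reference docs) goes
# first and is marked cacheable; only the user message varies per request.
SYSTEM_PROMPT = open("support_copilot_system_prompt.txt").read()  # placeholder path

def answer(user_message: str) -> str:
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",  # placeholder model alias
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"},  # cache this stable prefix
            }
        ],
        messages=[{"role": "user", "content": user_message}],
    )
    # response.usage.cache_read_input_tokens shows whether the cache actually hit
    return response.content[0].text
```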
If your team hasn't implemented prompt caching, do this first. Nothing else has comparable return on effort.
Lever 2: Batch Processing for Async Workloads
Both Anthropic and OpenAI offer batch APIs that run at 50% of real-time pricing. The tradeoff is latency: batch jobs complete within 24 hours, not in seconds.
Most teams assume their AI workload is real-time and dismiss batch. When we audit actual traffic, 30–60% of the workload is genuinely async — nightly report generation, bulk email drafts, document classification, customer feedback summarization, data enrichment pipelines. All of it could run on batch.
The Audit Question
For every AI workload in your system, ask: "What is the user-facing SLA?" If the answer is "by tomorrow morning" or "runs on a schedule," it's batch-eligible. Migrate it.
Migration cost: typically one engineer-week per workload. Savings: 50% of that workload's billed cost. Payback: usually inside the first month.
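A minimal sketch of the migration, using Anthropic's Message Batches API as an example (the model alias and prompt are illustrative; OpenAI's batch endpoint follows the same submit-and-poll shape):

```python
import anthropic

client = anthropic.Anthropic()

def submit_nightly_summaries(documents: dict[str, str]) -> str:
    """Submit one batch request per document instead of thousands of
    real-time calls at full price; results arrive within 24 hours."""
    batch = client.messages.batches.create(
        requests=[
            {
                "custom_id": doc_id,  # used to match results back to documents
                "params": {
                    "model": "claude-3-5-haiku-latest",  # illustrative model alias
                    "max_tokens": 512,
                    "messages": [
                        {"role": "user", "content": f"Summarize this document:\n\n{text}"}
                    ],
                },
            }
            for doc_id, text in documents.items()
        ]
    )
    return batch.id  # poll client.messages.batches.retrieve(batch.id) for completion
```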
Lever 3: Model Routing
Not every request needs the flagship model. The economic pattern we see repeatedly: a team starts with GPT-4 (or Claude Opus) because it "just works," then never revisits the choice. A year later, 70% of their traffic is classification tasks, extraction, or formatting — workloads where a cheap model (GPT-4o-mini, Claude Haiku) performs identically at 10–20% the cost.
Building a Router
A model router is a small classifier that examines the incoming request and decides which model to use. It doesn't have to be sophisticated. Heuristics work:
- Short query, simple intent → cheap model
- Long context, multi-step reasoning → flagship model
- Ambiguous output from cheap model → fall back to flagship model
The router itself is 50–200 lines of code. The real work is the evaluation dataset — you need 200–500 representative examples scored for both models to validate where you can safely route down.
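A sketch of what that heuristic layer can look like (the intents, thresholds, and model names are illustrative and should be tuned against your own eval set):

```python
CHEAP_MODEL = "claude-3-5-haiku-latest"      # illustrative model names
FLAGSHIP_MODEL = "claude-3-5-sonnet-latest"

SIMPLE_INTENTS = {"classify", "extract", "format", "faq_lookup"}  # illustrative intents

def route(query: str, intent: str, context_tokens: int) -> str:
    """Pick a model from cheap request-level heuristics."""
    if context_tokens > 8_000:        # long context, likely multi-step reasoning
        return FLAGSHIP_MODEL
    if intent in SIMPLE_INTENTS and len(query) < 400:
        return CHEAP_MODEL            # short query, simple intent
    return FLAGSHIP_MODEL             # default to quality when unsure

def needs_escalation(cheap_output: str) -> bool:
    """Crude ambiguity check for the fallback path; replace with a real
    confidence signal (e.g. a validation failure or low eval score)."""
    return not cheap_output.strip() or "not sure" in cheap_output.lower()
```

The fallback path in the third heuristic is simply: call the cheap model first, and if needs_escalation returns true, re-run the request on the flagship model.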
Typical outcome: 60–75% of traffic moves to a cheaper model, with no measurable quality degradation, cutting total cost by 40–55%. Engineering cost: 1–2 weeks including the eval work.
The routing anti-pattern: routing based on user segment or feature flag instead of request content. "Free users get the cheap model" produces predictable quality complaints. Route based on task complexity, not customer tier. If you want tiered service, make the tier an explicit business decision, not a cost decision hidden in the architecture.
Lever 4: Fine-Tuning (Last, Not First)
Fine-tuning is the lever AI teams reach for first and should reach for last. It's glamorous, vendor-promoted, and has the highest upfront engineering cost with the most uncertain savings outcome.
Fine-tuning works well when:
- You have a narrow, well-defined task (classification, extraction, structured output)
- You have thousands of high-quality labeled examples
- You've exhausted the prior three levers
- You have evaluation infrastructure to monitor quality drift in production
Fine-tuning does not help when:
- Your task requires broad knowledge or open-ended reasoning
- Your data is not representative of production traffic
- You'll need to retrain every time your product requirements shift
The successful fine-tuning projects we've been part of have one thing in common: they came after prompt caching, batching, and routing had already cut the bill by 70%. Fine-tuning extracted the last 10–20%. That sequence is almost always the right one.
A Representative Audit
A client's LLM spend in early Q1: $22,000/month on GPT-4o, mostly for a customer support copilot and a document summarization pipeline.
Week 1: Prompt caching on the support copilot's system prompt and tool definitions: $22K → $13K.
Week 2: Batch API migration for the overnight summarization pipeline: $13K → $9K.
Week 3: Model routing in the support copilot, with GPT-4o-mini handling classification and simple FAQ matches and GPT-4o reserved for complex multi-turn cases: $9K → $6.8K.
Total engineering investment: three weeks. Savings: $15,200/month. Payback: under a month.
Fine-tuning was on the roadmap for Q2, but we recommended deferring it: at $6.8K/month, the remaining savings no longer justified pulling engineering time away from other product work.
What Belongs in a Quarterly AI Cost Review
- Per-workload token usage breakdown (input vs output, cached vs uncached)
- Per-workload latency SLA (for batch eligibility analysis)
- Per-workload model choice and last quality eval date
- Top 10 prompts by volume and their cacheability status
- Any new workloads added since the last review
This review should take half a day if the instrumentation is in place. If it would take more than a day, the first engineering investment is observability — which is its own post.
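As a starting point for that instrumentation, a per-request usage log with a workload tag is enough to answer the first bullet above. This sketch assumes Anthropic-style usage fields (input, output, and cache token counts) and a hypothetical CSV sink; adapt the field names to whatever your provider's response actually returns:

```python
import csv
from datetime import datetime, timezone

def log_usage(workload: str, model: str, usage) -> None:
    """Append one row per request; roll up by workload for the quarterly review."""
    with open("llm_usage_log.csv", "a", newline="") as f:
        csv.writer(f).writerow([
            datetime.now(timezone.utc).isoformat(),
            workload,                                       # e.g. "support_copilot"
            model,
            usage.input_tokens,
            usage.output_tokens,
            getattr(usage, "cache_read_input_tokens", 0),   # cached portion, if any
            getattr(usage, "cache_creation_input_tokens", 0),
        ])
```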
The Bottom Line
AI costs do not stabilize on their own. The architectural patterns that work in a prototype do not work at production volume. A quarterly review, applied in the order above, keeps AI economics predictable through growth.
If your LLM bill is above $10,000/month and nobody has audited it this quarter, 40–60% of that spend is almost certainly not delivering value. That's the easiest cost reduction on your entire cloud bill.
Audit your AI spend?
We'll review your LLM workload architecture and return a prioritized cost reduction plan in 5–7 business days. No pitch.