The cost of running LLM-powered features in production can escalate quickly. A proof-of-concept that costs $20 per month during development can easily turn into $5,000 per month once real users start hitting it. We have seen this happen across multiple client deployments — the AI feature works beautifully, gets executive buy-in, goes to production, and then finance starts asking uncomfortable questions about the API bill.

The good news is that most LLM workloads have enormous optimization headroom. Across our AI consulting engagements, we routinely achieve 60–80% cost reductions without degrading output quality. The strategies fall into four categories: prompt caching, request batching, intelligent model routing, and token optimization. This article covers each one with concrete implementation details.

Understanding Where the Money Goes

Before optimizing, you need visibility. Every LLM API call has two cost components: input tokens (your prompt and context) and output tokens (the model's response). Output tokens are typically 3–5x more expensive per token than input tokens. This means a system that generates long, verbose responses is burning money on the most expensive part of the API.

Start by instrumenting your application to log every API call with its model name, input token count, output token count, latency, and a task category tag. Even a simple CSV log is enough to identify your top cost drivers. In most applications, 80% of the spend comes from 2–3 task types. Those are your optimization targets.

import csv
from datetime import datetime, timezone

class LLMCostTracker:
    def __init__(self, log_path="llm_costs.csv"):
        self.log_path = log_path

    def log_call(self, model, task_type, input_tokens, output_tokens, latency_ms):
        # Pricing per 1M tokens (update these to current rates)
        pricing = {
            "claude-sonnet-4-6":   {"input": 3.00, "output": 15.00},
            "claude-haiku-4-5":    {"input": 0.80, "output": 4.00},
            "gpt-4o":              {"input": 2.50, "output": 10.00},
            "gpt-4o-mini":         {"input": 0.15, "output": 0.60},
        }
        rates = pricing.get(model, {"input": 3.00, "output": 15.00})
        cost = (input_tokens * rates["input"] + output_tokens * rates["output"]) / 1_000_000

        # Append one row per call: timestamp, model, task type,
        # token counts, cost, and latency
        with open(self.log_path, "a", newline="") as f:
            writer = csv.writer(f)
            writer.writerow([
                datetime.now(timezone.utc).isoformat(), model, task_type,
                input_tokens, output_tokens, f"{cost:.6f}", latency_ms,
            ])
        return cost

Run this for a week in production, then analyze the log. You will likely find that a large fraction of your spend is on repetitive tasks — the same system prompt sent thousands of times, similar documents being summarized repeatedly, or classification tasks where a smaller model would perform just as well.
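Once the log has a week of data, a short aggregation script will surface those top cost drivers. A minimal sketch against the CSV columns the tracker above writes:

```python
import csv
from collections import defaultdict

def top_cost_drivers(log_path="llm_costs.csv", n=3):
    """Aggregate the cost log by task type and return the n most
    expensive categories with their share of total spend."""
    totals = defaultdict(float)
    with open(log_path, newline="") as f:
        for row in csv.reader(f):
            # Columns: timestamp, model, task_type, input_tokens,
            # output_tokens, cost, latency_ms
            _, _, task_type, _, _, cost, _ = row
            totals[task_type] += float(cost)
    grand_total = sum(totals.values()) or 1.0
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return [(task, cost, cost / grand_total) for task, cost in ranked[:n]]
```

If the 80/20 pattern holds for your workload, the first two or three rows of this output are where the rest of the strategies in this article should be applied first.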

Strategy 1: Prompt Caching

If your application sends the same system prompt or context prefix with every request, you are paying for those tokens over and over. Both Anthropic and OpenAI now offer prompt caching features that dramatically reduce costs for repeated prefixes.

Anthropic's prompt caching works by marking a portion of your prompt as cacheable. When subsequent requests share the same cached prefix, you pay a reduced rate on those tokens — typically 90% less than the standard input token price. The cache has a 5-minute TTL and refreshes on each hit, so it works well for high-throughput applications.

Key insight: Prompt caching delivers the biggest savings when your system prompt or few-shot examples are large (2,000+ tokens) and your request volume is high enough to keep the cache warm. For a customer support bot with a 3,000-token system prompt handling 100+ requests per hour, caching alone can cut input token costs by 80%.
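That figure is easy to sanity-check with back-of-envelope arithmetic, assuming Anthropic's published cache multipliers (writes billed at 1.25x the base input rate, reads at 0.1x):

```python
# Savings estimate for the support-bot example above: a 3,000-token
# system prompt at Sonnet input pricing, 100 requests per hour.
BASE_RATE = 3.00 / 1_000_000   # $ per input token
PROMPT_TOKENS = 3_000
REQUESTS_PER_HOUR = 100

def hourly_prompt_cost(cached):
    if not cached:
        return PROMPT_TOKENS * BASE_RATE * REQUESTS_PER_HOUR
    # One cache write, then cache reads for the remaining requests
    # (the 5-minute TTL stays warm at this volume).
    write = PROMPT_TOKENS * BASE_RATE * 1.25
    reads = PROMPT_TOKENS * BASE_RATE * 0.1 * (REQUESTS_PER_HOUR - 1)
    return write + reads

savings = 1 - hourly_prompt_cost(True) / hourly_prompt_cost(False)
```

Under these assumptions the prompt-prefix cost drops by close to 90%; the blended 80% figure is lower because the dynamic per-request input is never cached.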

The implementation is straightforward. Structure your prompts so the static portion comes first, followed by the dynamic user input:

import anthropic

client = anthropic.Anthropic()

# The system prompt and few-shot examples are cached
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LARGE_SYSTEM_PROMPT,  # 3000+ tokens of instructions
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[
        {"role": "user", "content": user_query},  # Dynamic per request
    ],
)

# Check cache performance in the response
usage = response.usage
print(f"Cache read tokens: {usage.cache_read_input_tokens}")
print(f"Cache creation tokens: {usage.cache_creation_input_tokens}")
print(f"Regular input tokens: {usage.input_tokens}")

For applications that include document context (like RAG pipelines), you can cache the retrieved documents between turns in a multi-turn conversation. This is especially valuable when the user asks follow-up questions about the same set of documents — the second and third queries cost a fraction of the first.
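One way to structure this is to mark the retrieved documents as a cacheable content block in the first user message. A sketch, using a helper name of our own (`build_cached_rag_messages` is not an SDK function):

```python
def build_cached_rag_messages(documents_text, question):
    """Build a messages payload where the retrieved documents are
    marked cacheable, so follow-up questions about the same documents
    hit the cache instead of re-paying for the full context."""
    return [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": f"Documents:\n{documents_text}",
                    "cache_control": {"type": "ephemeral"},
                },
                {"type": "text", "text": question},
            ],
        }
    ]

# Usage (assumes an anthropic.Anthropic() client as above):
# response = client.messages.create(
#     model="claude-sonnet-4-6", max_tokens=1024,
#     messages=build_cached_rag_messages(docs, "What is the total?"),
# )
```

The key design point is ordering: the documents come before the question, so the expensive, stable prefix is identical across turns while only the short question varies.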

Strategy 2: Request Batching

Both Anthropic and OpenAI offer batch APIs that process requests asynchronously at a 50% discount. If your workload does not require real-time responses — think nightly report generation, bulk document classification, content moderation queues, or data extraction pipelines — batching is the single easiest cost reduction you can make.

The tradeoff is latency. Batch requests are processed within a 24-hour window (typically much faster, but without guarantees). For any workload where a few hours of delay is acceptable, this is free money.

import anthropic
import json

client = anthropic.Anthropic()

# Prepare batch requests
requests = []
for i, document in enumerate(documents_to_classify):
    requests.append({
        "custom_id": f"doc-{i}",
        "params": {
            "model": "claude-haiku-4-5",
            "max_tokens": 100,
            "messages": [
                {"role": "user", "content": f"Classify this document into one of: invoice, receipt, contract, letter.\n\nDocument:\n{document}"}
            ],
        },
    })

# Submit the batch
batch = client.messages.batches.create(requests=requests)
print(f"Batch ID: {batch.id}, Status: {batch.processing_status}")

# Poll for completion (or use a webhook)
# Results arrive at 50% of standard pricing

A practical pattern we use is a hybrid approach: real-time requests go through the standard API for anything user-facing, while background processing jobs are automatically routed to the batch API. A simple queue (SQS or Redis) separates the two paths, and a worker process collects batch-eligible requests, submits them every 15 minutes, and processes results when they arrive.
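The collector side of that worker can be sketched as follows. This is an in-process stand-in for illustration; a production version would back the queue with SQS or Redis as described above, and `submit_fn` would wrap the batch API call:

```python
import threading
import time

class BatchCollector:
    """Accumulates batch-eligible requests and flushes them to a
    submit function on an interval (15 minutes in our deployments)."""

    def __init__(self, submit_fn, flush_interval_s=900):
        self.submit_fn = submit_fn
        self.flush_interval_s = flush_interval_s
        self.pending = []
        self.lock = threading.Lock()

    def enqueue(self, custom_id, params):
        with self.lock:
            self.pending.append({"custom_id": custom_id, "params": params})

    def flush(self):
        # Swap out the pending list atomically, then submit outside the lock
        with self.lock:
            batch, self.pending = self.pending, []
        if batch:
            return self.submit_fn(batch)

    def run_forever(self):
        while True:
            time.sleep(self.flush_interval_s)
            self.flush()
```

Real-time traffic never touches this path, so user-facing latency is unaffected while everything else quietly earns the 50% discount.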

Strategy 3: Intelligent Model Routing

Not every task needs your most capable (and most expensive) model. A classification task that needs to pick from 5 categories does not require the same model as a nuanced legal document summary. Model routing — sending each request to the cheapest model that can handle it well — is one of the highest-leverage optimizations available.

We implement this as a simple routing layer that examines the task type and selects the appropriate model:

MODEL_ROUTING = {
    # Task type → (model, max_tokens)
    "classification":     ("claude-haiku-4-5", 50),
    "entity_extraction":  ("claude-haiku-4-5", 200),
    "short_summary":      ("claude-haiku-4-5", 300),
    "detailed_analysis":  ("claude-sonnet-4-6", 2000),
    "creative_writing":   ("claude-sonnet-4-6", 4000),
    "code_generation":    ("claude-sonnet-4-6", 4000),
}

def route_request(task_type, prompt):
    model, max_tokens = MODEL_ROUTING.get(task_type, ("claude-sonnet-4-6", 1024))
    response = client.messages.create(
        model=model,
        max_tokens=max_tokens,
        messages=[{"role": "user", "content": prompt}],
    )
    return response

The price difference is substantial. As of early 2026, Claude Haiku is roughly 4x cheaper than Sonnet on input tokens and nearly 4x cheaper on output tokens. For tasks where Haiku performs within 5% of Sonnet's quality — and there are many — you are leaving money on the table by using the larger model.

To validate your routing decisions, run an evaluation set of 100–200 representative examples through both models, score the outputs (manually or with an automated eval), and confirm that the cheaper model meets your quality threshold for each task type. We maintain these eval sets per client and rerun them quarterly as models get updated.
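The quarterly check can be as simple as the following sketch, where `run_fn` and `score_fn` are hypothetical hooks you supply (one calls the model, the other scores an output against the example, manually or via an automated eval):

```python
def routing_passes(eval_set, run_fn, score_fn, candidate_model,
                   baseline_model, threshold=0.95):
    """Run the eval set through both models and check whether the
    cheaper candidate scores within the quality threshold (here,
    95% of the baseline model's average score)."""
    def avg_score(model):
        scores = [score_fn(ex, run_fn(model, ex)) for ex in eval_set]
        return sum(scores) / len(scores)

    return avg_score(candidate_model) >= threshold * avg_score(baseline_model)
```

If the check fails for a task type, that task stays on the larger model; the routing table is updated only for the tasks that pass.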

Strategy 4: Token Optimization

Token optimization is about reducing the number of tokens in both your prompts and the model's responses without losing information. There are several practical techniques.

Trim Your System Prompts

System prompts tend to accumulate instructions over time as developers add edge-case handling. Review yours critically. Remove redundant instructions, consolidate similar rules, and test whether removing a paragraph actually changes output quality. We have seen system prompts shrink from 4,000 tokens to 1,500 tokens with no measurable quality loss — that is a direct cost savings on every single API call.

Constrain Output Length

Set max_tokens to the smallest value that covers your use case. If you need a one-sentence classification, set max_tokens to 50, not 1024. More importantly, instruct the model to be concise in your prompt. A simple addition like "Respond in 2-3 sentences maximum" can cut output token usage by 60% for summarization tasks.

Use Structured Outputs

When you need structured data back from the model, request JSON and parse it programmatically. This is both cheaper (JSON responses are typically more concise than prose) and more reliable. Most API providers now support constrained output schemas that guarantee valid JSON:

# Instead of: "Analyze this customer feedback and tell me the sentiment,
# key topics, and urgency level"
# Use structured output:

response = client.messages.create(
    model="claude-haiku-4-5",
    max_tokens=150,
    messages=[{
        "role": "user",
        "content": f"""Analyze this feedback. Return JSON only:
{{"sentiment": "positive|negative|neutral", "topics": ["topic1"], "urgency": "low|medium|high"}}

Feedback: {feedback_text}"""
    }],
)
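On the receiving end, parse defensively: even with a JSON-only instruction, models occasionally wrap the object in a code fence or a sentence of prose. A small helper (our own, not part of any SDK) handles both cases:

```python
import json

def parse_feedback_json(raw_text):
    """Extract and parse the first JSON object in the model's reply,
    tolerating surrounding prose or code fences. Returns None when no
    valid object is found, so callers can retry or escalate."""
    start = raw_text.find("{")
    end = raw_text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        return json.loads(raw_text[start:end + 1])
    except json.JSONDecodeError:
        return None
```

A None return is a signal, not an error: retry once with a stricter instruction, or fall back to a default classification.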

Compress Context Documents

In RAG pipelines, the retrieved documents often contain boilerplate, headers, footers, and formatting that consume tokens without adding useful context. Pre-process your documents to strip HTML tags, remove repeated headers, collapse whitespace, and truncate to the most relevant sections. A simple preprocessing pipeline can reduce document token counts by 30–40%.
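A minimal version of that pipeline, assuming HTML-ish source documents (the regex tag-stripping is deliberately crude; a real pipeline might use an HTML parser instead):

```python
import re

def compress_document(raw_text, max_chars=8000):
    """Strip tags, drop repeated lines (headers/footers), collapse
    whitespace, and truncate to a character budget."""
    text = re.sub(r"<[^>]+>", " ", raw_text)   # strip HTML tags
    seen, kept = set(), []
    for line in text.splitlines():
        line = " ".join(line.split())           # collapse whitespace
        if line and line not in seen:           # drop exact repeats
            seen.add(line)
            kept.append(line)
    return "\n".join(kept)[:max_chars]
```

The `max_chars` truncation is a blunt instrument; where retrieval scores are available, truncating to the highest-scoring sections is the better cut.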

Putting It All Together: A Cost Optimization Checklist

Here is the order we recommend implementing these optimizations, from easiest to most involved:

  1. Instrument your costs. Add token and cost logging to every API call. You cannot optimize what you cannot measure. This takes an hour to implement and immediately reveals your biggest cost drivers.
  2. Enable prompt caching. If your system prompt is over 1,000 tokens and you handle more than 50 requests per hour, this is a quick win. Implementation takes less than a day.
  3. Route simple tasks to smaller models. Identify your classification, extraction, and short-answer tasks. Evaluate Haiku or GPT-4o-mini against your current model. Switch the ones that pass your quality bar. This typically takes a week including evaluation.
  4. Move batch-eligible workloads to the batch API. Any task that does not need a sub-second response is a candidate. The 50% discount applies automatically with minimal code changes.
  5. Optimize token usage. Trim system prompts, constrain output lengths, switch to structured outputs where possible, and preprocess context documents. This is ongoing work that compounds over time.

Real-World Impact

To give concrete numbers: one of our clients was running a document processing pipeline that analyzed incoming invoices, extracted key fields, classified them by department, and generated approval summaries. Their initial implementation used Claude Sonnet for every step, with a 4,000-token system prompt repeated on each call. Monthly cost: approximately $4,200 for 80,000 documents per month.

After optimization — prompt caching on the system prompt, routing extraction and classification to Haiku, batching the nightly summary generation, and trimming the system prompt from 4,000 to 1,800 tokens — the monthly cost dropped to $890. That is a 79% reduction with identical output quality, validated against a held-out evaluation set of 500 documents scored by the client's operations team.

The takeaway is simple: LLM APIs are powerful but not cheap, and the default way most teams implement them leaves significant money on the table. A week of focused optimization work typically pays for itself within the first month and continues saving money every month after that. The key is to start with measurement, then apply the strategies in order of effort-to-impact ratio.