Why LLM Observability Is Not APM
Traditional application observability is built around a simple question: "Is the request succeeding, and if not, where is it failing?" That question is almost never useful for LLM applications. When an LLM request returns "the answer is Paris" but the correct answer was "London," no amount of traditional logging tells you why.
LLM observability is about debugging reasoning, not debugging mechanics. The log schema, the tracing pattern, and the evaluation tooling all have to be designed for that reality.
The Minimum Log Schema
For every LLM call, capture all of the following fields. Missing any one of them leaves a non-trivial class of production bugs unreproducible.
- Request ID — tying back to the user-facing request
- Model identifier and version — "claude-opus-4-6" is insufficient if the provider updates the model weights; capture the provider's returned model_version if available
- System prompt (full text) — not hashed, not truncated, full content. This is the biggest driver of behavior
- Input messages (full) — all turns, not just the latest
- Temperature, top_p, max_tokens, stop sequences — every parameter that affects output
- Tool/function definitions if the call uses tool calling
- Complete response — content, tool calls, finish reason
- Usage metrics — input tokens (cached vs uncached), output tokens, thinking tokens if applicable
- Latency — start-to-first-token, start-to-finish, provider-reported vs your measurement
- Retry history — if the request was retried due to error or content filter, all attempts
- User feedback signal — thumbs up/down, dismissal, conversion, whatever your product uses to indicate success
- Cost in dollars — calculated from tokens at the time of the call, stored alongside the log
The last point matters because pricing changes. Calculating cost at query time, later and across old logs, means reconstructing the pricing table as it stood at the time of each call. Calculating cost at log time freezes the value.
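A minimal sketch of such a record, assuming a hypothetical `log_llm_call` helper and an in-house pricing table; the field and model names are illustrative, not a standard schema.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
import json

# Illustrative per-million-token prices; your real table will differ and change over time.
PRICING = {"example-model-v1": {"input": 3.00, "cached_input": 0.30, "output": 15.00}}

@dataclass
class LLMCallRecord:
    request_id: str
    model: str                      # identifier you requested
    model_version: str              # provider-returned version, if available
    system_prompt: str              # full text, not hashed or truncated
    messages: list                  # all turns
    params: dict                    # temperature, top_p, max_tokens, stop, ...
    tools: list                     # tool/function definitions, if any
    response: dict                  # content, tool calls, finish reason
    usage: dict                     # input/cached/output/thinking tokens
    latency_ms: dict                # time-to-first-token, total, provider-reported
    retries: list = field(default_factory=list)
    feedback: str | None = None     # filled in later when the user reacts
    cost_usd: float = 0.0           # frozen at log time, never recomputed later
    logged_at: str = ""

def cost_at_log_time(model: str, usage: dict) -> float:
    p = PRICING[model]
    return (usage.get("uncached_input_tokens", 0) * p["input"]
            + usage.get("cached_input_tokens", 0) * p["cached_input"]
            + usage.get("output_tokens", 0) * p["output"]) / 1_000_000

def log_llm_call(record: LLMCallRecord) -> str:
    record.cost_usd = cost_at_log_time(record.model, record.usage)
    record.logged_at = datetime.now(timezone.utc).isoformat()
    return json.dumps(asdict(record))   # ship this to your hot tier
```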
Where to Store This
The volume is substantial: a moderately active LLM application generates 10–50 GB of logs per day. A three-tier storage layout that works:
- Hot tier (last 7–14 days): Elasticsearch or OpenSearch for search and aggregation. Supports "show me all failing conversations from yesterday."
- Warm tier (last 90 days): S3 Parquet partitioned by date and user_id. Query via Athena or Redshift Spectrum for retrospective analysis.
- Cold tier (90+ days): S3 Glacier. Retained for audit/compliance. Rarely queried.
Expensive-to-query cold storage is the right tradeoff for compliance retention. You don't need to query 9-month-old LLM logs at interactive speed.
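As a sketch of the warm tier, assuming pyarrow is available and each batch of logs arrives as dicts shaped like the record above with `date` and `user_id` columns; the bucket URI is a placeholder, and depending on your pyarrow version you may need to pass an explicit S3 filesystem.

```python
import pyarrow as pa
import pyarrow.dataset as ds

def write_warm_tier(records: list[dict], base_uri: str = "s3://your-llm-logs/warm") -> None:
    """Write a batch of call records as Parquet, partitioned by date and user_id."""
    table = pa.Table.from_pylist(records)
    ds.write_dataset(
        table,
        base_dir=base_uri,
        format="parquet",
        partitioning=["date", "user_id"],     # hive-style directories Athena can prune
        partitioning_flavor="hive",
        existing_data_behavior="overwrite_or_ignore",
    )
```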
Tracing Across Multi-Step Agents
A single user-facing request may trigger 5–20 LLM calls in a modern agent architecture. Traditional logging shows each call in isolation, which is useless for debugging "the agent answered wrong." What you need is trace correlation.
OpenTelemetry's emerging LLM semantic conventions (under the "gen_ai" namespace) are the standard we'd recommend if you're instrumenting now. Every LLM call becomes a span, nested under the parent request span. Tool calls become child spans. Retrieval calls (for RAG) become child spans. The full decision tree of the agent is inspectable in a trace viewer.
Jaeger, Tempo, or Honeycomb all render this well. Langfuse and LangSmith are LLM-specific trace viewers that have become de facto standards for teams running agent architectures.
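A sketch of the span nesting with the OpenTelemetry Python SDK; the gen_ai.* attribute names follow the draft semantic conventions and may still change, and `search_index` / `call_model` are hypothetical application functions standing in for your retriever and model client.

```python
from opentelemetry import trace

tracer = trace.get_tracer("agent")

def handle_request(user_query: str):
    # Parent span: one user-facing request.
    with tracer.start_as_current_span("agent.handle_request") as request_span:
        request_span.set_attribute("user.query.length", len(user_query))

        # Child span: retrieval step for RAG.
        with tracer.start_as_current_span("retrieval.search"):
            context = search_index(user_query)           # hypothetical retriever

        # Child span: the LLM call, annotated with gen_ai.* attributes.
        with tracer.start_as_current_span("gen_ai.chat") as llm_span:
            llm_span.set_attribute("gen_ai.request.model", "example-model-v1")
            llm_span.set_attribute("gen_ai.request.temperature", 0.2)
            response = call_model(user_query, context)   # hypothetical client wrapper
            llm_span.set_attribute("gen_ai.response.model", response["model_version"])
            llm_span.set_attribute("gen_ai.usage.input_tokens", response["usage"]["input_tokens"])
            llm_span.set_attribute("gen_ai.usage.output_tokens", response["usage"]["output_tokens"])
        return response
```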
Evaluation as Continuous Instrumentation
Logging captures what happened. Evaluation captures whether what happened was right. You need both.
Production evaluation patterns worth running:
- LLM-as-judge, offline: a sample of production logs is re-scored daily by a stronger model for quality and accuracy, and the scores are trended over time.
- Regression suite, CI: a golden set of 200–500 prompts with expected responses, run against every model version change before deployment.
- User feedback rate tracking: thumbs-up rate, explicit correction rate, conversation abandonment rate. Segmented by user cohort, use case, and model.
- Drift detection: alert when the distribution of response length, tool usage pattern, or refusal rate shifts beyond baseline. Model providers do push silent weight updates.
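A sketch of the drift check on output length, assuming daily per-response token counts are already queryable from your logs; the two-sample KS test from scipy is one reasonable choice, and both thresholds are illustrative.

```python
from scipy.stats import ks_2samp

def output_length_drift(baseline_lengths: list[int], today_lengths: list[int],
                        p_threshold: float = 0.01, shift_threshold: float = 0.30) -> bool:
    """Flag drift if today's output-token distribution differs materially from baseline."""
    stat, p_value = ks_2samp(baseline_lengths, today_lengths)
    baseline_mean = sum(baseline_lengths) / len(baseline_lengths)
    today_mean = sum(today_lengths) / len(today_lengths)
    relative_shift = abs(today_mean - baseline_mean) / baseline_mean
    # Require both statistical significance and a large mean shift before alerting.
    return p_value < p_threshold and relative_shift > shift_threshold
```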
The PII question: logging full prompts and responses means logging user PII that appears in them. Your log retention policy, encryption posture, and access control need to be explicit about this. Customers will ask. Auditors will ask. Tokenize or redact at log time for sensitive workflows — or segment your logging infrastructure so that only authorized staff can decrypt the sensitive fields.
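A minimal sketch of redact-at-log-time, using regex patterns for two common PII shapes; real deployments typically use a dedicated PII detection service, and the patterns here are illustrative only.

```python
import re

# Illustrative patterns only; production redaction needs a proper PII detection pass.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def redact(text: str) -> str:
    """Replace detected PII with typed placeholders before the text is logged."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```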
What to Alert On
Alerting on LLM apps requires different thresholds than traditional apps. Some guidance:
- P99 latency > 2× baseline — usually indicates provider-side capacity issues; alert but page only during business hours
- Error rate > 2% — provider rate limits, auth issues, content filter trips; page
- User feedback thumbs-down rate > 15% — something changed in model behavior or prompt; page
- Cost per request > 3× baseline — prompt caching broke, or an agent is in a loop; page
- Average output token count shifts >30% — model behavior drift; investigate
Don't alert on every individual failed LLM call. Noise destroys signal.
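A sketch of how those thresholds might be encoded, assuming aggregate metrics and baselines are computed per window elsewhere; the numbers mirror the list above, and the `page` / `investigate` actions are placeholders for your paging integration.

```python
def evaluate_llm_alerts(metrics: dict, baseline: dict) -> list[tuple[str, str]]:
    """Return (alert, action) pairs for one aggregation window of LLM metrics."""
    alerts = []
    if metrics["p99_latency_ms"] > 2 * baseline["p99_latency_ms"]:
        alerts.append(("p99_latency_2x_baseline", "alert_business_hours"))
    if metrics["error_rate"] > 0.02:
        alerts.append(("error_rate_above_2pct", "page"))
    if metrics["thumbs_down_rate"] > 0.15:
        alerts.append(("thumbs_down_above_15pct", "page"))
    if metrics["cost_per_request"] > 3 * baseline["cost_per_request"]:
        alerts.append(("cost_per_request_3x_baseline", "page"))
    mean_shift = abs(metrics["avg_output_tokens"] - baseline["avg_output_tokens"]) / baseline["avg_output_tokens"]
    if mean_shift > 0.30:
        alerts.append(("avg_output_tokens_shift_30pct", "investigate"))
    return alerts
```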
The Bug Report Test
The final test for whether your LLM observability is sufficient: take a 3-day-old bug report of the form "the model gave a wrong answer when I asked about X." Can you, with existing logs, pull up the exact conversation, see the full prompt and full response, identify which model version was used, check whether caching was hit, see the tool calls made, and verify whether user feedback was captured?
If any of those is a no, you have gaps. The fix is not expensive — adding log fields is usually an afternoon of work — but you have to do it before the first audit or outage, not after.
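A sketch of the hot-tier lookup that makes the test passable, assuming records shaped like the schema above are indexed in OpenSearch under a hypothetical `llm-calls` index; the endpoint and index name are placeholders, and the call shape follows opensearch-py.

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=["https://logs.internal:9200"])    # placeholder endpoint

def fetch_conversation(request_id: str) -> list[dict]:
    """Pull every logged LLM call for one user-facing request, oldest first."""
    result = client.search(
        index="llm-calls",                                   # hypothetical hot-tier index
        body={
            "query": {"term": {"request_id": request_id}},
            "sort": [{"logged_at": "asc"}],
            "size": 100,
        },
    )
    return [hit["_source"] for hit in result["hits"]["hits"]]
```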
The Bottom Line
LLM observability is the discipline that separates AI prototypes from AI production systems. Teams that treat LLM calls like any other API call will find they cannot debug, cannot improve, and cannot defend their system's behavior when it matters. Teams that build a proper log schema, trace architecture, and evaluation harness from day one have a compounding advantage.
Instrumenting your first LLM production system?
We build observability architectures for LLM and agent systems. 30-min scoping call, written recommendation in 5–7 business days.