
When Your System Lies in Complete Sentences: Observing LLMs in Production

  • Writer: Chandra Sekar Reddy
  • May 24
  • 8 min read

Picture this scenario.

Your AI-powered feature is running. Every metric you watch looks clean. Error rate: flat. P99 latency: within SLO. HTTP 200s across the board. Your on-call engineer has nothing to page about.


But for the last three days, the model has been producing subtly wrong output. Not wrong in a way that crashes anything. Not wrong in a way that fires a single alert. The responses are fluent, confident, and structurally perfect. They just happen to be incorrect in ways that matter to your business. And your entire observability stack — the one you spent months building — has no idea.


This is not a theoretical edge case. It is a failure mode waiting inside any LLM-powered system that was instrumented only for traditional reliability.


The question is not whether it will surface in your environment. The question is whether you will have the right signals to catch it in minutes or days.


Why this is different from every failure mode you have handled before

For over a decade in SRE, our mental model of failure has been fundamentally binary. Services are up or down. Requests succeed or fail. Latency is within SLO or it is not. The system announces when something is wrong. Our job is to hear it and respond. LLMs introduce a third state: technically successful, qualitatively wrong.


Your system responds in 650 milliseconds. It uses exactly the tokens you budgeted. No exceptions are thrown. No circuit breakers fire. The output is returned with complete grammatical confidence.


And it is wrong.

We have a name for this in the industry: soft failure. But most engineering teams are still running observability stacks designed exclusively for hard failure. The gap between those two postures is where your LLM reliability risk lives.


The uncomfortable truth is this: the error signal for a hallucinating or drifted LLM looks exactly like the success signal. Every assumption we trained ourselves to make — that a 200 OK means something worked, that low latency means good user experience, that zero errors means healthy operation — breaks down at the semantic layer.


To operate LLM-powered systems reliably, you have to think differently about what "working" actually means.


Five failure modes to design for before they find you

These are not hypothetical. Each one has a distinct signature, and each one is invisible to traditional monitoring without deliberate instrumentation.


1. Silent model drift

AI providers update model behavior behind the same API endpoint. No version bump in the response headers. No notification. No changelog item you will see unless you are specifically watching for it. Your prompts were tuned against one behavioral baseline. They may now be running against a different one.

What to watch: response style fingerprinting — a hash of stable response characteristics (sentence length distribution, structural patterns) that should remain relatively stable within a given model version but shifts meaningfully when the underlying model changes.
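One way to sketch such a fingerprint, assuming coarse bucketed features are enough to separate model versions. The feature set, the bucket sizes, and the function names `style_features` and `style_fingerprint` are illustrative assumptions, not a standard:

```python
import hashlib
import re
import statistics


def style_features(text: str) -> dict:
    """Coarse, bucketed style features. Bucketing absorbs small per-response variation."""
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences] or [0]
    words = max(len(text.split()), 1)
    punct = sum(text.count(c) for c in ",;:")
    return {
        # Mean sentence length in 5-word buckets (assumed bucket size).
        "mean_sentence_len_bucket": round(statistics.mean(lengths) / 5),
        # Punctuation marks per 100 words, in buckets of 5.
        "punct_per_100_words_bucket": round(100 * punct / words / 5),
        "uses_bullets": bool(re.search(r"^\s*[-*•]\s", text, re.MULTILINE)),
        "uses_headings": bool(re.search(r"^#+\s", text, re.MULTILINE)),
    }


def style_fingerprint(text: str) -> str:
    """Short hash of the bucketed features, suitable as a span attribute value."""
    features = style_features(text)
    canonical = "|".join(f"{k}={features[k]}" for k in sorted(features))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
```

A single fingerprint is noisy. The useful signal is how the distribution of fingerprints across many responses moves over time.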


2. Context window degradation

In systems that maintain conversation history or use retrieval-augmented generation, long threads and large retrieved documents quietly push useful context out of the model's context window, or dilute it until the model attends to it poorly. The model is not failing. It is reasoning on progressively less relevant information. Responses degrade gradually, then suddenly, with nothing anomalous at the infrastructure layer.

What to watch: input token count per request over time, and the ratio of retrieval content to actual prompt content. A creeping input token baseline is a warning sign before quality degrades visibly.
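A minimal sketch of both signals, assuming your application can count tokens separately for instructions, history, and retrieved content. The class `PromptTokens`, the function `baseline_creep`, and the 7-day window are hypothetical choices for illustration:

```python
from dataclasses import dataclass
from statistics import fmean


@dataclass(frozen=True)
class PromptTokens:
    instruction_tokens: int  # system prompt plus the user's actual question
    history_tokens: int      # prior conversation turns
    retrieval_tokens: int    # retrieved documents injected into the prompt

    @property
    def input_tokens(self) -> int:
        return self.instruction_tokens + self.history_tokens + self.retrieval_tokens

    @property
    def retrieval_ratio(self) -> float:
        """Share of the input that is retrieved content."""
        return self.retrieval_tokens / self.input_tokens if self.input_tokens else 0.0


def baseline_creep(daily_mean_input_tokens: list, window: int = 7) -> float:
    """Relative change of the latest window's mean versus the window before it."""
    if len(daily_mean_input_tokens) < 2 * window:
        raise ValueError("need at least two full windows of history")
    previous = fmean(daily_mean_input_tokens[-2 * window:-window])
    recent = fmean(daily_mean_input_tokens[-window:])
    if previous == 0:
        raise ValueError("previous window mean is zero")
    return (recent - previous) / previous
```

For example, `PromptTokens(200, 300, 1500).retrieval_ratio` is 0.75, which says three quarters of the input is retrieved material rather than the prompt itself.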


3. TPOT stutter

Most teams track TTFT — Time to First Token — and treat it as the primary latency metric for streaming applications. It is not.


The metric that users actually experience in a streaming interface is TPOT: Time Per Output Token. A model can have an excellent TTFT and a poor TPOT. The result is a response that starts quickly and then visibly stutters — slow, hesitant tokens that erode trust in the product. Your P99 latency SLO will not catch this. It reports on the full request duration, not the inter-token cadence.

What to watch: TPOT measured as the P99 of inter-token gap durations within a single response, tracked as a separate metric from TTFT.
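The measurement described above can be sketched as follows, assuming the client records a monotonic timestamp as each streamed token (or chunk) arrives. The function names and the nearest-rank percentile method are assumptions made for illustration:

```python
import math


def inter_token_gaps(token_timestamps: list) -> list:
    """Seconds between consecutive tokens in one streamed response."""
    return [b - a for a, b in zip(token_timestamps, token_timestamps[1:])]


def percentile(values: list, pct: float) -> float:
    """Nearest-rank percentile of a non-empty list."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def stream_latency(request_start: float, token_timestamps: list) -> dict:
    """TTFT and TPOT for one response, reported as separate metrics."""
    if not token_timestamps:
        raise ValueError("no tokens were received")
    gaps = inter_token_gaps(token_timestamps)
    return {
        "ttft_s": token_timestamps[0] - request_start,
        "tpot_p99_s": percentile(gaps, 99) if gaps else 0.0,
        "tpot_mean_s": sum(gaps) / len(gaps) if gaps else 0.0,
    }
```

A response whose first token arrives after 0.3 seconds but which pauses for half a second mid-stream shows a healthy `ttft_s` and a poor `tpot_p99_s`, which is exactly the case a full-request latency SLO hides.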


4. Semantic injection at the input layer

In applications where user content flows into prompts — which is most production LLM applications — carefully constructed inputs can alter model behavior without triggering any application-layer detection. The request looks normal at every layer you are watching. The output is not.

What to watch: anomalous patterns in output structure relative to input type, especially for high-stakes output categories. Flag responses that deviate structurally from expected patterns for a given input class.
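As a rough illustration, assume a feature whose output for a given input class is expected to be a JSON object with a fixed set of keys. The `EXPECTED_KEYS` table, the input class name, and the reason strings are hypothetical:

```python
import json

# Hypothetical contract: input class -> keys the output is expected to contain.
EXPECTED_KEYS = {
    "ticket_summary": {"summary", "severity"},
}


def structural_deviation(input_class: str, output: str) -> list:
    """Return the reasons an output deviates from its expected structure (empty if none)."""
    expected = EXPECTED_KEYS.get(input_class)
    if expected is None:
        return ["unknown_input_class"]
    try:
        payload = json.loads(output)
    except json.JSONDecodeError:
        return ["not_json"]
    if not isinstance(payload, dict):
        return ["not_object"]
    reasons = []
    missing = sorted(expected - set(payload))
    extra = sorted(set(payload) - expected)
    if missing:
        reasons.append("missing_keys:" + ",".join(missing))
    if extra:
        reasons.append("unexpected_keys:" + ",".join(extra))
    return reasons
```

A check like this does not detect injection by itself. It turns "the output looks unusual for this kind of input" into a countable event you can alert on.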


5. Cost explosion as a reliability signal

Token spend is not just a billing concern. It is an operational signal that most teams never connect to their reliability practice. When the average tokens per request on a given endpoint suddenly increases by 2x or 3x, something has changed: prompt behavior, input pattern shifts, retrieval results bloating the context, or the model itself generating more verbose output. If this signal only appears in your monthly billing report, you are operating blind for weeks at a time.


What to watch: tokens per request tracked as a time-series metric, with alert thresholds on rolling average deviation, not just absolute cost.
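A sketch of a rolling-deviation check, assuming one monitor instance per endpoint. The class name, window sizes, and the 1.5x ratio threshold are illustrative assumptions you would tune against your own traffic:

```python
from collections import deque
from statistics import fmean


class TokensPerRequestMonitor:
    """Compares a short recent window against a longer, non-overlapping baseline."""

    def __init__(self, baseline_size=500, recent_size=50, max_ratio=1.5):
        self.baseline = deque(maxlen=baseline_size)
        self.recent = deque(maxlen=recent_size)
        self.recent_size = recent_size
        self.max_ratio = max_ratio

    def observe(self, total_tokens: int) -> bool:
        """Record one request; return True if the recent mean exceeds the threshold."""
        # The oldest recent value graduates into the baseline before it is evicted.
        if len(self.recent) == self.recent_size:
            self.baseline.append(self.recent[0])
        self.recent.append(total_tokens)
        if len(self.baseline) < self.recent_size:
            return False  # not enough history to judge yet
        baseline_mean = fmean(self.baseline)
        if baseline_mean == 0:
            return False
        return fmean(self.recent) / baseline_mean > self.max_ratio
```

Because the baseline trails the recent window, a sudden jump stands out immediately, while a slow drift is gradually absorbed. Pair it with an absolute threshold if slow drift matters to you.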


How to actually instrument this

Here is the instrumentation baseline I recommend for every LLM call, built on OpenTelemetry GenAI semantic conventions:
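A minimal sketch of the per-call attribute set, written as a plain dictionary so it runs without any SDK installed. The `gen_ai.*` keys follow the OpenTelemetry GenAI semantic conventions as I understand them, and those conventions are still evolving, so check the current spec. Every `app.*` key, the function names, and the simple fingerprint are assumptions for illustration:

```python
import hashlib


def _fingerprint(text: str) -> str:
    """Placeholder style hash: bucketed words-per-sentence and comma density."""
    sentences = max(sum(text.count(c) for c in ".!?"), 1)
    words = max(len(text.split()), 1)
    features = f"{round(words / sentences / 5)}|{round(100 * text.count(',') / words / 5)}"
    return hashlib.sha256(features.encode("utf-8")).hexdigest()[:12]


def llm_span_attributes(*, request_model, input_tokens, output_tokens, ttft_s,
                        tpot_p99_s, response_text, prompt_version, context):
    """Build the attributes to attach to the span that wraps one LLM call."""
    attrs = {
        # Names from the OpenTelemetry GenAI semantic conventions.
        "gen_ai.request.model": request_model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        # Application-defined names (hypothetical).
        "app.llm.ttft_s": ttft_s,
        "app.llm.tpot_p99_s": tpot_p99_s,
        "app.llm.prompt_version": prompt_version,
        # Ratio of input to output tokens; 0.0 when nothing was generated.
        "app.llm.token_efficiency_ratio": (
            input_tokens / output_tokens if output_tokens else 0.0
        ),
        "app.llm.model_fingerprint": _fingerprint(response_text),
    }
    # Business context, so latency and quality can be sliced operationally.
    for key, value in context.items():
        attrs[f"app.context.{key}"] = value
    return attrs
```

With the OpenTelemetry Python API, a dictionary like this would typically be passed to `span.set_attributes(...)` on the span that wraps the call.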

A few things worth calling out explicitly.

The app.context.* attributes are not optional decoration. They are what allow you to slice LLM performance by what matters operationally. When TPOT degrades, you need to answer immediately: is this affecting all requests, or only high-complexity inputs with large contexts? That question cannot be answered without business context attributes on every span.


The token_efficiency_ratio is an early-warning signal. When a model starts behaving differently — whether from a provider update, prompt regression, or input pattern shift — the ratio of input to output tokens often changes before any other observable metric moves. It is a weak signal, but frequently the earliest one available.


The model_fingerprint is the most forward-looking piece. The concept is to hash stable stylistic characteristics of response output — structural patterns, sentence length distribution, punctuation density — that hold reasonably constant within a model version but shift when the underlying model changes. This is not foolproof, but it gives you a machine-readable tripwire for a class of failure that has no other detection mechanism.


The SLO problem you need to solve before you deploy

Here is where most teams get stuck, and where the right thinking separates teams that operate LLM systems from teams that merely run them.

You cannot write an SLO that says "99.5% of responses must be correct." Correctness is not machine-readable at scale. And yet, operating without a reliability target for output quality is not an acceptable posture for any production system that affects business outcomes.


Here is the framework I recommend.

Layer 1: Infrastructure SLOs — measure what is objectively measurable.

  • P99 TTFT < 1.2 seconds

  • P99 TPOT < 45 milliseconds

  • Hard LLM call error rate < 0.5%

  • Token cost per business outcome within ±25% of rolling 30-day baseline

These are your floor. Meeting them means your infrastructure is functioning. They say nothing about whether what the model is producing is actually useful.
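The four Layer 1 targets above can be evaluated mechanically. A sketch, assuming the per-window statistics are already aggregated elsewhere; the `WindowStats` type and the breach labels are hypothetical:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowStats:
    p99_ttft_s: float
    p99_tpot_ms: float
    error_rate: float                 # fraction of LLM calls that hard-failed
    cost_per_outcome: float           # token cost per business outcome, this window
    baseline_cost_per_outcome: float  # rolling 30-day baseline (must be > 0)


def infrastructure_slo_breaches(stats: WindowStats) -> list:
    """Return the names of the Layer 1 SLOs that this window violates."""
    breaches = []
    if stats.p99_ttft_s >= 1.2:
        breaches.append("ttft")
    if stats.p99_tpot_ms >= 45:
        breaches.append("tpot")
    if stats.error_rate >= 0.005:
        breaches.append("error_rate")
    drift = abs(stats.cost_per_outcome - stats.baseline_cost_per_outcome)
    if drift / stats.baseline_cost_per_outcome > 0.25:
        breaches.append("cost")
    return breaches
```

An empty list means the floor is intact. It still says nothing about output quality.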


Layer 2: Outcome SLOs — this is the real work.

Every LLM feature exists to produce some observable downstream effect: a decision is made, a workflow is triggered, a user takes an action, a downstream system accepts or rejects the input. When quality degrades, that downstream behavior changes in a measurable way — even if the LLM output itself is not directly machine-readable.


Your job as an SRE is to work with your product and engineering teams to identify that business-layer trip wire. In a risk assessment system, it might be the percentage of AI-generated summaries that trigger a downstream rules-based override. In a code generation feature, it might be the rate of generated code that fails compilation or test suites. In a customer support system, it might be the percentage of AI responses that get immediately escalated to a human agent.

The outcome SLO is: that trip wire rate must remain below X% in any rolling 24-hour window.

This is not a perfect quality signal. It is a business-layer proxy that correlates reliably with actual output degradation. It is the difference between catching a model drift event in four hours versus four days.
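A sketch of the rolling trip wire rate, assuming each LLM-backed outcome is recorded with a timestamp and a boolean saying whether the trip wire fired (an override, a failed build, an escalation). The class and function names are illustrative, and events are assumed to arrive in time order:

```python
from collections import deque


class TripWireRate:
    """Trip wire rate over a rolling time window (24 hours by default)."""

    def __init__(self, window_s: float = 24 * 3600):
        self.window_s = window_s
        self._events = deque()  # (timestamp_s, tripped), in arrival order

    def record(self, timestamp_s: float, tripped: bool) -> None:
        self._events.append((timestamp_s, tripped))

    def rate(self, now_s: float) -> float:
        """Fraction of outcomes in the window where the trip wire fired."""
        cutoff = now_s - self.window_s
        while self._events and self._events[0][0] <= cutoff:
            self._events.popleft()
        if not self._events:
            return 0.0
        tripped = sum(1 for _, fired in self._events if fired)
        return tripped / len(self._events)


def outcome_slo_met(rate: float, max_rate: float) -> bool:
    """The outcome SLO holds while the rate stays below the agreed X%."""
    return rate < max_rate
```

The value of `max_rate` is the X% from the SLO statement and has to come from your own baseline. Nothing here suggests a default.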


Layer 3: Error budget — the forcing function.

Once you have outcome SLOs, the error budget conversation becomes real. When your outcome SLO is breached, you burn error budget. When enough budget is burned, you halt LLM-dependent feature releases — exactly as you would halt releases when a traditional service SLO is at risk. This is the organizational step that transforms LLM quality from a "product concern" into an engineering accountability.
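One way to make the budget concrete, assuming the outcome SLO is evaluated at a fixed cadence (say hourly) and the target is expressed as the fraction of evaluations that must pass over the budget period. The 99% target and the function names are assumptions, not a recommendation:

```python
def error_budget_remaining(window_results: list, target: float = 0.99) -> float:
    """Fraction of error budget left; window_results[i] is True if the SLO held."""
    allowed = (1 - target) * len(window_results)
    breached = window_results.count(False)
    if allowed == 0:
        return 0.0 if breached else 1.0
    return 1 - breached / allowed


def releases_frozen(window_results: list, target: float = 0.99) -> bool:
    """Halt LLM-dependent releases once the budget is exhausted."""
    return error_budget_remaining(window_results, target) <= 0
```

With hourly evaluation over 30 days there are 720 windows, so a 99% target allows 7.2 breaching hours. Three breaches leave roughly 58% of the budget, and eight exhaust it.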


Most teams with AI features in production have never had this conversation. They treat model quality as something that happens to them, not something they are responsible for operating. The teams that get this right treat it as a reliability contract.


Three architectural decisions that change your posture

Beyond instrumentation, there are three design choices that separate reactive LLM operations from proactive ones.


Make provider identity a first-class operational signal. Never rely solely on a model name string in your configuration. Capture a response fingerprint at call time. Run a rolling distribution comparison against a baseline window. Alert on meaningful shifts — not on individual outliers, but on sustained distribution changes over a 30-minute window. Your provider SLA says nothing about behavioral consistency within a named model endpoint. Your observability stack has to be the backstop.
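The rolling comparison can be sketched with total variation distance between the fingerprint distributions of a baseline window and the current window. The choice of distance, the threshold, and the evaluation cadence are all assumptions here:

```python
from collections import Counter


def total_variation(baseline: list, current: list) -> float:
    """Total variation distance between two samples of fingerprints (0.0 to 1.0)."""
    if not baseline or not current:
        raise ValueError("both windows need at least one sample")
    b, c = Counter(baseline), Counter(current)
    return 0.5 * sum(
        abs(b[k] / len(baseline) - c[k] / len(current)) for k in set(b) | set(c)
    )


def sustained_shift(distances: list, threshold: float, consecutive: int) -> bool:
    """True only if the last `consecutive` evaluations all exceeded the threshold."""
    recent = distances[-consecutive:]
    return len(recent) == consecutive and all(d > threshold for d in recent)
```

Evaluating every 10 minutes with `consecutive=3` approximates the 30-minute sustained window, so a single outlier batch never pages anyone.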


Alert on TTFT and TPOT separately, and route them differently. TTFT degradation typically points to model load, quota throttling, or cold start behavior. TPOT degradation often points to sequence-level issues — long context, complex generation patterns, or changes in model verbosity. These have different causes, different escalation paths, and different remediation playbooks. Collapsing them into a single latency SLO means you are treating fundamentally different problems as the same problem.


Version your prompts like production code, and treat prompt deployments like service deployments. Every LLM call should carry a prompt version identifier as a span attribute. Prompt changes should go through the same review and staged rollout process as code changes. And output quality signals should be tracked before and after every prompt version change, with the ability to roll back if quality metrics move adversely. Without this, prompt engineering is an invisible source of production risk — changes that affect reliability with no deployment event to correlate against.
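A sketch of the before-and-after comparison, assuming the prompt version is derived from the template text and that each outcome event carries the version plus a trip wire flag. The helper names and the 2-percentage-point rollback margin are hypothetical:

```python
import hashlib
from collections import defaultdict


def prompt_version(template: str) -> str:
    """Content-derived version id, suitable as a span attribute."""
    return hashlib.sha256(template.encode("utf-8")).hexdigest()[:10]


def trip_rate_by_version(events: list) -> dict:
    """events: (prompt_version, tripped) pairs -> trip wire rate per version."""
    totals = defaultdict(lambda: [0, 0])  # version -> [tripped, total]
    for version, tripped in events:
        totals[version][0] += int(tripped)
        totals[version][1] += 1
    return {version: t / n for version, (t, n) in totals.items()}


def should_roll_back(rates: dict, old: str, new: str, max_increase: float = 0.02) -> bool:
    """Roll back if the new version's rate exceeds the old by more than the margin."""
    return rates[new] - rates[old] > max_increase
```

Hashing the template means an unreviewed edit still produces a new version id, so there is always a change event to correlate quality shifts against.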


The discipline underneath all of this

There is a larger shift in thinking that all of this points toward.


Traditional SRE was built on a foundational assumption: failures announce themselves. Services crash. Latency spikes. Errors accumulate. The system has a vocabulary for telling you when something is wrong.


LLMs are the first category of widely deployed production system where that assumption breaks routinely. The system completes successfully. It is fluent. It is confident. And it is producing output that ranges from subtly degraded to meaningfully wrong, with no signal at any layer you are currently watching.


This is not just an instrumentation gap. It is a conceptual gap.


The question that defined SRE for twenty years — is the service up? — is no longer sufficient when the service can be "up" and wrong simultaneously.


The question that needs to be added is: is the system reasoning correctly?


That question requires a different observability model. It requires SLOs at the outcome layer, not just the infrastructure layer. It requires engineering teams and product teams to share accountability for quality signals, not treat them as separate concerns. And it requires a willingness to define "reliability" more broadly than uptime and latency.


The teams that get there first will operate AI features with the same rigor they bring to their core services. They will catch degradation in hours, not days. They will build confidence in automated systems by understanding their failure modes, not by hoping they do not occur.


That is the reliability standard that LLM-powered systems deserve.


And it is entirely achievable if you design for it deliberately, before the failure finds you.


The views expressed in this post are my own and do not represent those of my employer.


Tags: Site Reliability Engineering · Observability · LLM in Production · OpenTelemetry · SLOs · AI/ML Engineering · Distributed Systems · Engineering Leadership
