You Changed the Prompt. Is the Model Better or Worse? You Don't Have a Test That Tells You.

Operating an LLM in production is not MLOps and it is not traditional ops. It is running a non-deterministic component with no ground-truth notion of 'correct,' where a one-line prompt edit is a deploy and there is no green checkmark that says it's safe to ship. The operational surface that replaces the unit test: evals as your test suite, prompts and model versions as deployable config, RAG freshness, observability for systems that have no 'wrong answer' to alert on, and rollout you can't fully validate offline.

By Sharon Sahadevan · June 10, 2026 · 13 min read

You change one line of a system prompt to fix an edge case a customer reported. You open the pull request. Now answer a simple question: did that change make the model better or worse? Not for the one case you were fixing, but across everything else it touches. The PR has tests, and they pass, because the tests check that the API returns 200 and the JSON parses. None of them check the only thing that matters, which is whether the answers got worse. You merge it because the diff is one line and it looks obviously safe. Two weeks later support escalations are up and nobody connects it to the prompt change, because there is no signal that would.

This is the defining problem of operating LLMs in production, and it is why "LLM operations" is its own discipline rather than a flavor of the two it gets confused with. It is not MLOps: you are almost certainly not training the model, and the training-side knowledge most MLOps content obsesses over (PyTorch, distributed training, gradient accumulation) is increasingly vendor-owned. And it is not traditional ops: traditional ops manages deterministic systems where the same input gives the same output and a test suite tells you green or red. An LLM gives a distribution of outputs, has no ground-truth notion of "correct," and changes behavior in response to inputs you don't think of as code: the prompt, the model version, the temperature, the retrieved context. The serving mechanics (batching, KV cache, autoscaling) are a solved-ish engineering problem with great tooling. The hard part, the part senior MLOps interviews actually probe, is operating a component you cannot unit-test and cannot fully predict. This post is that operational surface.

KEY CONCEPT

Traditional software is deterministic, so a passing test suite is a license to ship. An LLM system is non-deterministic and has no ground-truth "correct," so the passing-test-equals-safe-to-ship contract simply does not hold. Every instinct you built operating deterministic services (write a test, green means go, alert on errors, roll back on a spike) has to be rebuilt for a system whose failure mode is not an error but a worse answer that still returns 200. LLM ops is the discipline of regaining deploy confidence over a thing you can't unit-test.

Why there is no unit test for "is the answer good"

Two properties break the test suite. First, non-determinism: at any temperature above zero the model samples, so the same prompt yields different completions, and even at temperature zero, batching and floating-point non-associativity on the GPU make bit-identical output across runs unreliable. You cannot assert output == expected when there is no single expected output. Second, no ground truth: for an open-ended generation (a summary, an answer, an explanation) there is no canonical correct string to compare against. "Good" is a fuzzy, multi-dimensional judgment (accurate, grounded, relevant, appropriately formatted, not refusing when it shouldn't) that a string equality check cannot express.
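
Both properties are easy to see in a few lines. The sketch below first shows floating-point non-associativity in plain Python (the same arithmetic effect that makes reduction order matter on a GPU), then shows the kind of property check that stands in for `output == expected`. The `check_properties` helper, its rules, and the sample completions are hypothetical, chosen only for illustration.

```python
import json

# Floating-point addition is not associative: grouping changes the result.
# When reduction order varies between runs, the same effect can nudge logits.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)  # False


def check_properties(output: str, max_chars: int = 500) -> dict[str, bool]:
    """Hypothetical property checks that many different valid completions satisfy."""
    try:
        parsed = json.loads(output)
        valid_json = isinstance(parsed, dict)
    except json.JSONDecodeError:
        parsed, valid_json = None, False
    return {
        "valid_json": valid_json,
        "has_answer": valid_json and bool(parsed.get("answer")),
        "under_length_cap": len(output) <= max_chars,
    }


# Two different completions, both acceptable: equality fails, properties pass.
a = '{"answer": "Restart the pod.", "source": "runbook-12"}'
b = '{"answer": "Delete the pod so it restarts.", "source": "runbook-12"}'
print(a == b)                              # False
print(all(check_properties(a).values()))   # True
print(all(check_properties(b).values()))   # True
```

Property checks like these only cover the programmable dimensions. The fuzzy ones (accurate, grounded, relevant) still need a scored evaluation.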

So the artifact you are shipping is not code with a defined output. It is a probability distribution over outputs, and the things that reshape that distribution (a prompt edit, a model-version bump, a temperature change, a new retrieval index) are exactly the things your deterministic test suite is blind to. A one-character change to a system prompt can shift refusal behavior across thousands of prompts. A provider's silent model update can change your output distribution overnight while your code is byte-for-byte identical. The deploy-confidence loop has to be rebuilt around that reality.

Evals: the test suite you actually need

The replacement for unit tests is an evaluation suite: a curated set of inputs run against the system, with the outputs scored, on every change that touches behavior. The pieces:

A golden dataset. A representative, version-controlled set of inputs: your real traffic distribution, your known edge cases, the bug reports you've fixed (each fixed bug becomes a permanent regression case, exactly like a regression test). This is the highest-leverage asset in the whole system, and it is built from production traffic, not imagined in advance.
A scoring method. For each output, a score on the dimensions you care about. Some can be checked programmatically (does it return valid JSON, does it cite a source, is it under the length cap, did it leak PII). Others are judgment calls scored by LLM-as-judge: a strong model grades the output against a rubric. LLM-as-judge is powerful and now standard, but it has to be calibrated against human labels, because the judge has its own biases (it favors longer answers, its own style, the first option presented). An uncalibrated judge gives you confident, wrong scores.
A gate. Evals run in CI on every prompt edit, model bump, and retrieval change, and the aggregate score has to clear a threshold to merge. The one-line prompt PR from the opening now has a real check: it either holds the eval score or it doesn't ship.
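
The calibration step for LLM-as-judge can be sketched as an agreement check between the judge's labels and human labels on the same outputs. The example below uses Cohen's kappa, which corrects raw agreement for chance; that choice of statistic, and the ten labels, are assumptions for illustration, not something the eval setup above prescribes.

```python
from collections import Counter


def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Agreement between two label lists, corrected for chance agreement."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    human_freq, judge_freq = Counter(human), Counter(judge)
    expected = sum(
        (human_freq[label] / n) * (judge_freq[label] / n)
        for label in set(human) | set(judge)
    )
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels for the same ten outputs.
human_labels = ["pass", "pass", "fail", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
judge_labels = ["pass", "pass", "fail", "pass", "pass", "pass", "fail", "pass", "pass", "pass"]

raw_agreement = sum(h == j for h, j in zip(human_labels, judge_labels)) / len(human_labels)
kappa = cohens_kappa(human_labels, judge_labels)
print(f"raw agreement={raw_agreement:.2f} kappa={kappa:.2f}")
```

In this made-up sample the judge agrees with humans on 8 of 10 outputs, but both disagreements are cases where the judge passed something a human failed, and kappa comes out noticeably lower than raw agreement. What level of agreement is good enough to let the judge gate deploys is a call for your team.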

PRO TIP

Build the golden dataset from production, not from imagination. The cases you invent at your desk are the cases you already handle; the cases that break you are the ones real users send that you never anticipated. Sample real traffic (especially the requests that triggered a complaint, a refusal, or a thumbs-down), label them, and fold them into the eval set. Every production incident should end with a new permanent eval case, so the same regression can never ship twice. The dataset that started as 50 hand-written examples and grew to 2,000 from real traffic is the one that actually catches regressions.
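
Put together, the golden dataset, the scoring method, and the gate might look like the following minimal sketch. Everything named here is hypothetical: `EvalCase`, the three programmatic checks, the 0.9 threshold, and `fake_generate`, which stands in for the real model call so the example runs offline.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    case_id: str        # e.g. the incident that produced this case
    prompt: str
    must_mention: str   # hypothetical per-case expectation


def score_output(case: EvalCase, output: str) -> float:
    """Average of simple programmatic checks, each worth 0 or 1."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(parsed, dict):
        return 0.0
    answer = str(parsed.get("answer", ""))
    checks = [
        bool(answer),
        case.must_mention.lower() in answer.lower(),
        bool(parsed.get("source")),
    ]
    return sum(checks) / len(checks)


def run_gate(
    cases: list[EvalCase], generate: Callable[[str], str], threshold: float
) -> tuple[bool, float]:
    """Return (passed, mean_score) for a candidate configuration."""
    scores = [score_output(case, generate(case.prompt)) for case in cases]
    mean_score = sum(scores) / len(scores)
    return mean_score >= threshold, mean_score


def fake_generate(prompt: str) -> str:
    """Stand-in for the real model call."""
    return json.dumps({"answer": f"Rotate the key, then retry: {prompt}", "source": "kb-7"})


golden = [
    EvalCase("incident-101", "API returns 401 after deploy", must_mention="rotate the key"),
    EvalCase("incident-102", "How do I reset my password?", must_mention="reset link"),
]
passed, mean_score = run_gate(golden, fake_generate, threshold=0.9)
print(f"passed={passed} mean_score={mean_score:.2f}")
```

Here the second case fails its `must_mention` check, the mean drops below the threshold, and the gate reports a failure. In CI, the calling script would exit non-zero on `passed == False` so the merge is blocked.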

Everything that shapes behavior is config, so version it

In a deterministic service, behavior lives in code, and code is versioned, reviewed, and rollback-able. In an LLM system, behavior lives in a set of artifacts that teams routinely treat as casual settings:

The prompt. A system prompt is not a string constant; it is the single biggest lever on behavior. It belongs in version control, behind review, with the eval gate above. A prompt edit is a deploy. Editing it directly in a dashboard with no version history is the equivalent of SSHing into prod and editing the running binary.
The model version. Pin it. "Use the latest" means your behavior changes when the provider ships an update you didn't choose and can't roll back. Pin to a specific version, test the upgrade through evals like any other change, and control when it lands.
Sampling parameters. Temperature, top-p, max tokens: they change the output distribution. Versioned config, not magic numbers scattered through the code.
Retrieval configuration. For a RAG system, the embedding model, chunk size, top-k, and the index contents are all behavior. Re-embedding with a new model is a behavior change that needs to go through evals.

The discipline is simple to state and rare to find: treat the prompt, the model version, the sampling params, and the retrieval config as a single versioned, deployable, rollback-able unit, gated by evals, exactly as you'd treat application code. Most "mysterious" LLM regressions are an un-versioned change to one of these that nobody recorded.
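
One way to make "a single versioned unit" concrete is to bundle the four artifacts into one immutable object and derive a fingerprint from its contents. This is a sketch under assumptions: the field names, the `BehaviorConfig` class, and every value below are placeholders, not real model or index identifiers.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class BehaviorConfig:
    """Everything that shapes output, bundled as one deployable unit."""
    system_prompt: str
    model_version: str    # a pinned identifier, never "latest"
    temperature: float
    top_p: float
    max_tokens: int
    embedding_model: str
    chunk_size: int
    top_k: int
    index_snapshot: str   # which build of the retrieval index is served

    def fingerprint(self) -> str:
        """Stable short hash to stamp on eval runs, traces, and rollbacks."""
        canonical = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


# Placeholder values for illustration only.
v1 = BehaviorConfig(
    system_prompt="You are a support assistant. Cite a source for every answer.",
    model_version="example-model-2026-01-15",
    temperature=0.2,
    top_p=0.9,
    max_tokens=512,
    embedding_model="example-embedder-v3",
    chunk_size=800,
    top_k=5,
    index_snapshot="kb-2026-06-01",
)

# A one-sentence prompt edit produces a new config with a new fingerprint.
v2 = replace(v1, system_prompt=v1.system_prompt + " Be concise.")
print(v1.fingerprint() != v2.fingerprint())  # True
```

Because the dataclass is frozen, nothing can mutate a deployed config in place; a change has to produce a new object with a new fingerprint, which is the thing the eval gate and the rollback path can refer to.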

RAG: the context can rot under you

If you're running retrieval-augmented generation (and most production LLM systems are) you've added a component that degrades silently over time even when nothing in your code changes. The model's answer is only as good as the context retrieved for it, and retrieval quality decays: the index goes stale as the underlying documents change, embeddings drift from the current query distribution, and a document that used to rank top-k for a query stops doing so as the corpus grows. The model then answers confidently from stale or wrong context, and you get a grounded-looking hallucination, the worst failure mode, because it looks authoritative.

So RAG adds its own operational surface: freshness (a re-indexing pipeline with monitored lag, so the index reflects current truth), retrieval evals (precision/recall on whether the right documents are retrieved, scored separately from the final answer, because a bad answer might be a retrieval failure or a generation failure and you need to know which), and groundedness scoring (does the answer actually follow from the retrieved context, or did the model fill gaps from its parametric memory). When an answer is wrong, the first diagnostic question is always "did we retrieve the right context, and did the model stick to it," and you can only answer that if you instrumented both halves.
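
Two of those signals are simple enough to sketch directly: precision and recall at k for a single labeled query, and index lag as a freshness measure. The document ids, timestamps, and function names below are hypothetical, and the retrieval metric assumes the retrieved list contains no duplicate ids.

```python
from datetime import datetime, timezone


def precision_recall_at_k(
    retrieved: list[str], relevant: set[str], k: int
) -> tuple[float, float]:
    """Precision@k and recall@k for one query, given labeled relevant doc ids."""
    top_k = retrieved[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def index_lag_seconds(last_source_update: datetime, last_indexed: datetime) -> float:
    """How far the index trails the source; zero means the index is current."""
    return max(0.0, (last_source_update - last_indexed).total_seconds())


# Hypothetical labeled query: two documents are known to be relevant.
retrieved_ids = ["doc-3", "doc-9", "doc-1", "doc-4"]
relevant_ids = {"doc-1", "doc-2"}
precision, recall = precision_recall_at_k(retrieved_ids, relevant_ids, k=3)

lag = index_lag_seconds(
    last_source_update=datetime(2026, 6, 10, 12, 0, tzinfo=timezone.utc),
    last_indexed=datetime(2026, 6, 10, 9, 0, tzinfo=timezone.utc),
)
print(precision, recall, lag)
```

Scoring retrieval on its own, before looking at the generated answer, is what lets you say "retrieval missed doc-2" rather than just "the answer was wrong." Groundedness scoring is a separate, judgment-based check and is not covered by this sketch.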

Observability when there is no "wrong answer" to alert on

Traditional observability alerts on errors and latency: discrete, unambiguous failure signals. An LLM's characteristic failure is not an error; it's a plausible answer that's subtly wrong, which throws no exception and emits no 500. You cannot alert on "the answer was bad," so you instrument the proxies that correlate with quality and cost and watch them move:

Per-request traces: the full prompt, retrieved context, model version, sampling params, and output for every request, so you can reconstruct why a specific answer happened. Without this, debugging a bad output is archaeology.
The cost and latency signals: tokens in/out per request (cost), time-to-first-token and throughput (the latency SLO). These connect directly to the serving and cost posts: the cost side is Prompt Economics, the latency side is set by continuous batching and the KV-cache pressure the KV cache wall post describes.
The behavioral proxies: refusal rate (a spike means the model started declining things it shouldn't, often after a prompt or model change), guardrail trigger rate, output length distribution, and retrieval hit quality. The strongest online signal of all is user feedback (thumbs, regenerate clicks, abandonment): it is your only continuous source of real-world quality labels, and it feeds straight back into the golden dataset.
Drift: the input distribution shifts (users ask new things), the output distribution shifts (the model changed), and the gap between offline eval performance and online behavior widens. You watch for the divergence because offline evals can pass while production quietly degrades.
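
As a rough illustration of the trace and proxy ideas above, the sketch below records a per-request trace and computes refusal rate grouped by configuration, so a spike points at a specific change. The `Trace` fields, the fingerprint strings, and the keyword-based `is_refusal` detector are all assumptions; a keyword match is a crude stand-in for whatever refusal detection you actually use.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Trace:
    """One request, with enough context to reconstruct why the answer happened."""
    request_id: str
    config_fingerprint: str  # identifies prompt + model version + sampling params
    prompt: str
    retrieved_doc_ids: tuple[str, ...]
    output: str
    tokens_in: int
    tokens_out: int


# Crude, hypothetical markers; a production detector would likely be a classifier.
REFUSAL_MARKERS = ("i can't help with", "i cannot help with", "i'm unable to")


def is_refusal(output: str) -> bool:
    text = output.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def refusal_rate_by_config(traces: list[Trace]) -> dict[str, float]:
    """Refusal rate per config fingerprint."""
    totals: dict[str, int] = defaultdict(int)
    refusals: dict[str, int] = defaultdict(int)
    for trace in traces:
        totals[trace.config_fingerprint] += 1
        if is_refusal(trace.output):
            refusals[trace.config_fingerprint] += 1
    return {fp: refusals[fp] / totals[fp] for fp in totals}


def make_trace(i: int, fingerprint: str, output: str) -> Trace:
    """Helper that fills in the fields this example does not vary."""
    return Trace(f"req-{i}", fingerprint, "How do I export my data?", ("doc-1",), output, 120, 40)


traces = [
    make_trace(1, "aaa111", "Go to Settings, then Export."),
    make_trace(2, "aaa111", "Use the Export button in Settings."),
    make_trace(3, "aaa111", "I can't help with that request."),
    make_trace(4, "aaa111", "Open Settings and choose Export."),
    make_trace(5, "bbb222", "I'm unable to assist with exports."),
    make_trace(6, "bbb222", "I cannot help with data requests."),
    make_trace(7, "bbb222", "Go to Settings, then Export."),
    make_trace(8, "bbb222", "I can't help with that."),
]
rates = refusal_rate_by_config(traces)
print(rates)
```

Grouping by fingerprint is the design choice that matters: the same dashboard that shows the refusal rate moving also shows which prompt, model version, or sampling change it moved with.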

Rollout you can't fully validate offline

Because evals approximate production but never fully capture it, you ship behavior changes the way you ship risky infra changes: incrementally, with online comparison. Canary the new prompt or model version on a slice of live traffic and compare its online metrics (refusal rate, user feedback, cost, latency) against the control before widening. Where you can, run shadow traffic: send real requests to the new configuration in parallel without serving its output to users, and score the shadow outputs offline against the live ones. The offline eval gate gives you the confidence to start the rollout; the online canary gives you the confidence to finish it. Neither alone is enough, because the eval set is a sample and production is the population.
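
The canary comparison can be sketched in two parts: a deterministic split that keeps each user on one side, and a statistical comparison of an online metric between control and canary. The hash-based bucketing, the two-proportion z-test, and the counts below are assumptions for illustration; your routing layer and your choice of test may differ.

```python
import hashlib
import math


def in_canary(user_id: str, canary_percent: int) -> bool:
    """Deterministically assign a user to the canary by hashing the id."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < canary_percent


def two_proportion_z(bad_a: int, n_a: int, bad_b: int, n_b: int) -> float:
    """z-statistic for the difference between two rates (b minus a)."""
    p_a, p_b = bad_a / n_a, bad_b / n_b
    pooled = (bad_a + bad_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 0.0
    return (p_b - p_a) / std_err


# Hypothetical counts: thumbs-down events out of rated responses.
# Control: 40 of 2000. Canary: 18 of 400.
z = two_proportion_z(bad_a=40, n_a=2000, bad_b=18, n_b=400)
print(f"z={z:.2f}")
```

With these made-up numbers the canary's thumbs-down rate is higher than the control's by a margin that a z-test at the usual 1.96 cutoff would flag, so you would hold the rollout rather than widen it. Hashing the user id keeps assignment stable across requests, which avoids one user seeing two behaviors mid-conversation.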

This is also where the serving infrastructure rejoins the story: a behavioral canary is a traffic-splitting and routing problem on top of your inference fleet, which is the LLM Inference on Kubernetes layer, running on GPUs whose economics are the GPU Cost Optimization story. LLM ops is the behavioral-reliability layer sitting on top of the serving stack the other posts cover, not a replacement for it.

Common mistakes

Treating a prompt edit as a config tweak, not a deploy. It is the biggest behavior lever you have. Version it, review it, gate it on evals, and keep the rollback path.

"Use the latest model." Unpinned model versions mean the provider reshapes your behavior on their schedule. Pin, and upgrade deliberately through evals.

No golden dataset, or one built at a desk. Without a representative, production-sourced eval set, you have no test suite at all. You're shipping on vibes. Build it from real traffic and grow it with every incident.

Uncalibrated LLM-as-judge. A judge model is a measurement instrument; an uncalibrated instrument gives precise, wrong numbers. Validate it against human labels before you trust its scores to gate deploys.

Scoring only the final answer in RAG. When the answer is wrong, you can't tell whether retrieval failed or generation failed unless you evaluate both. Score retrieval quality and groundedness separately.

Alerting only on errors and latency. The characteristic LLM failure throws no error. Instrument refusal rate, feedback, groundedness, and drift, or you'll learn about regressions from your support queue.

Shipping behavior changes all-at-once. Offline evals are a sample, not the population. Canary on live traffic and compare online before widening.

Ignoring the stale index. RAG quality decays with no code change. Monitor freshness and re-index on a tracked cadence.

The mental model

Operating a deterministic service, you ship code: a fixed mapping from input to output that a test suite can pin down, where green means safe and an error means rollback. Operating an LLM, you ship a probability distribution over outputs whose shape is set by a bundle of artifacts (prompt, model version, sampling, retrieved context) none of which a string-equality test can capture, and whose failure mode is a confident wrong answer that returns 200. Every operational practice follows from that one shift. The test suite becomes an eval suite over a golden dataset. The config becomes versioned, gated, rollback-able behavior. Observability moves from errors-and-latency to proxies-for-quality, because there is no exception to catch. Rollout becomes a behavioral canary, because offline never fully predicts online.

That is the gap between knowing what a transformer is and being able to operate one in production, and it's the gap senior MLOps interviews are built to find, the same way the broader reasoning gap shows up across every senior infra loop. The training side is increasingly someone else's problem. The serving side is a well-equipped engineering problem. The behavioral-reliability side, regaining deploy confidence over a system you can't unit-test, is the one that's actually yours, and it's where the discipline lives.

The full LLM operations curriculum (foundations through model lifecycle, prompting and context, inference and performance, production architectures (RAG, agents, multimodal), and the safety, evaluation, governance, and cost-engineering surface this post is about) is the LLM Operations course. The serving infrastructure it runs on is the LLM Inference on Kubernetes course, the GPU foundations beneath that are Production GPU Infrastructure, and the economics are GPU Cost Optimization. Related reading: "Prompt Economics" for the token-cost signal you observe per request; "Your GPU Finishes a Request and Waits for the Slowest" and "Your LLM Cluster Is at 90% HBM and 60% Is KV Cache" for the serving layer underneath the behavioral one; and "Most Courses Teach Tools. Senior DevOps Interviews Test Architecture." for the reasoning frame these MLOps interview questions test.
