LLMs create a new blind spot in observability

LLMs break traditional observability — and that creates a compliance gap most governance teams haven't addressed yet. If you can't trace the full AI pipeline, you can't audit it.


Traditional observability stacks — metrics, logs, traces — were built for deterministic systems. A request goes in, a response comes out, and when something breaks, you trace backward through the chain. That model worked for microservices. It does not work for LLMs.

Shahar Azulay's piece in The New Stack lays out the problem clearly: LLM-powered applications are probabilistic, multistep, and constantly evolving. The same input doesn't produce the same output. A single user query can trigger retrieval, multiple model calls, tool execution, parsing, and retries. Prompt templates change weekly, model versions get swapped without ceremony, and quality fluctuates without warning. Logs don't explain why a model hesitated. Metrics can't tell you if a hallucination landed on a customer's screen.

For product counsel and AI governance teams, this isn't just an engineering headache. It's a compliance gap hiding in plain sight.

The signals that actually matter

Azulay identifies the new telemetry dimensions teams need to track: token usage (because cost scales directly with prompt design), latency in the critical path, error rates across model and tool calls, and — critically — response quality, including hallucinations. None of these map cleanly to CPU, memory, or request counts. Which means traditional monitoring gives you a false sense of coverage.

What I find most instructive is the framing around prompt versioning. The article argues that prompt versions and runtime substitutions should be treated as first-class signals — "version control for language." When quality degrades, teams should be able to trace it back to a prompt change the same way they'd trace a regression to a code deploy. For product teams building AI features, that's a governance requirement masquerading as an engineering best practice.

Cost, quality, and compliance are the same conversation

One of the sharper observations in the piece: the biggest reliability issues are often cost issues in disguise. A premium model running tasks a smaller one handles fine. A verbose prompt driving half the monthly token bill. A hallucination traced not to the model but to stale context pulled from a vector store weeks ago.
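Surfacing those disguised cost issues is an aggregation problem once per-call telemetry exists. A minimal sketch with made-up call sites and an illustrative price:

```python
from collections import defaultdict

PRICE_PER_1K = {"premium": 0.03}  # illustrative, not real pricing

# Hypothetical records; in practice these come from your telemetry pipeline.
records = [
    {"site": "summarize", "model": "premium", "tokens": 120_000},
    {"site": "summarize", "model": "premium", "tokens": 110_000},
    {"site": "classify",  "model": "premium", "tokens": 8_000},
]

def cost_share(records: list[dict]) -> dict[str, float]:
    """Fraction of total spend per call site: finds the verbose prompt
    quietly driving most of the monthly bill."""
    spend: dict[str, float] = defaultdict(float)
    for r in records:
        spend[r["site"]] += r["tokens"] / 1000 * PRICE_PER_1K[r["model"]]
    total = sum(spend.values())
    return {site: cost / total for site, cost in spend.items()}

shares = cost_share(records)
# "summarize" accounts for 230K of 238K tokens, roughly 97% of spend,
# and is also the obvious candidate for a smaller model or a tighter prompt.
```

The same aggregation, grouped by retrieval source instead of call site, is how you would find the stale vector-store context behind a hallucination.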

In practice, that entanglement of cost and quality is something product counsel should pay attention to. If your team can't see these patterns, it can't optimize for them — and it certainly can't explain them to a regulator or to a customer who received a hallucinated response. Observability isn't just operational hygiene; it's the evidentiary backbone of any defensible AI deployment.

Security as a first-order constraint

The article flags something that too many organizations are learning the hard way: AI workloads routinely carry customer data, internal documents, and proprietary knowledge directly into prompts. That means observability data itself becomes sensitive. Sending prompts or completions to a third-party monitoring service may violate the very data protection commitments your privacy team spent months negotiating.

Many organizations are responding by keeping LLM telemetry inside their own cloud boundaries — self-hosted or bring-your-own-cloud deployments. That's the right instinct. But the harder question is whether your data governance framework even accounts for observability data as a distinct category. Most don't. If your DPA covers model inputs and outputs but says nothing about the telemetry layer watching both, you have an unaddressed risk.
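One concrete mitigation is to sanitize telemetry before it crosses any boundary: export a content hash (enough to correlate and version) instead of raw prompt text, and scrub obvious PII from what remains. A sketch; the single email regex below is illustrative, not a complete PII detector:

```python
import hashlib
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def sanitize_for_export(prompt: str, completion: str) -> dict:
    """Telemetry-safe view of a call: no raw prompt leaves the boundary."""
    return {
        # A hash lets you correlate and deduplicate calls without
        # shipping customer data to a third-party monitoring service.
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
        "prompt_tokens_approx": len(prompt.split()),
        "completion_redacted": EMAIL.sub("[EMAIL]", completion),
    }

out = sanitize_for_export(
    "Draft a reply to alice@example.com about her refund.",
    "Sure, I will email alice@example.com today.",
)
# out["completion_redacted"] == "Sure, I will email [EMAIL] today."
```

The raw prompt and completion can still be retained inside your own boundary for debugging; the point is that the exported layer carries only what your DPA can defend.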

The shift Azulay describes — from debugging code to evaluating model behavior, from request traces to workflow traces, from uptime to quality and correctness — has direct implications for how legal and governance teams need to think about AI in production.

First, if you're advising on AI product launches, ask whether the engineering team has observability that covers the full agent pipeline, not just the model call. A system that can't trace from retrieval through tool execution to final output is a system you can't audit.

Second, treat prompt management as a governance surface. If prompts are changing weekly without version control or quality tracking, your risk profile is changing weekly too — and nobody's documenting it.

Third, build cost visibility into your AI governance framework. Token economics aren't just a finance concern. Runaway costs signal architectural decisions that may also carry quality and compliance implications.
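The first point above — tracing the full agent pipeline, not just the model call — amounts to recording one trace with a span per stage. A toy sketch (stage names and attributes are illustrative; a production system would emit OpenTelemetry spans with the same shape):

```python
import time
from contextlib import contextmanager

trace: list[dict] = []

@contextmanager
def span(stage: str, **attrs):
    """Record a stage of the workflow, with its duration and attributes."""
    start = time.perf_counter()
    try:
        yield
    finally:
        trace.append({"stage": stage,
                      "ms": (time.perf_counter() - start) * 1000,
                      **attrs})

# One user query, three auditable stages:
with span("retrieval", source="vector_store", doc_age_days=42):
    docs = ["context chunk"]          # stand-in for a real retrieval step
with span("model_call", model="big-model", prompt_version="a1b2c3"):
    answer = "draft answer"           # stand-in for a real model call
with span("tool_exec", tool="crm_lookup"):
    pass                              # stand-in for a real tool invocation
```

An auditor reading `trace` can now see that the final output leaned on context 42 days old and which prompt version produced it — exactly the end-to-end visibility a bare model-call log cannot provide.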

The article's closing line is worth repeating: LLM observability doesn't just make AI applications more reliable. It makes them cheaper, safer, and genuinely worthy of being called production-ready. For anyone responsible for AI governance, "production-ready" should be the minimum threshold — and observability is how you prove you've met it.
