You actually can't see your agents

Agent observability is a distributed systems problem, and OpenTelemetry is the backbone

4 min read

Your agent monitoring is probably wrong. If you're tracking latency, error rates, and uptime, you're measuring the server, not the agent. An agent can return 200 OK and still do the wrong thing. Traditional monitoring tells you the system is healthy. It doesn't tell you the system made a good decision.

This is the observability gap for AI agents, and it's widening as agents move from demos to production systems that take real actions. You can't debug an agent by looking at HTTP status codes. You need to see what the agent decided, why it decided it, what tools it called, what data it used, and where the reasoning went wrong.

OpenTelemetry gives you a backbone for this. It wasn't designed for AI agents specifically, but the mental model maps well: an agent session is a trace, each action is a span, and the attributes and events on those spans capture the semantic signals you actually need.
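Here's that mapping as a minimal sketch, using the OpenTelemetry Python SDK. The span names and attribute keys are illustrative rather than any official convention, and the session content is made up:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

# Standard SDK setup; swap the console exporter for your real backend.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("agent")

# One agent session = one trace, rooted in a session span.
with tracer.start_as_current_span("agent.session") as session:
    session.set_attribute("session.user_request", "refund order 1234")

    # Each action the agent takes = one child span.
    with tracer.start_as_current_span("agent.llm_call") as llm:
        llm.set_attribute("llm.model", "gpt-4o")

    with tracer.start_as_current_span("agent.tool_call") as tool:
        tool.set_attribute("tool.name", "lookup_order")
```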

Why HTTP monitoring fails for agents

The fundamental problem is non-determinism. A traditional API endpoint processes the same input and produces the same output. When something breaks, you look at the request, the response, and the logs. The debugging path is linear.

Agents don't work that way. Given the same input, an agent might take different paths depending on the context it retrieves, the order it processes information, and the intermediate results from tool calls. A debugging path for an agent looks more like a tree than a line. You need to trace the entire reasoning chain, not just the input and output.

The questions you need to answer are different too. "Was the API available?" becomes "Did the agent use the right tool?" "Was the response fast enough?" becomes "Did the agent's reasoning chain lead to the correct action?" "Did the request succeed?" becomes "Did the agent's decision match what a human would have done?"

None of those questions are answerable from HTTP metrics.

What good agent traces look like

A good agent trace reads like a story. You can follow the agent's reasoning from the initial request through each decision point to the final action. At each step, you can see what information the agent had, what it considered, what it chose, and why.

In OpenTelemetry terms, that means mapping agent sessions to traces, individual actions to spans, and decision context to attributes. An LLM call span should include the prompt, the model response, the token count, and the cost. A tool call span should include the tool name, the arguments, and the result. A decision span should include the options considered and the criteria applied.
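As a sketch, those attributes might be recorded with small helpers like the ones below. The attribute keys are placeholders of my own; OpenTelemetry's generative-AI semantic conventions are still settling, so check them before committing to names:

```python
import json
from opentelemetry import trace

tracer = trace.get_tracer("agent")

# Hypothetical helper: record one LLM call with the signals listed above.
def record_llm_call(prompt: str, completion: str, total_tokens: int, cost_usd: float) -> None:
    with tracer.start_as_current_span("agent.llm_call") as span:
        span.set_attribute("llm.prompt", prompt)
        span.set_attribute("llm.completion", completion)
        span.set_attribute("llm.tokens.total", total_tokens)
        span.set_attribute("llm.cost_usd", cost_usd)

# Hypothetical helper: record one tool call, arguments and result included.
def record_tool_call(name: str, arguments: dict, result: dict) -> None:
    with tracer.start_as_current_span("agent.tool_call") as span:
        span.set_attribute("tool.name", name)
        span.set_attribute("tool.arguments", json.dumps(arguments))
        span.set_attribute("tool.result", json.dumps(result)[:4000])  # cap large payloads
```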

The instrumentation work isn't trivial, but it follows a pattern. Every LLM call, every tool invocation, every permission check, and every decision point gets a span. The spans nest according to the agent's reasoning hierarchy. When something goes wrong, you trace backward from the bad outcome to the span where the reasoning diverged from what you expected.
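One way to make that pattern cheap to follow is a decorator that wraps every tool function in a span, so the nesting falls out of whatever span is active when the tool runs. A sketch, not a library API:

```python
import functools
from opentelemetry import trace

tracer = trace.get_tracer("agent")

# Hypothetical decorator: every decorated tool call gets its own span,
# nested under whatever span is current in the agent's reasoning hierarchy.
def traced_tool(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        with tracer.start_as_current_span(f"tool.{fn.__name__}") as span:
            span.set_attribute("tool.name", fn.__name__)
            result = fn(*args, **kwargs)
            span.set_attribute("tool.result.type", type(result).__name__)
            return result
    return wrapper

@traced_tool
def lookup_order(order_id: str) -> dict:
    # Real tool logic would go here.
    return {"order_id": order_id, "status": "shipped"}
```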

This is the same debugging methodology that distributed systems engineers have used for years. The difference is that agent spans carry semantic content: the usual timing and error data sit alongside actual reasoning artifacts. That makes them richer and harder to instrument, but far more useful when something goes wrong.

Session replay and cost management

Session replay — the ability to watch an agent's entire execution in detail — is powerful for debugging and essential for compliance. If an agent takes an action that causes harm, you need to reconstruct exactly what happened, step by step. That's not optional for regulated industries. It's a requirement.

The problem is cost. Full session replay in production generates enormous volumes of trace data. Most teams can't afford to capture everything. The practical approach is sampling: capture a percentage of normal sessions in full detail, and automatically trigger full capture on anomalies — high cost, long duration, errors, or unexpected tool calls.
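Sketched as plain decision logic, the policy looks something like this. In practice you'd implement it with tail-based sampling (for example in an OpenTelemetry Collector), since you only know a session was anomalous after it finishes; the thresholds and field names here are made up:

```python
import random

# Made-up thresholds; tune to your own cost and latency baselines.
BASELINE_SAMPLE_RATE = 0.05       # keep 5% of normal sessions in full detail
COST_THRESHOLD_USD = 1.00
DURATION_THRESHOLD_S = 120
EXPECTED_TOOLS = {"search_orders", "issue_refund", "send_email"}

def keep_full_session(session) -> bool:
    """Decide, once a session has finished, whether to keep its full trace."""
    anomalous = (
        session.error_count > 0
        or session.total_cost_usd > COST_THRESHOLD_USD
        or session.duration_s > DURATION_THRESHOLD_S
        or not set(session.tools_called) <= EXPECTED_TOOLS
    )
    # Always keep anomalies; keep a random slice of everything else.
    return anomalous or random.random() < BASELINE_SAMPLE_RATE
```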

This gives you two things: a statistical picture of normal agent behavior and complete visibility into the sessions where something went wrong. It's the same tradeoff that distributed systems teams make with request sampling, and it works for the same reasons.

Choosing an observability backend

Broadly, there are two categories of tooling, plus a hybrid approach that combines them. General-purpose distributed tracing systems like Jaeger, Grafana Tempo, and Datadog accept OpenTelemetry data natively and give you powerful query and visualization capabilities. They're built for volume and they integrate with your existing infrastructure. The downside is they don't understand agent semantics — you'll need to build your own dashboards and alerts for agent-specific signals.

Purpose-built AI observability platforms like Langfuse, Phoenix, and LangSmith are designed for LLM and agent workloads. They understand prompts, completions, token costs, and multi-step reasoning chains out of the box. The tradeoff is they're less mature, handle less volume, and may not integrate well with your existing monitoring stack.

The hybrid approach is usually the right answer: export OpenTelemetry data to your general-purpose system for infrastructure monitoring and to an AI-specific tool for agent debugging. You use one system for "is the agent healthy?" and another for "is the agent correct?" They're different questions and they benefit from different tooling.
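In OpenTelemetry terms that's just two exporters on the same tracer provider (or a fan-out in a Collector). A sketch, with placeholder endpoints and credentials:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

provider = TracerProvider()

# One exporter for the general-purpose backend ("is the agent healthy?")...
provider.add_span_processor(BatchSpanProcessor(
    OTLPSpanExporter(endpoint="http://tempo.internal:4318/v1/traces")))

# ...and one for the AI-specific tool ("is the agent correct?"). The endpoint
# and auth header are placeholders; check your platform's OTLP ingest docs.
provider.add_span_processor(BatchSpanProcessor(
    OTLPSpanExporter(
        endpoint="https://ai-observability.example.com/v1/traces",
        headers={"authorization": "Bearer YOUR_API_KEY"})))

trace.set_tracer_provider(provider)
```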

The operational payoff

The teams that invest in agent observability early get a compounding advantage. Every trace becomes training data for understanding agent failure modes. Every debugging session generates knowledge about where agents break and why. Over time, you build an institutional understanding of your agent systems that's impossible to develop without traces.

This is the same pattern we saw with microservices: the teams that invested in distributed tracing early shipped faster, debugged faster, and operated with more confidence than the teams that tried to operate complex systems with basic monitoring.

Agents are distributed systems. Treat them like it. The observability investment pays for itself the first time you need to explain why an agent did something wrong — and you can actually answer the question.