Why AI agents work in demos but fail in production

A new research paper from Stanford, Harvard, UC Berkeley, and Caltech — "Adaptation of Agentic AI" — provides the clearest framework I've seen for diagnosing what goes wrong when agentic AI systems move from controlled demonstrations to real-world deployment. The paper identifies three core failure modes: unreliable tool use, weak long-horizon planning, and poor generalization. More importantly, it maps a diagnostic framework that product and legal teams can actually use.

The demo-to-production gap is a supervision problem

Agentic AI systems sit on top of large language models and connect to tools, memory, and external environments. They already support software development, scientific discovery, and clinical research. But anyone who has tried to deploy one in production knows the pattern: impressive in the demo, brittle in the field.

The researchers model an agentic AI system as a foundation model agent with three components. A planning module decomposes goals into action sequences. A tool use module connects the agent to APIs, code execution environments, search engines, and browser automation. A memory module stores short-term context and long-term knowledge. Adaptation — the process of changing prompts or parameters for these components — is where things break down, and it's where this framework delivers real value.
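
To make those components concrete, here is a minimal sketch of the architecture the paper describes. The class and method names are my own illustration, not code from the paper:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Memory:
    """Memory module: short-term context plus long-term knowledge."""
    short_term: list[str] = field(default_factory=list)
    long_term: dict[str, str] = field(default_factory=dict)

@dataclass
class Agent:
    """Foundation model agent with the paper's three components (illustrative)."""
    llm: Callable[[str], str]               # the underlying foundation model
    tools: dict[str, Callable[[str], str]]  # tool-use module: APIs, code exec, search, browser
    memory: Memory = field(default_factory=Memory)

    def plan(self, goal: str) -> list[str]:
        """Planning module: decompose a goal into an action sequence."""
        steps = self.llm(f"Decompose this goal into numbered steps: {goal}")
        return [s.strip() for s in steps.splitlines() if s.strip()]

    def act(self, step: str) -> str:
        """Route a step to a matching tool, falling back to the model itself."""
        for name, tool in self.tools.items():
            if name in step.lower():
                result = tool(step)
                self.memory.short_term.append(result)
                return result
        return self.llm(step)
```

Adaptation, in the paper's sense, means changing the prompts or parameters behind `llm`, the `tools`, or the policies driving `plan` and `act`.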

Four paradigms, two dimensions

The framework defines four adaptation paradigms by combining two binary choices. The first dimension: are you adapting the agent itself or the tools it uses? The second dimension: does your supervision signal come from tool execution results or from final agent outputs?

This yields four paradigms (a brief code sketch of the mapping follows the list):

  • A1 (Tool Execution → Agent Adaptation): The agent learns from verifiable tool feedback — SQL execution accuracy, retrieval quality, code execution results. Methods like Toolformer, ToolAlpaca, Gorilla, and DeepRetrieval fall here, often optimized with reinforcement learning using tool outcomes as reward signals.
  • A2 (Agent Output → Agent Adaptation): The agent learns from the quality of its final answers. This is where the paper surfaces a key finding: systems that supervise only final outputs often teach agents to ignore their tools entirely. The agent can improve its likelihood score while bypassing the tools it's supposed to use. Effective A2 approaches either supervise both tool calls and final answers, or propagate sparse rewards through trajectories.
  • T1 (Agent-Agnostic Tool Adaptation): Tools are trained independently — optimized for retrieval accuracy, ranking quality, or simulation fidelity — without reference to a particular agent. These become reusable components.
  • T2 (Agent-Supervised Tool Adaptation): Tools are optimized under a frozen agent, which is the common scenario when your agent is a closed-source foundation model you can't modify. The learning signal flows from the agent's final outputs back to the tool's parameters.
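
Reading the two dimensions as flags makes the taxonomy easy to operationalize in a codebase. A hypothetical sketch (the enum and variable names are mine, not the paper's):

```python
from enum import Enum

class AdaptTarget(Enum):
    AGENT = "agent"   # first dimension: adapt the agent itself...
    TOOL = "tool"     # ...or the tools it uses

class SupervisionSignal(Enum):
    TOOL_EXECUTION = "tool_execution"  # second dimension: SQL accuracy, code tests, retrieval hits
    AGENT_OUTPUT = "agent_output"      # ...or the quality of the agent's final answers

# The four paradigms are the cross product of the two binary choices.
PARADIGM = {
    (AdaptTarget.AGENT, SupervisionSignal.TOOL_EXECUTION): "A1",
    (AdaptTarget.AGENT, SupervisionSignal.AGENT_OUTPUT):   "A2",
    (AdaptTarget.TOOL,  SupervisionSignal.TOOL_EXECUTION): "T1",
    (AdaptTarget.TOOL,  SupervisionSignal.AGENT_OUTPUT):   "T2",
}

# Example: tuning a retriever against a frozen, closed-source agent's final answers.
print(PARADIGM[(AdaptTarget.TOOL, SupervisionSignal.AGENT_OUTPUT)])  # -> T2
```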

Why this matters for product teams

For product teams building with agentic AI, this framework provides diagnostic clarity that has been sorely missing. When your agent fails in production, you can now pinpoint whether the breakdown is in agent adaptation, tool reliability, or the supervision signal itself. That distinction determines your fix.

The A2 finding is important for anyone deploying agents in regulated environments. If your agent learns to bypass its tools — say, a retrieval system connected to your compliance knowledge base — while still producing plausible-sounding answers, you have an invisible failure mode. The outputs look fine. The process is broken. For product counsel evaluating AI system reliability, this is the kind of architectural risk that needs to surface in technical due diligence.
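
In practice, the cheapest defense is to make that failure mode observable: log tool calls alongside final answers and track how often answers are produced without ever touching the tool. A hypothetical monitoring check, with an invented trajectory schema:

```python
def tool_bypass_rate(trajectories: list[dict]) -> float:
    """Fraction of answers produced with no successful tool call.

    Assumed (invented) schema per trajectory:
    {"final_answer": str, "tool_calls": [{"name": str, "ok": bool}, ...]}
    """
    if not trajectories:
        return 0.0
    bypassed = sum(
        1 for t in trajectories
        if not any(call.get("ok") for call in t.get("tool_calls", []))
    )
    return bypassed / len(trajectories)

logs = [
    {"final_answer": "Policy 7.2 applies.", "tool_calls": [{"name": "kb_search", "ok": True}]},
    {"final_answer": "No retention limit.", "tool_calls": []},  # plausible answer, no retrieval
]
print(f"bypass rate: {tool_bypass_rate(logs):.0%}")  # 50% of answers never hit the knowledge base
```

A rising bypass rate is exactly the "outputs look fine, process is broken" signal described above.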

The practical path forward

The researchers argue that practical systems will combine rare A1 or A2 updates on a strong base model with frequent T1 and T2 adaptation of retrievers, search policies, simulators, and memory. In other words, you train the core agent infrequently but continuously tune the tools around it.

For product teams, that translates to a concrete deployment principle: invest in observable, independently testable tool layers rather than relying on end-to-end fine-tuning of the agent itself. For legal and governance teams, it means your audit framework needs to distinguish between agent-level and tool-level failures — because the remediation paths are different.
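
One way to make the tool layer independently testable is to give each tool its own labeled evaluation set and threshold, checked without the agent in the loop (a T1-style check). A toy sketch with invented data and a stand-in retriever:

```python
def recall_at_k(retrieve, eval_set, k: int = 5) -> float:
    """retrieve(query, k) returns doc ids; eval_set pairs each query with its relevant ids."""
    hits = sum(
        1 for query, relevant in eval_set
        if set(retrieve(query, k)) & set(relevant)
    )
    return hits / len(eval_set)

# Stand-in retriever and labeled queries, just to show the shape of the check.
def toy_retrieve(query: str, k: int) -> list[str]:
    index = {"gdpr retention": ["doc-12", "doc-40"], "sox access controls": ["doc-7"]}
    return index.get(query, [])[:k]

eval_set = [("gdpr retention", ["doc-40"]), ("sox access controls", ["doc-7"])]
assert recall_at_k(toy_retrieve, eval_set, k=5) >= 0.85  # tool-level gate, agent not involved
```

Because the tool is evaluated on its own signal, a regression here points to tool adaptation rather than the agent, which is precisely the distinction the audit framework needs to preserve.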
