From testing to reviewing: evaluating AI agents that run 30-step workflows
AI agents fail in production not because of bad architecture, but because we test them like traditional software. Complex 30-step workflows can't be tested—they must be reviewed like human work. This shift changes everything for legal and product teams.
The gap between AI agent demonstrations and production deployments has defined the past two years of generative AI adoption. Systems that worked well enough to generate compelling demos consistently failed when organizations tried to scale them beyond controlled environments. Based on insights from a recent conversation between Ben Kus from Box and Harrison Chase, founder and CEO of LangChain, the core challenge wasn't architectural complexity or workflow design—it was the underlying models themselves, and our fundamental misunderstanding of how to build, evaluate, and manage systems that make autonomous decisions across dozens of steps.
Chase's perspective matters because LangChain sits at the intersection of agent development and production deployment. The patterns he describes—30-page prompts, model-maximalist architectures, and the shift from testing to reviewing—represent not speculative futures but current operational realities for teams building sophisticated AI systems. For legal teams governing AI deployments and product teams designing agent interfaces, these insights reveal why traditional software quality frameworks don't transfer to autonomous agents, and what must replace them.
Why early agents failed: insufficient model capability, not flawed architecture
The initial wave of AI agents in the GPT-3.5 era generated significant attention but failed to achieve production reliability. According to Chase, the reason was straightforward: "the models were just not good enough." These systems could handle simple sequential tasks—finding a celebrity's girlfriend's age and performing a calculation with it, for instance—but they collapsed under the weight of multi-step workflows requiring sustained context and direction.
The failure mode followed a predictable pattern. Early agents would work "one in N times," generating compelling demonstrations that suggested broader capability. Organizations would then attempt to deploy these systems in production environments, only to discover that over any workflow requiring more than a few steps, the agents would either accumulate errors or lose focus as their context window expanded. The architecture wasn't fundamentally broken; the underlying models simply lacked the capability to maintain coherent reasoning across the number of steps required for useful work.
This reality explains the recent progress in agent reliability. The breakthrough wasn't primarily new software architectures or clever prompt engineering techniques, though both contributed. The fundamental shift came from model capabilities crossing a threshold where the core tool-calling loop that inspired systems like AutoGPT could finally operate reliably. As Chase notes, the models are now "just better enough" to make the original architectural intuitions work in production.
Deep agents and the model-maximalism approach
The architectural pattern that has emerged over the past six months reflects what Chase calls "Deep Agents"—systems that represent a full-circle return to the general-purpose tool-calling loop that characterized early experiments, but this time supported by models capable of executing the pattern reliably. These systems operate on a philosophy Chase terms "Model Maximalism": rather than building intricate, rigid software workflows that constrain the agent's behavior, the goal is to "push all of the complexity into the prompt" and trust the model to navigate that complexity intelligently.
The mechanism works like this: Instead of a traditional workflow engine that forces the system through predefined steps, a Deep Agent uses a core loop where the large language model itself "determines when to go the next step." This creates a fundamental distinction in system behavior. In a workflow system, the software architecture dictates the sequence of operations. In a Deep Agent, the model evaluates the current state, consults its instructions, and decides autonomously what action to take next—whether that's calling a tool, requesting more information, or concluding the task.
The sophistication lies not in the loop itself, which remains relatively simple, but in the "harness" that supports it. This harness packages industry best practices into a batteries-included environment: built-in planning tools, file system access, error handling, and other capabilities the model can invoke as needed. The architectural complexity hasn't disappeared; it's been absorbed into the supporting infrastructure and the instructions that guide the model's decision-making.
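The core loop Chase describes can be sketched in a few lines. This is an illustrative toy, not any particular framework's implementation: the model call is stubbed out, and the tool names are invented for the example. The point is the control flow, where the model, not the software, decides what happens next and when to stop.

```python
# Minimal sketch of a Deep Agent's core tool-calling loop (illustrative only).
# The model -- stubbed here -- inspects the conversation state and decides the
# next action; the surrounding "harness" supplies the tools it can invoke.

def call_model(messages, tools):
    """Stub for an LLM call. A real agent would call a model API here.
    Returns either a tool invocation or a final answer."""
    # Toy policy: make a plan first, then finish.
    if not any(m["role"] == "tool" for m in messages):
        return {"type": "tool_call", "tool": "make_plan",
                "args": {"task": messages[0]["content"]}}
    return {"type": "final", "content": "done"}

def run_agent(task, tools, max_steps=30):
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):           # the loop itself stays simple
        decision = call_model(messages, tools)
        if decision["type"] == "final":  # the model decides when to stop
            return decision["content"]
        result = tools[decision["tool"]](**decision["args"])
        messages.append({"role": "tool", "content": result})
    raise RuntimeError("step budget exhausted")

tools = {"make_plan": lambda task: f"plan for: {task}"}
print(run_agent("summarize Q3 research", tools))  # → "done"
```

Note that the workflow-engine alternative would hard-code the sequence of tool calls; here the sequence emerges from the model's decisions on each pass through the loop.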
The 30-page prompt: instructions as high-density programming
Perhaps the most counterintuitive aspect of modern agent development is what "prompt" actually means at production scale. When Chase refers to the prompts powering systems like Claude Code, he's not describing a few paragraphs of natural language instructions. These prompts run "20, 30 pages" long. Even the descriptions for individual tools an agent can use—specifications for a single function call—often exceed the length of an entire prompt a typical user might write.
This reveals a critical truth about agent development: prompt engineering at this level is a form of high-density programming. The capability and reliability of a production agent are deeply tied to the detail, nuance, and instruction encoded in its foundational prompt. These documents must specify not just what the agent should do, but how it should reason about edge cases, when it should escalate decisions to human review, what constitutes successful task completion, and how it should handle the inevitable ambiguities that arise in complex workflows.
For legal teams, this has profound implications for governance frameworks. Traditional software can be audited by examining code. Agent behavior is determined by instructions that might not be versioned like code, might be modified more frequently than traditional software releases, and might interact with model capabilities in ways that produce emergent behaviors not explicitly specified in the prompt. Documentation requirements and change control processes must adapt to this reality.
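One way to adapt change control to prompt-driven behavior is to treat prompts as content-addressed, versioned artifacts. The sketch below is an assumption about how such a registry might look, with invented function and field names, not a description of any existing tool:

```python
# Sketch: treating agent prompts as versioned, auditable artifacts.
# All names here are illustrative assumptions, not a product's API.
import hashlib
import datetime

def register_prompt_version(registry, agent_name, prompt_text, author):
    """Record a content-addressed prompt version so changes leave an audit trail."""
    digest = hashlib.sha256(prompt_text.encode()).hexdigest()[:12]
    entry = {
        "agent": agent_name,
        "version": digest,               # content hash serves as an immutable ID
        "length_chars": len(prompt_text),
        "author": author,
        "registered_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }
    registry.setdefault(agent_name, []).append(entry)
    return digest

registry = {}
v1 = register_prompt_version(registry, "research-agent",
                             "You are a research agent...", "jlee")
v2 = register_prompt_version(registry, "research-agent",
                             "You are a research agent, revised...", "jlee")
assert v1 != v2 and len(registry["research-agent"]) == 2
```

Hashing the prompt text means any edit, however small, produces a new version identifier, which gives auditors a way to tie an observed agent behavior to the exact instructions in force at the time.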
The uncanny valley of AI response time and delegation patterns
Agent user experience presents what Chase, drawing on an observation from Swix on Twitter, characterizes as an "uncanny valley" of response time. Human tolerance for AI response delays follows a bimodal pattern. At one extreme, synchronous interactions demand sub-second responses that feel like seamless extensions of thought—in-line code suggestions, for instance, must not interrupt flow state. At the other extreme, asynchronous work can take hours or even days, treating the agent like a human coworker to whom you delegate substantial tasks, checking in at the beginning and end of the day.
The uncomfortable middle ground—a two-minute wait, for example—falls into the uncanny valley. This duration is too slow to maintain flow state but not long enough for the task to feel like fully delegated work that can be mentally set aside. The user is trapped in a liminal state, waiting but unable to context-switch effectively to other work.
This reframes the design challenge from "how do we make agents faster" to "how do we design the right interaction pattern for the task's natural duration." A 30-step research workflow that takes two hours might not need to be faster; it needs an interface that supports delegation and periodic check-ins rather than continuous monitoring. The human role shifts from active user to manager: "kicking them off and editing their work," as Chase describes it. The agent produces the first draft, and the human provides oversight, refinement, and final approval.
From testing to reviewing: why traditional evaluation fails
The evaluation challenge for complex agents represents a fundamental departure from traditional software quality assurance. Simple AI functions can be measured with conventional evaluation sets: create an input, specify the expected output, and verify the system produces it. An email agent that should call the calendar tool when it receives a meeting request can be tested with the input "Are you free on Tuesday?" and checked for a specific function call.
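The conventional eval-set model from the paragraph above can be made concrete with a small sketch. The `route_email` function here is a stand-in for the email agent, with a toy routing rule; the structure to notice is the fixed input paired with one verifiable expected output:

```python
# Sketch of a conventional eval case for a simple agent: a fixed input and
# one expected tool call. `route_email` is a toy stand-in for the agent.

def route_email(message):
    """Toy email agent: returns the tool call it would make."""
    if "free on" in message.lower() or "meeting" in message.lower():
        return {"tool": "check_calendar", "args": {"query": message}}
    return {"tool": "draft_reply", "args": {"message": message}}

eval_set = [
    {"input": "Are you free on Tuesday?", "expected_tool": "check_calendar"},
    {"input": "Thanks for the update!",   "expected_tool": "draft_reply"},
]

for case in eval_set:
    assert route_email(case["input"])["tool"] == case["expected_tool"]
print("all eval cases passed")
```

This style of verification works precisely because there is a single correct answer per input, which is the assumption that collapses for multi-step agents.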
This model breaks down completely for deep agents executing 30-step workflows. As Chase explains, evaluating such a system is "more like evaluating a human's work" than testing software. The task is complex, multiple valid trajectories could lead to acceptable outcomes, and success requires expert judgment rather than automated verification. A research agent might approach a problem through different reasoning paths, use tools in varying sequences, and still produce a valid result—or it might follow a plausible-seeming path that arrives at a subtly incorrect conclusion requiring domain expertise to identify.
The evaluation method Chase describes is review-based rather than test-based. The best evaluator is often the end-user, whose corrective feedback—"No, you did X. You should have Y"—creates a closed loop. This feedback not only corrects the agent in the moment but becomes the primary mechanism for creating memory and building personalized evaluation sets for future iterations. Real-world use becomes the continuous engine for improving performance rather than a separate quality assurance phase.
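The closed loop Chase describes, where a correction becomes both a memory and a future eval case, might be captured in a structure like the following. The record shapes and function name are assumptions for illustration, not a documented mechanism:

```python
# Sketch: turning a user's corrective feedback ("you did X, should have Y")
# into a stored memory and a future eval case. Structures are illustrative.

def record_correction(memory, eval_set, task, agent_output, user_correction):
    """File one piece of corrective feedback into memory and the eval set."""
    memory.append(f"When asked '{task}', prefer: {user_correction}")
    eval_set.append({
        "input": task,
        "bad_output": agent_output,    # what the agent actually did
        "expected": user_correction,   # what the reviewer says it should have done
    })

memory, eval_set = [], []
record_correction(
    memory, eval_set,
    task="summarize the Q3 report",
    agent_output="a 10-page summary",
    user_correction="a one-page executive summary",
)
assert len(memory) == 1
assert eval_set[0]["expected"] == "a one-page executive summary"
```

The design point is that the same artifact serves two purposes: the memory entry steers the agent's next run, while the eval entry lets the team check that future prompt revisions don't reintroduce the error.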
Review protocols for 30-step agent workflows
For legal teams, the shift from testing to reviewing creates new documentation and oversight requirements. When an agent executes a multi-step workflow—synthesizing research, drafting content, or analyzing data—the organization must be able to demonstrate that the output received appropriate human review before being relied upon for consequential decisions. This requires establishing review protocols that specify what level of scrutiny different types of agent output demand.
The first step is mapping agent workflows to organizational risk tolerance. A research agent that drafts an initial market analysis for internal discussion might require only cursory review for obvious errors. The same agent generating analysis that will inform regulatory filings demands comprehensive expert review. Legal teams should work with product teams to classify agent outputs by consequence level and establish corresponding review requirements: spot-check, expert review, or full verification.
The second step is creating review documentation standards. When a human reviews and approves agent output, that review should be logged with sufficient detail to demonstrate due diligence if the decision is later questioned. For high-consequence outputs, this might include the reviewer's qualifications, the specific aspects of the output they verified, any corrections they made, and their rationale for approving the final version. This documentation serves both as a quality control mechanism and as evidence that the organization exercised appropriate oversight.
The third step is establishing feedback loops. As Chase notes, corrective feedback creates memory and improves future performance. Legal teams should work with product teams to ensure that review corrections flow back to improve the agent's instructions or training data. When a reviewer identifies an error pattern—the agent consistently misinterprets a particular type of request, for instance—that pattern should trigger updates to the agent's foundational prompt or additional examples in its instruction set.
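The three steps above could be reflected in a single review-log record that ties consequence level to required scrutiny and captures the reviewer's due-diligence trail. The field names and consequence tiers below are assumptions sketched for illustration, not a prescribed schema:

```python
# Sketch of a review-log record for agent output. Tier names, field names,
# and the scrutiny mapping are illustrative assumptions.
from dataclasses import dataclass, field

# Consequence level -> required level of review (step one of the protocol).
CONSEQUENCE_TIERS = {
    "internal-draft": "spot-check",
    "client-facing": "expert-review",
    "regulatory": "full-verification",
}

@dataclass
class ReviewRecord:
    output_id: str
    consequence: str                 # key into CONSEQUENCE_TIERS
    reviewer: str
    aspects_verified: list           # what the reviewer actually checked
    corrections: list = field(default_factory=list)   # feeds the feedback loop
    approved: bool = False

    def required_scrutiny(self):
        return CONSEQUENCE_TIERS[self.consequence]

rec = ReviewRecord(
    output_id="analysis-042",
    consequence="regulatory",
    reviewer="m.ortiz",
    aspects_verified=["sources", "figures", "jurisdictional scope"],
    approved=True,
)
assert rec.required_scrutiny() == "full-verification"
```

Keeping `corrections` on the same record is what connects step two (documentation) to step three (the feedback loop): every logged correction is already in a form that can flow back into the agent's instructions.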
Delegation interfaces and feedback loops
For product teams, the shift to agent management creates new interface design requirements. Traditional software interfaces assume continuous user attention: the user issues a command, watches the system execute it, and immediately evaluates the result. Agent interfaces must support delegation patterns where the user initiates a task, context-switches to other work, receives a notification when the agent completes its work, and then reviews the output.
The first design challenge is status visibility. When an agent is working on a 30-step workflow over two hours, the user needs sufficient visibility to understand progress without requiring continuous monitoring. This might mean a dashboard showing completed steps, current activity, and estimated time to completion—or it might mean periodic notifications at logical checkpoints. As Chase muses, the right model might resemble "a video game almost where you see your little agents running around," though the specific interface pattern remains an open design question.
The second challenge is interruption and correction. Users must be able to intervene when they notice the agent taking an unproductive path without losing all the work completed up to that point. This requires interfaces that expose the agent's reasoning, allow users to provide corrective guidance, and enable the agent to incorporate that guidance and continue from the corrected state rather than restarting the entire workflow.
The third challenge is feedback capture. Every correction a user makes represents valuable training data for improving the agent's future performance. Product teams should design interfaces that make it easy to provide structured feedback—"you did X, you should have done Y"—and ensure that feedback flows back to the systems that generate the agent's instructions. This closed loop turns every user interaction into an opportunity for improvement.
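The three interface challenges above, status visibility, interruption with resumption, and feedback capture, can be sketched together in one small harness. Everything here is an illustrative assumption: the step functions, the event shape, and the resumption mechanism are invented for the example:

```python
# Sketch: an agent run that emits checkpoint events (status visibility) and
# can resume from a corrected step with new guidance (interruption without
# restarting). Step functions and event shapes are illustrative assumptions.

def run_with_checkpoints(steps, notify, start_at=0, guidance=None):
    """Execute steps in order, emitting a checkpoint event after each one.
    Resumable from `start_at` so a correction doesn't discard earlier work."""
    results = []
    for i, step in enumerate(steps[start_at:], start=start_at):
        results.append(step(guidance))
        notify({"step": i + 1, "total": len(steps), "status": "completed"})
    return results

events = []
steps = [
    lambda g: f"searched ({g or 'default scope'})",
    lambda g: f"drafted ({g or 'default scope'})",
]

# Initial delegated run: the user can context-switch and watch `events`.
run_with_checkpoints(steps, events.append)

# The user spots a problem with step 2, supplies corrective guidance,
# and resumes from that step instead of restarting the whole workflow.
resumed = run_with_checkpoints(steps, events.append, start_at=1,
                               guidance="focus on EU market")
assert resumed == ["drafted (focus on EU market)"]
assert events[-1] == {"step": 2, "total": 2, "status": "completed"}
```

In a real interface the `notify` callback would drive the dashboard or notifications Chase speculates about, and the captured guidance would also be logged as structured feedback for the improvement loop.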
Agent oversight as organizational capability
The insights Chase provides reveal that effective agent deployment requires organizations to develop new capabilities that sit at the intersection of traditional software quality assurance, knowledge management, and performance review. Agents are neither fully deterministic software systems nor fully autonomous coworkers, and governing them requires frameworks that acknowledge this hybrid nature.
Organizations that succeed with agent deployment will be those that recognize the shift from building rigid workflows to managing systems that make autonomous decisions within guardrails defined by sophisticated instructions. This means legal teams must develop expertise in prompt auditing and review protocol design, while product teams must develop new patterns for delegation interfaces and feedback loops. The capability to deploy agents reliably becomes an organizational capability, not just a technical implementation detail.
The transition Chase describes—from agents that fail unpredictably to agents that require management and review—marks a fundamental shift in how organizations incorporate AI into their operations. The question is no longer whether agents can reliably complete complex workflows, but whether organizations can develop the oversight frameworks to manage them effectively when they do.
References
Based on insights from Harrison Chase, founder and CEO of LangChain, in a conversation available at https://youtu.be/LSrGpWJCx94. The discussion covered the evolution of AI agent architectures, the emergence of "Deep Agents," and the shift from traditional software testing to review-based evaluation for complex autonomous systems.