Why AI agent autonomy should be a design choice, not a technical inevitability

A University of Washington framework argues that AI agent autonomy should be a deliberate design choice separate from capability, proposing five user-role levels ranging from operator to observer.

Citation: Feng, K. J. Kevin, David W. McDonald, and Amy X. Zhang. "Levels of Autonomy for AI Agents." arXiv preprint arXiv:2506.12469v2 [cs.HC] (2025).

I believe this autonomy framework provides product teams with the vocabulary and structure to make deliberate decisions about user control, which will influence liability exposure and competitive positioning as AI agents become widespread.

The University of Washington research team, led by K. J. Kevin Feng, presents a compelling argument that we have been approaching AI agent development the wrong way around. Instead of treating autonomy as an inevitable result of increasing capability, they argue it should be a conscious design choice that product teams can manage independently of how intelligent their AI systems become. Their five-level framework, released alongside the Knight First Amendment Institute, offers the first systematic method I have seen for linking user involvement to agent behavior in ways that directly shape legal risk and product strategy.

The paper centers on user roles rather than technical capabilities. Level 1 positions users as operators who retain control over planning and decision-making while agents provide on-demand assistance, as in ChatGPT Canvas or Microsoft Copilot. Level 2 makes users collaborators who work alongside agents in shared workflows and can take control at any point, exemplified by OpenAI's Operator. Level 3 treats users as consultants whom agents actively seek out for expertise and preferences, like Gemini Deep Research. Level 4 reduces users to approvers who engage only when agents encounter failures or high-risk scenarios, as in systems like SWE-agent or Devin. Level 5 relegates users to observers with only an emergency off-switch while agents operate fully autonomously, as demonstrated by research systems like Voyager.
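
To make the mapping easier to reason about in design docs and code reviews, here is a minimal sketch that encodes the five user roles as a Python enum, with the examples the paper cites as comments. This is my own illustration of the taxonomy, not code from the paper.

```python
from enum import IntEnum

class AutonomyLevel(IntEnum):
    """Five levels of agent autonomy, named after the user's role (illustrative only)."""
    OPERATOR = 1      # user plans and decides; agent assists on demand (e.g., ChatGPT Canvas, Copilot)
    COLLABORATOR = 2  # shared workflow; user can take over at any point (e.g., OpenAI Operator)
    CONSULTANT = 3    # agent leads but consults the user for expertise (e.g., Gemini Deep Research)
    APPROVER = 4      # user engaged only for failures or high-risk actions (e.g., SWE-agent, Devin)
    OBSERVER = 5      # fully autonomous; user keeps an emergency off-switch (e.g., Voyager)
```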

What makes this framework immediately useful for product counsel is how it links design choices to liability exposure. Level 1 and 2 systems preserve clear human oversight and control, which supports traditional notions of principal responsibility and simplifies compliance with existing regulations. Users can intervene, adjust outputs, and retain meaningful decision-making authority throughout the workflow. Level 4 and 5 systems raise harder accountability questions because humans have limited ability to understand or influence agent decisions in real time.

The distinction between agency and autonomy is especially useful for risk assessment. Agency describes an agent's ability to act intentionally using available tools and environment, while autonomy indicates how much the agent functions without user involvement. A system with high agency but low autonomy might have access to many powerful tools but need frequent user approval before acting. Conversely, a system with low agency but high autonomy might have limited capabilities but operate continuously without supervision. This distinction helps product teams think systematically about which lever to pull for different risk levels and use cases.
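
The two dimensions can be kept orthogonal in an agent's configuration. The sketch below, with field names I made up for illustration, treats the tool surface (a rough proxy for agency) and the autonomy level as independent settings:

```python
from dataclasses import dataclass, field

@dataclass
class AgentProfile:
    """Agency and autonomy modeled as independent design dimensions (hypothetical fields)."""
    tools: list[str] = field(default_factory=list)  # proxy for agency: what the agent *can* do
    autonomy_level: int = 1                         # proxy for autonomy: how freely it acts (1-5)

# High agency, low autonomy: many powerful tools, but every action awaits user approval.
release_agent = AgentProfile(tools=["git", "shell", "browser", "ci"], autonomy_level=1)

# Low agency, high autonomy: one narrow capability, running continuously without supervision.
log_rotator = AgentProfile(tools=["filesystem"], autonomy_level=5)
```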

The autonomy certificate concept introduces a governance mechanism that could change how AI systems are deployed and regulated. These digital documents, issued by third-party authorities, would specify the maximum autonomy level at which an agent can operate based on its technical specifications and operational environment. Applicants would submit both their agent and an "autonomy case" that demonstrates the system behaves at the designated level and no higher. This approach mirrors safety cases used in other industries and gives stakeholders standardized information about agent behavior.
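
A certificate could be represented as a small, machine-readable record that deployment tooling checks before launching an agent. The field names and the gating check below are my own guesses at what such a record might contain; the paper does not prescribe a schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AutonomyCertificate:
    """Hypothetical shape of an autonomy certificate; field names are assumptions."""
    agent_id: str
    issuer: str              # third-party certifying authority
    max_autonomy_level: int  # highest level (1-5) the agent may operate at
    autonomy_case_uri: str   # pointer to the submitted evidence (the "autonomy case")
    environment: str         # operational environment the certificate covers

def permitted(cert: AutonomyCertificate, requested_level: int) -> bool:
    # Deployment gate: refuse to run the agent above its certified level.
    return requested_level <= cert.max_autonomy_level
```

Placing a gate like this in the deployment pipeline would make raising an agent's autonomy level a re-certification event rather than a quiet configuration change.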

For product teams, autonomy certificates could serve as competitive differentiators in markets where trustworthiness is crucial. Systems certified at appropriate autonomy levels might receive preferential treatment from enterprise clients, regulatory bodies, and business partners who increasingly value AI safety and accountability. The certification process would also require teams to clearly document their design decisions and control mechanisms, which strengthens both internal governance and external transparency.

The evaluation methodology fills a gap in current AI assessment practices. While existing benchmarks focus on capability—whether systems can complete tasks accurately—they do not measure autonomy as a separate design dimension. The proposed "assisted evaluation" approach gauges the minimum level of user involvement needed for an agent to surpass accuracy thresholds. This separates autonomy assessment from capability testing, allowing for a more nuanced evaluation of agent behavior that aligns with real-world deployment scenarios.
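
In code, the idea reduces to a search over involvement levels. The loop below is a rough sketch of that logic under my own assumptions (a benchmark callable parameterized by involvement level and a fixed accuracy threshold); it is not the paper's evaluation protocol.

```python
from typing import Callable

def assisted_evaluation(run_benchmark: Callable[[int], float],
                        threshold: float = 0.9) -> int | None:
    """Return the highest autonomy level (i.e., the least user involvement) at which
    the agent still clears the accuracy threshold, or None if it never does."""
    for level in range(5, 0, -1):  # 5 = observer (least involvement) ... 1 = operator (most)
        if run_benchmark(level) >= threshold:
            return level
    return None
```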

The business implications extend beyond technical implementation to strategic positioning. Companies that master deliberate autonomy calibration can optimize user experiences for specific contexts rather than defaulting to maximum automation. A legal research agent might operate at Level 3 for document analysis but Level 1 for strategic advice, reflecting different risk tolerances and value propositions across tasks. This contextual approach to autonomy design could enable more precise market positioning and regulatory compliance.
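
A simple way to operationalize that contextual approach is a per-task autonomy policy that the agent consults before acting. The task names and levels below just mirror the legal research example above and are illustrative, not prescribed by the paper.

```python
# Hypothetical per-task autonomy policy for a legal research agent.
AUTONOMY_POLICY = {
    "document_analysis": 3,  # consultant: agent leads, checks in for expertise and preferences
    "citation_checking": 4,  # approver: agent surfaces only risky or failed lookups
    "strategic_advice": 1,   # operator: the lawyer stays in control end to end
}

def autonomy_for(task: str, default: int = 1) -> int:
    # Fall back to the most conservative level for tasks the policy does not name.
    return AUTONOMY_POLICY.get(task, default)
```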

The multi-agent considerations become increasingly relevant as organizations deploy multiple AI systems that interact with one another. The authors note that systems composed entirely of Level 1 agents create coordination problems because every agent waits for an operator to assign tasks. Systems with only Level 5 agents may communicate so sparsely that the overall system becomes difficult to steer and debug. Strategically mixing autonomy levels could optimize multi-agent system performance while maintaining appropriate human oversight, as the sketch below illustrates.
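
One way to act on that observation is a configuration check that flags the degenerate compositions the authors describe. The heuristic and warning messages below are my own illustration:

```python
def check_autonomy_mix(levels: list[int]) -> list[str]:
    """Flag degenerate multi-agent compositions (illustrative heuristic, not from the paper)."""
    warnings = []
    if levels and all(lvl == 1 for lvl in levels):
        warnings.append("All agents are Level 1: every agent waits on an operator to assign tasks.")
    if levels and all(lvl == 5 for lvl in levels):
        warnings.append("All agents are Level 5: sparse communication makes the system hard to steer and debug.")
    return warnings

print(check_autonomy_mix([1, 1, 1]))  # coordination bottleneck
print(check_autonomy_mix([2, 3, 5]))  # mixed levels -> no warnings
```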

The framework also highlights implementation challenges that product teams need to overcome. Level 2 systems require sophisticated interfaces for seamless task handoffs between users and agents. Level 3 systems need mechanisms for deciding when to consult users and how to gather high-quality feedback. Level 4 systems must reliably detect the consequential actions that need approval while avoiding the user disengagement that comes from rote rubber-stamping. Each level demands specific technical capabilities and user experience design work that influence development timelines and costs.
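
For Level 4 in particular, the engineering problem is deciding which actions to gate. The sketch below, with a made-up action list, routes only high-risk actions through user approval so that routine steps do not train users to rubber-stamp every prompt:

```python
from typing import Callable

# Hypothetical high-risk action list; real gating criteria would be product-specific.
HIGH_RISK_ACTIONS = {"delete_branch", "send_external_email", "deploy_to_production"}

def execute_with_approval(action: str,
                          perform: Callable[[], object],
                          ask_user: Callable[[str], bool]):
    """Run routine actions autonomously; pause for explicit approval only on high-risk ones."""
    if action in HIGH_RISK_ACTIONS and not ask_user(f"Approve high-risk action '{action}'?"):
        return None  # user declined; the agent must replan
    return perform()
```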

Looking ahead, this framework provides structure for navigating the regulatory landscape as it develops around AI agents. Rather than waiting for prescriptive rules about specific technologies, product teams can use autonomy levels to anticipate regulatory requirements and design systems that align with emerging governance expectations. The emphasis on user roles and control mechanisms maps well to existing legal frameworks around delegation, supervision, and accountability that regulators understand.

The research suggests that effective AI governance requires moving beyond binary thinking about automation versus human control toward nuanced approaches that calibrate autonomy based on context, risk, and value. Product teams that internalize this framework early will be better positioned to make deliberate design choices that optimize both business outcomes and compliance posture as AI agents become standard business tools.

Levels of Autonomy for AI Agents
Autonomy is a double-edged sword for AI agents, simultaneously unlocking transformative possibilities and serious risks. How can agent developers calibrate the appropriate levels of autonomy at which their agents should operate? We argue that an agent’s level of autonomy can be treated as a deliberate design decision, separate from its capability and operational environment. In this work, we define five levels of escalating agent autonomy, characterized by the roles a user can take when interacting with an agent: operator, collaborator, consultant, approver, and observer. Within each level, we describe the ways by which a user can exert control over the agent and open questions for how to design the nature of user-agent interaction. We then highlight a potential application of our framework towards AI autonomy certificates to govern agent behavior in single- and multi-agent systems. We conclude by proposing early ideas for evaluating agents’ autonomy. Our work aims to contribute meaningful, practical steps towards responsibly deployed and useful AI agents in the real world.

TLDR: This document introduces a framework for AI agent autonomy, proposing it as a deliberate design decision separate from an agent's capabilities or environment.

Key points:

Five Levels of Autonomy: The framework defines five levels based on the user's role when interacting with an agent:

    ◦ L1 User as an Operator: The user directs and makes decisions, with the agent providing on-demand support and assistance.

    ◦ L2 User as a Collaborator: User and agent collaboratively plan, delegate, and execute tasks with frequent communication and shared progress.

    ◦ L3 User as a Consultant: The agent takes the lead in planning and execution but consults the user for expertise and preferences.

    ◦ L4 User as an Approver: The agent engages the user only in high-risk, failure, or pre-specified scenarios that require approval.

    ◦ L5 User as an Observer: The agent operates with full autonomy under user monitoring, with the only control mechanism being an emergency off-switch.

Autonomy Certificates: The framework proposes "autonomy certificates" as a governance mechanism. These digital documents, issued by a third party, prescribe the maximum autonomy level an agent can operate at based on its technical specifications and environment. They are useful for risk assessment, designing safety frameworks, and engineering multi-agent systems.

Evaluating Autonomy: The document suggests "assisted evaluations" to measure autonomy independently of capability benchmarks. This involves assessing the minimum level of user involvement needed for an agent to successfully complete a task.