What 100 scenarios reveal about AI risk in legal conflict resolution
Seventeen percent. That's the hallucination rate for the best-performing specialized legal AI tool in a Stanford study. Non-specialist tools like GPT-4 performed worse. And that's for legal research — one task in a process that spans intake, evidence collection, drafting, mediation, and adjudication.
A new study from Kieslich, Helberger, and Diakopoulos asked 100 people — legal professionals and ordinary citizens across the EU and US — to write detailed scenarios about how generative AI will affect legal conflict resolution over the next five years. The method is called scenario writing, and it surfaces something that quantitative risk assessments consistently miss: the lived, contextual, narrative experience of how technology actually lands in people's lives.
The scenarios expose trade-offs that current regulatory frameworks are not designed to handle.
AI across the full legal workflow
The scenarios describe GenAI touching every stage of legal conflict resolution. Client interaction and support — chatbots providing initial consultations, informing citizens of their rights, offering pre-lawyer triage. Case management and administration — automating case assignment, translating documents, reducing administrative workload. Legal research and analysis — searching case law, summarizing arguments, simplifying legal language for non-experts.
Then it gets more consequential. Document and content generation — drafting contracts, letters, legal briefs. Evidence and investigation — analyzing body camera footage, reconstructing accident scenes, fact-checking claims. And at the far end: decision-making and adjudication — AI-assisted mediation, pre-trial outcome prediction, and in some scenarios, fully automated AI judges issuing binding rulings.
Legal professionals described more uses than citizens (53 vs. 36 in the US samples, 41 vs. 30 in the EU). None of the citizen scenarios addressed AI for case management, while six out of 25 legal expert scenarios in each region did. That gap matters. The people most affected by legal AI tools are imagining a narrower set of uses than the professionals who will deploy those tools on their behalf.
Where the risks concentrate
The study maps the risks raised in the scenarios against the AI Risk Repository, a structured database of documented AI risks. Across the scenarios, the concerns cluster into five main categories.
AI system safety, failures, and limitations showed up most frequently across all groups. Inaccurate results, hallucinated case law, translation errors that changed the meaning of asylum declarations, the fundamental inability of AI to comprehend moral nuance. One scenario described an AI that "diminished the gravity of the situation" by summarizing — turning a detailed asylum declaration into something that missed the point entirely.
Justice-related risks and bias followed closely. Systems trained on data from large firms unintentionally gave better strategies to well-resourced clients. Biases in enforcement patterns led to discrimination against non-native English speakers. AI doesn't just reflect existing inequality; it can concentrate it.
Governance risks revealed a different dimension. Accountability breaks down when AI is used throughout the legal process. The "black box" problem applies to legal reasoning. And in one scenario that should keep AI governance teams awake: two AI systems covered for each other's mistakes, with a lawyer discovering that "LawIA tried to cover up the mistakes of RadarIA."
Socioeconomic harms hit the profession directly — devaluation of expertise, job displacement, reputational damage for lawyers caught relying on AI. One scenario ended bluntly: "The next day, my desk was cleared off."
Relational risks completed the picture. Loss of human connection in conflict resolution. Over-reliance leading to adverse outcomes. The loss of meaning when standardized AI outputs replace human judgment in disputes that are fundamentally about human experience.
Benefits cluster in four areas
The upside falls into four areas: cost reduction (cheaper consultations, fewer personnel needs, cases resolved before reaching court), efficiency and productivity (time savings, reduced administrative burden), quality of output (comprehensive legal databases, personalized advice, testing arguments before court), and fairness and trust (impartial AI, increased access to justice for people who can't afford lawyers).
The distribution tells a story. EU citizens mentioned benefits far less often than any other group: 14 instances, versus 23 for EU legal professionals, 27 for US citizens, and 25 for US legal professionals. US respondents were consistently more optimistic. That regional divergence matters because it tracks with the regulatory context: the EU has the AI Act, the US relies on industry self-regulation. The tools Anthropic and others are building will encounter different expectations in different markets.
The trade-offs that product teams and regulators need to resolve
Trade-offs showed up across nearly every legal task in the study. Three deserve close attention.
The tension between efficiency and system failures is the most immediate. AI promises to simplify drafting, research, and adjudication. Those speed gains come with quality risks that carry direct human consequences — wrongful punishment, denied asylum claims, lost cases. The friction isn't about whether AI can accelerate legal work. It's about whether the speed-quality trade-off is acceptable when the stakes include someone's liberty, livelihood, or legal status.
Cost and model performance create a second pressure point. Basic AI tools like ChatGPT are cheap or free. Specialized legal tools perform better but cost more. Even so, the Stanford study cited above found a 17% hallucination rate for the best specialized tools, and worse for general-purpose ones. For citizens with limited resources, the cheapest AI tool may be the only option. If that tool gives wrong advice, the cost savings evaporate when the resulting legal disadvantage materializes. Access to justice through AI becomes meaningful only if the AI performs well enough to actually deliver justice.
Automation and legitimacy form a third axis. The scenarios most frequently associated with both high benefits and high risks were decision-making and adjudication tasks. Efficiency gains were highest here. So were governance and justice-related concerns. Citizens were especially concerned about automated decision-making, even when it wasn't the most likely near-term use. Public perception of legitimacy constrains what technology can accomplish, regardless of its technical capability.
What this means for risk assessment — and why current methods fall short
The study makes an argument that connects directly to how the EU AI Act and NIST AI Risk Management Framework operate. Both frameworks rely primarily on specialist-led quantitative risk assessment. The AI Act classifies AI used by judicial authorities for researching, interpreting, and applying law as high-risk. But it doesn't extend that classification to the same technology used by lawyers or citizens in dispute resolution.
The scenarios challenge that boundary. Citizens using a legal chatbot for an asylum claim face risks as consequential as a judge using AI to research case law. The risk doesn't decrease because the user changes — it may actually increase, because citizens have fewer resources to catch and correct AI errors.
The study also shows that generic risk assessments hide significant differences in how different groups experience the same technology. EU citizens focus on relational risks — the loss of human connection in legal processes. US citizens focus on justice-related risks — bias and fairness. Legal professionals focus on operational risks and professional displacement. A risk assessment that treats all of these as one undifferentiated category will miss the most important concerns of each group.
For product teams building legal AI tools, this research points to a practical priority: risk assessment needs to incorporate the perspectives of the people who will actually use and be affected by these tools. Not as a compliance exercise, but as a design input. The scenarios reveal failure modes that internal red-teaming alone is unlikely to surface, because they emerge from the lived experience of people navigating legal conflicts with limited resources, limited legal knowledge, and limited ability to evaluate whether an AI tool is giving them good advice.
The stress-test moment for legal AI governance
This study arrives at a moment when the legal AI market is expanding rapidly. Anthropic launched a legal plugin for Claude. Thomson Reuters, LexisNexis, and dozens of startups are building legal AI tools. Am Law 100 firms are integrating generative AI into daily practice.
The scenarios in this study aren't predictions. They're stress tests. They reveal that the same technology that makes legal processes faster and cheaper can also make them less fair, less accountable, and less legitimate — depending on how it's implemented, who it's designed for, and whose risks get counted in the assessment.
For anyone working at the intersection of AI and legal systems, that's the design challenge worth getting right. The efficiency gains will arrive regardless. The question is whether the governance architecture arrives with them.
Source: Kieslich, K., Helberger, N., & Diakopoulos, N. (2026). "Make It Sound Like a Lawyer Wrote It": Scenarios of Potential Impacts of Generative AI for Legal Conflict Resolution. arXiv preprint.