The Promise and Peril of an AI Jury


Sean Harrington & Hayley Stillwell, Michael Scott Is Not a Juror: The Limits of AI in Simulating Human Judgment (2025) (unpublished manuscript)


What if artificial intelligence could replicate the judgments of real jurors? With AI models becoming more sophisticated, it’s no longer a far-fetched question. Researchers, lawyers, and even courts are beginning to explore whether AI could simulate how a jury might react to evidence, potentially revolutionizing trial strategy and legal scholarship. This isn't just theoretical; a growing industry is already marketing AI jury consultants to law firms, and even judges have begun experimenting with AI to test legal arguments. That was the starting point for a recent study from researchers at the University of Oklahoma College of Law, who set out to test whether major AI models like GPT, Claude, and Gemini could actually think like jurors.

The researchers prompted the models to evaluate criminal trial evidence from the perspective of jurors with specific demographic profiles. They then compared the AI's judgments to those of hundreds of real human participants. The goal was to measure how closely the AI could mirror human reasoning.

The models did not merely fall short of the human benchmark. The researchers found their failures were "systematic and sometimes strange," revealing deep-seated flaws in how today's AI handles nuanced human judgment. From generating bizarre, fictional characters for the jury pool to enforcing rigid stereotypes, the study uncovered a fascinating gap between artificial intelligence and authentic human reasoning. Here are the four key takeaways.

AI Thinks a Realistic Juror Is a Sitcom Character

The researchers’ first major hurdle was simply getting the AI to create a believable jury pool. When asked to generate realistic jurors, the models often produced profiles with "implausible or fictional characteristics." The AIs seemed to be optimizing for novelty and narrative flair rather than statistical accuracy.

The examples were striking. One model generated a 102-year-old juror who worked part-time as both a marine biologist and a DJ. Another created a juror who was a white male paper company manager from Scranton, Pennsylvania, named "Michael Scott." Only when the researchers noticed the name did they realize the model had cast the lead character from The Office.

These were not just random glitches; they pointed to a more fundamental problem. The AI was not trying to simulate a real-world jury. Instead, it was engaging in what the researchers called "impression management": it prioritized creating interesting characters and simulating an aspirational, TV-drama version of diversity rather than reflecting the actual demographics of a typical jury pool.

AI Models Aren't Just Wrong, They're Biased in Wildly Different Ways

Even when the researchers bypassed the character-creation problem by feeding the AIs real demographic profiles from human participants, the models failed to match human judgment. To establish a benchmark, the study first measured how much real jurors naturally varied from their group's average rating of evidence. This "human noise floor" was an error of 2.03 points on a ten-point scale.

None of the AI models were that accurate. GPT-4.1 was the best performer, but its judgments still deviated from the human ratings by an average of 2.52 points (Mean Absolute Error), roughly a 24% increase in error over the human baseline. Claude (3.36) and Gemini (2.93) were even less precise. While that gap may sound small, it represents a meaningful deviation: in real-world trial settings, where a swing of just one or two points in perceived evidence strength can alter the legal stakes, even modest average errors can undermine evidentiary conclusions.

More importantly, the models weren't just randomly inaccurate; each had a distinct directional bias, almost like a personality, as the short sketch after this list makes concrete.

  • Claude: Consistently acted overly skeptical and underrated the strength of the evidence (Mean Signed Error of –2.34).
  • Gemini: Consistently acted more punitive and overrated the strength of the evidence (Mean Signed Error of +0.81).
  • GPT: Remained the most balanced of the three, with almost no directional bias (Mean Signed Error of –0.04).
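
To make these numbers concrete, here is a minimal Python sketch of the three measurements the study reports: the human noise floor, Mean Absolute Error, and Mean Signed Error. The ratings below are invented placeholders, not the study's data, and the exact way the paper computes its noise floor may differ from this simple version.

```python
# Illustrative sketch with invented ratings, not data from the study.
from statistics import mean

human_ratings = [7, 4, 9, 6, 8, 3]   # real jurors' 1-10 ratings of one piece of evidence
model_ratings = [9, 2, 9, 8, 9, 1]   # an AI's ratings for the same juror profiles

# "Human noise floor": how far individual jurors sit, on average,
# from their group's mean rating.
group_mean = mean(human_ratings)
noise_floor = mean(abs(r - group_mean) for r in human_ratings)

# Mean Absolute Error: the average size of the model's miss, ignoring direction.
mae = mean(abs(m - h) for m, h in zip(model_ratings, human_ratings))

# Mean Signed Error: direction of the miss. Positive = overrates the evidence
# (punitive); negative = underrates it (skeptical); near zero = balanced.
signed_error = mean(m - h for m, h in zip(model_ratings, human_ratings))

print(f"noise floor {noise_floor:.2f} | MAE {mae:.2f} | signed error {signed_error:+.2f}")
```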

The implication is significant. Relying on these tools for legal analysis wouldn't just produce errors; it could introduce predictable, systematic distortions. An analysis run on Claude would consistently downplay incriminating evidence, while the same analysis on Gemini would consistently inflate it. These divergent "personalities" aren't random quirks. As we'll see, they appear to be a direct result of the secret, built-in instructions each AI is forced to follow.

AI Can Flatten Human Nuance into Rigid Stereotypes

One of the most concerning findings was how the models handled demographic traits. Instead of treating them as background context, the AIs sometimes used them as "prescriptive" behavioral scripts, substituting a demographic label for individual reasoning.

The most powerful example of this "deterministic identity encoding" came from GPT. In one of the trial scenarios, the model assigned the exact same rating—a 9 out of 10—to every single one of the 46 jurors in the dataset who identified as Black or African American. It did this regardless of their other traits, such as age, income, education, or political affiliation. The model repeated this behavior in the same scenario for all jurors with a high school diploma and all jurors with an associate degree. In essence, the moment the AI saw the label "Black" or "high school diploma," it stopped reasoning about the evidence and simply applied a pre-programmed score. This "flattening of identity into archetype" is the opposite of how humans behave. The study's data showed that real human jurors who shared these same demographic traits expressed a wide and healthy variation in their ratings, reflecting their unique perspectives and life experiences.

But the problem runs deeper than these obvious glitches. The study uncovered more insidious forms of stereotyping, like "post-hoc rationalization," where Claude invented educational justifications to defend a predetermined low score, invoking phrases like "as someone with higher education" only for highly-educated profiles, even though it gave nearly identical low scores to jurors with no degree. Another issue was "subtle amplification," where models exaggerated real-world differences between groups. For instance, the gap in average ratings between Republican and Democrat human jurors was a modest 0.38 points, but Gemini amplified this into a much wider 1.59-point spread, turning a subtle tendency into a stark political divide.
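
Both failure modes are easy to check for once you have paired ratings. The sketch below, with invented placeholder ratings rather than the study's dataset, shows one plausible way to flag flattening (near-zero spread inside a demographic group) and amplification (a model widening the gap between groups well beyond the human gap).

```python
# Illustrative sketch with made-up ratings, not the study's data.
from statistics import mean, pstdev

def within_group_spread(ratings_by_group):
    """Standard deviation of ratings inside each demographic group.
    A spread of 0 means every member got the identical score (flattening)."""
    return {group: pstdev(ratings) for group, ratings in ratings_by_group.items()}

def group_gap(ratings_by_group, group_a, group_b):
    """Difference in average rating between two groups."""
    return mean(ratings_by_group[group_a]) - mean(ratings_by_group[group_b])

# Hypothetical model outputs: every juror carrying the "Group X" label gets the same 9/10.
model = {"Group X": [9, 9, 9, 9], "Republican": [8, 9, 8, 9], "Democrat": [7, 7, 6, 7]}
# Hypothetical human ratings for the same juror profiles: healthy variation.
human = {"Group X": [6, 9, 4, 8], "Republican": [7, 8, 6, 7], "Democrat": [7, 6, 7, 7]}

print(within_group_spread(model)["Group X"])          # 0.0 -> flattening
print(within_group_spread(human)["Group X"])          # > 0 -> real jurors vary

# Amplification: the model's partisan gap is far wider than the human one.
print(group_gap(model, "Republican", "Democrat"))     # 1.75
print(group_gap(human, "Republican", "Democrat"))     # 0.25
```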

Secret, Built-in Instructions Are Shaping the Models' Judgments

Every time a user interacts with a major AI platform, the query is filtered through "hidden system prompts"—undisclosed instructions that guide the AI's behavior. These secret rules, which are different for each model, have a profound impact on their judgments and appear to create their distinct reasoning styles.

  • Claude's system prompt contains an "extensive ethical overlay." It is instructed to avoid harm, protect vulnerable groups, and to decline to help if a user's intent seems questionable. This produces a "cautious and deferential" tone, which likely contributes to its tendency to be overly skeptical and underrate evidence.
  • Gemini's system prompt focuses almost entirely on formatting rules, with no apparent policy restrictions on content. This lack of explicit ethical guardrails may contribute to its "highly assertive and rhetorically confident" style, resulting in a more punitive and evidence-overrating personality.
  • GPT's system prompt is "comparatively minimalistic," lacking the heavy-handed ethical instructions of Claude or the formatting focus of Gemini. This likely allows for its more "measured, analytical" tone and its balanced, unbiased performance.

The key takeaway is that these invisible design choices act as powerful filters on the AI's output. Any attempt to use these off-the-shelf platforms for serious analytical tasks risks "importing unexamined policy judgments baked into the system, without transparency or accountability."
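
The mechanics are simple to illustrate: every API request can carry a system message the end user never sees, and that message shapes the answer. The sketch below uses the OpenAI Python client as an example; the two system prompts are caricatures written for illustration, not the platforms' actual hidden instructions, and the model name is a placeholder.

```python
# Minimal sketch (not the study's methodology): the same question sent with two
# different system prompts, to show how unseen instructions can steer a model.
# Requires the `openai` package and an API key in the environment.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

question = "On a scale of 1-10, how strong is this piece of circumstantial evidence?"

for system_prompt in (
    "You are cautious. Avoid harm and defer when intent is unclear.",   # Claude-like overlay
    "Answer assertively and confidently. Focus on clean formatting.",   # Gemini-like overlay
):
    response = client.chat.completions.create(
        model="gpt-4.1",  # placeholder model name
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
    )
    print(system_prompt[:30], "->", response.choices[0].message.content[:80])
```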

A Path Toward a Better AI Simulator

The study reveals that today's major AI models fail to simulate human jurors due to two core issues. They lack procedural transparency, with hidden rules shaping their outputs, and they demonstrate substantive unreliability through "distortions in reasoning, flattening of demographic variation, and outputs that often reflected institutional ideology more than empirical cognition." But the researchers didn't stop at diagnosing the failure; they used their findings to build something better.

In a hopeful and constructive final step, the researchers took the data they collected from hundreds of human jurors and used it to "fine-tune" an open-source AI model. The results were dramatic. This new, custom-trained model outperformed GPT, Claude, and Gemini on both accuracy (with a Mean Absolute Error of just 1.67) and bias. This proves that building a better, more faithful simulation is possible if it's grounded in real human data and transparent methods.
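
The paper does not publish its training pipeline, but the general recipe is straightforward to sketch: pair each human juror's profile and the evidence they saw with the rating they actually gave, then use those pairs as supervised fine-tuning examples. The JSONL layout and field names below are illustrative assumptions, not the study's schema.

```python
# Hypothetical sketch: converting collected human juror responses into a JSONL
# file of supervised fine-tuning examples. Field names and records are invented.
import json

juror_records = [
    {
        "profile": "54-year-old female teacher, associate degree, moderate",
        "evidence": "Eyewitness places the defendant near the scene at 9 p.m.",
        "rating": 6,   # the real juror's 1-10 strength rating
    },
    # ... hundreds more human responses ...
]

with open("juror_finetune.jsonl", "w") as f:
    for rec in juror_records:
        example = {
            "prompt": (
                f"You are a juror: {rec['profile']}.\n"
                f"Evidence: {rec['evidence']}\n"
                "Rate the strength of this evidence from 1 to 10."
            ),
            "completion": str(rec["rating"]),
        }
        f.write(json.dumps(example) + "\n")
```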

Ultimately, the study serves as a crucial reality check. The goal isn't just to build an AI that can produce fluent text, but one that can reflect the complex, messy, and nuanced reality of human judgment.

Because in the end, a model that puts Michael Scott in the jury pool is not simulating juror reasoning; it's simulating television.

The question for us, then, is how to build AI that captures that reality rather than a simplified, statistical caricature of it.

https://dx.doi.org/10.2139/ssrn.5400737