The Rubric Is the Product

Harvey recently published results that are going to get read as an AI win. A team ran 12 legal tasks through an agent, built a feedback loop where an LLM judge scored the output against a rubric, had a coding agent cluster the failures and rewrite the surrounding workflow, and watched average scores climb from roughly 40% to nearly 88%. Seven of twelve tasks cleared 90%. One hit 100%. They called it "auto-research" and "harness engineering," and the framing is that you can hill-climb agent reliability without touching model weights.

The real story is the rubric.

Somebody had to sit down and write, for each of those 12 tasks, a description of what "good" looks like — detailed enough, specific enough, and unambiguous enough for a language model to grade another language model's work against it. For lease review. For complaint drafting. For tax memos. For disclosure schedules. For diligence responses. Those rubrics are the thing that made everything else possible. Without them, there is no judge. Without the judge, there is no feedback. Without the feedback, there is no loop. The loop is what people are celebrating. The rubric is what made the loop real.

And the rubric is a lawyer job.

For the last two years, the conversation about legal AI has treated model capability as the bottleneck — whether it can reason about contracts, handle novel fact patterns, keep track of cross-references in a credit agreement. Every release cycle got framed as a capability question, and every buying decision inside legal departments got framed the same way: which model, which vendor, which benchmark.

Harvey's experiment suggests the bottleneck has moved. The model is not what made those scores double. What made them double was a workflow that could tell the model when it was wrong in a way the model could act on. That's a harness problem, yes. It's also a definition problem. You cannot build a harness around a task you cannot articulate.

The quiet implication is that the firms and legal teams that will get the most out of agents over the next year are not the ones with the best model access. They are the ones who can describe their own work precisely enough to grade it.

Most cannot.

Legal practice runs on tacit knowledge. A senior associate reviewing a diligence response does not walk through a written checklist. They read, and something feels off, and they ask a question, and the answer reshapes what they're looking for. Good lawyers encode what "complete" and "accurate" and "responsive" mean through years of being corrected by people who encoded it the same way.

That's fine for training humans. It's useless as input to an evaluator loop.

To run Harvey's experiment on your own workflow, you have to take the thing the senior associate knows and make it explicit. You have to write down the criteria. You have to define the failure modes. You have to decide, in advance, what a 7 out of 10 looks like versus a 9 out of 10, and do it in language precise enough that a judge model reading it next week will grade the same work the same way you would.
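One way to picture what "explicit" means here is a rubric written as data. The sketch below is illustrative only: the NDA criteria, the weights, and the names `Criterion`, `NDA_RUBRIC`, and `weighted_score` are all hypothetical, not anything taken from Harvey's write-up. It shows the minimum a grader needs: named criteria, a plain-language description of each, a weight, and a rule for turning per-criterion marks into one score.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str
    description: str  # what the grader checks, in plain language
    weight: float     # relative importance; any positive number


# Hypothetical criteria for an NDA review. A real rubric would be longer
# and would be written by the lawyers who do the work.
NDA_RUBRIC = [
    Criterion("term", "States the confidentiality term and flags any perpetual obligation.", 2.0),
    Criterion("carve_outs", "Lists each exclusion from the definition of confidential information.", 1.0),
    Criterion("governing_law", "Identifies the governing law and forum clauses.", 1.0),
]


def weighted_score(rubric, marks):
    """Combine per-criterion marks (each between 0 and 1) into one score between 0 and 1."""
    total_weight = sum(c.weight for c in rubric)
    if total_weight <= 0:
        raise ValueError("rubric weights must sum to a positive number")
    missing = [c.name for c in rubric if c.name not in marks]
    if missing:
        # Refuse to score silently when a criterion was never graded.
        raise KeyError(f"no mark recorded for: {missing}")
    return sum(c.weight * marks[c.name] for c in rubric) / total_weight


# Full credit on the term, half credit on carve-outs, none on governing law.
example_marks = {"term": 1.0, "carve_outs": 0.5, "governing_law": 0.0}
print(weighted_score(NDA_RUBRIC, example_marks))  # 0.625
```

The code is the easy part. The descriptions and the weights are where the legal judgment goes, and no amount of engineering supplies them.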

This is not a prompt engineering task. This is a legal knowledge management task the profession has been avoiding for forty years, dressed up in new clothes.

The shape of the problem is the same everywhere. The experienced partner can tell you in thirty seconds whether a memo is any good. She cannot tell you, in writing, the twelve things she was checking for. Ask her to, and you'll get four. The other eight live in her reading reflexes.

Harvey's result is what happens when somebody does the work of extracting all twelve.

The rubric is also the audit trail.

If a regulator or a client or opposing counsel asks how an agent-produced work product was evaluated, the answer is no longer "we used a model and reviewed the output." The answer is the rubric itself — the criteria, the weightings, the failure categories, the judge's reasoning against each criterion. The rubric is the thing you hand over. The rubric is what you defend.

This matters for anyone thinking about AI governance in a legal organization. The harness is not a performance optimization that lives in the engineering team. It is a legal artifact. It encodes your firm's definition of competent work. It determines what the agent is being optimized toward, which determines where it will drift when conditions change, which determines where your risk actually lives.

Product counsel has a role here that nobody is currently filling. Somebody has to own the question of whether the rubric matches the standard of care. Somebody has to decide whether a 90% score against the rubric is actually a pass, or whether the rubric is grading the wrong things well. Somebody has to notice when the judge and the agent start agreeing with each other for reasons that have nothing to do with the underlying legal question. All of those are lawyer problems.

If you are sitting inside a legal team watching Harvey's results and wondering what the equivalent move is, the move is to pick one workflow you run often and try to write its rubric. Skip the agent infrastructure shopping.

Not the whole practice. One workflow. NDA review, maybe. Or a specific kind of diligence request. Something narrow enough that you can actually finish.

Then find out whether you can do it. Write down, with real specificity, what a good output looks like and what the failure modes are. Give the rubric to a colleague who does the same work and see if they grade the same sample the same way you do. If they don't, your rubric is not yet the thing you think it is — your colleague is reading from the same reflexes you are, and the rubric has failed to capture them.
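The colleague test can be made concrete with a few lines of arithmetic. The sketch below assumes both graders marked the same sample pass (1) or fail (0) on each criterion; the criterion names and marks are invented for illustration. It reports raw agreement, Cohen's kappa (a standard statistic that discounts the agreement two graders would reach by chance), and the criteria where the two graders split, which are the ones to rewrite first.

```python
def percent_agreement(a, b):
    """Fraction of criteria on which two graders gave the same mark."""
    if len(a) != len(b) or not a:
        raise ValueError("both graders must mark the same non-empty set of criteria")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Agreement corrected for chance: 1.0 is perfect, 0.0 is chance level."""
    observed = percent_agreement(a, b)
    n = len(a)
    labels = set(a) | set(b)
    # Chance agreement: probability both graders pick the same label independently.
    expected = sum((a.count(label) / n) * (b.count(label) / n) for label in labels)
    if expected == 1.0:
        return 1.0  # both graders used one identical label throughout
    return (observed - expected) / (1 - expected)


def disagreements(criteria, a, b):
    """Names of the criteria where the two graders gave different marks."""
    return [name for name, x, y in zip(criteria, a, b) if x != y]


# Hypothetical criteria and marks for one sample NDA review.
criteria = ["term", "carve_outs", "governing_law", "remedies", "return_of_materials", "assignment"]
mine = [1, 1, 0, 1, 0, 1]
theirs = [1, 0, 0, 1, 0, 1]

print(round(percent_agreement(mine, theirs), 2))  # 0.83
print(round(cohens_kappa(mine, theirs), 2))       # 0.67
print(disagreements(criteria, mine, theirs))      # ['carve_outs']
```

One sample is not enough to draw conclusions from. The point of the exercise is the list of disagreements, because each one marks a criterion whose wording still depends on reflexes rather than text.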

Most legal teams will discover, in that exercise, that they do not yet have a sharable definition of their own work. That discovery is more valuable than any agent. It is the precondition for everything that comes next — for automation, for quality control, for training junior lawyers, for defending the work to a client who asks what you actually did. The agent conversation just made it urgent.

Harvey's harness works because somebody did the hard part first. The hard part is the part that looks like writing. The hard part is the part lawyers are actually built to do.

The model was never the bottleneck. We just kept looking at it because it was the new thing in the room.
