AI Beats Law Professors in Blind Legal Reasoning Tests

Law professors preferred answers generated by artificial intelligence over answers written by fellow professors, according to a recent study led by Stanford University that examined how large language models perform on legal reasoning tasks. Apparently, the robots are not only better at citing cases—they're also better at not droning on about them.

In the study, 16 professors from 14 U.S. law schools—including Stanford, Yale, New York University, the University of Chicago, Georgetown, UCLA, and the University of Virginia—created 40 contract law questions covering legal doctrine, case law, hypotheticals, and policy issues. Researchers saw it as an ideal way to test the capabilities of modern AI. "Large language models (LLMs) are increasingly promoted as educational tutors, yet most evaluations focus on domains with a single ground truth," the researchers wrote. "Many disciplines, however, hinge on judgment: reasoning, weighing ambiguity, and reaching defensible conclusions. Law provides a sharp test."

In 2,918 blinded comparisons, professors selected the answer they would rather give a student. Google's Gemini 2.5 Pro won 75.92% of its matchups against human instructors, while the tech giant's NotebookLM won 74.75% of the time, giving AI-generated results the nod over humans in roughly three-quarters of responses. The professors, one suspects, did not enjoy this part of the experiment.

To determine whether the results reflected a broader professional consensus, the researchers analyzed how often professors agreed when evaluating the same answer pairs. "Observed agreement exceeded the level expected if judgments were entirely idiosyncratic, indicating that the LLMs' success reflects alignment with common disciplinary criteria," they wrote. In other words, they were not simply arguing about everything, for once.

The study found that AI models also outperformed human instructors across multiple categories, including recall questions relating to case, code, or doctrine, hypotheticals, and policy discussions. "To probe whether any LLM advantage might be driven by surface-level writing style rather than substantive content, we additionally engineered a set of lexico-syntactic features—answer length, structural organization, reasoning nuance, legal anchors, confidence tone, clarity, and pedagogical support—and tested how much of the preference pattern they could explain," the study said.

Courts around the world are straining under growing caseloads, and a pilot program in Los Angeles is hoping to change that by testing whether AI can assist judges without offloading their judgment. The Los Angeles Superior Court is testing an AI tool called Learned Hand that summarizes filings, organizes evidence, and generates draft rulings in civil cases. The goal is to reduce time spent on administrative tasks so judges can focus on the parts of a case that require legal analysis and discretion.

AI-generated answers were also flagged as harmful less often than those written by professors, with Gemini recording a 3.41% harmfulness rate and NotebookLM 3.64%, compared with 12.06% for human instructors. Humans, it turns out, are still better at the wrong things.

In a separate analysis of additional models, Anthropic's Claude Opus 4.7 ranked first, followed by OpenAI's ChatGPT 5.4 and Gemini 2.5 Pro, while every AI model evaluated outperformed human instructors on average. The professors, to their credit, did not resign on the spot.

The researchers cautioned that the study did not measure whether the answers matched each professor's individual teaching preferences, leaving open the possibility that AI-generated responses were viewed as generally acceptable rather than tailored to any one instructor's approach. "While LLM responses are generally preferred over those of human instructors, our evaluation setting does not allow us to directly measure the extent to which instructor preferences are satisfied," the study said. "It is at least theoretically possible that LLMs, although generally delivering stronger responses, still generate answers that are merely viewed as 'good enough.'"

The study comes as courts, law firms, and law schools increasingly grapple with how artificial intelligence should be used in the legal profession. Law firm Sullivan & Cromwell has admitted to a U.S. bankruptcy court that the machines are, slowly but surely, picking up the brief.
