The intelligence of AI models is not what holds back enterprise implementations. It is primarily the inability to define and measure quality.
This is where AI judges are now playing an increasingly essential role. In AI assessment, a “judge” is an AI system that evaluates the performance of another AI system.
Judge Builder is Databricks’ judge-creation framework, first introduced as part of the company’s Agent Bricks technology earlier this year. The framework has evolved significantly since launch in response to direct user feedback and deployments.
Early releases focused on technical implementation, but customer feedback revealed that the real bottleneck was organizational alignment. Databricks now offers a structured workshop process that guides teams through three main challenges: getting stakeholders to agree on quality criteria, capturing expertise from time-constrained domain experts, and scaling evaluation systems.
“Model intelligence is usually not the bottleneck, models are truly intelligent,” Jonathan Frankle, chief artificial intelligence scientist at Databricks, told VentureBeat during an exclusive briefing. “Instead, it’s really about the question of how do we get the models to do what we want them to do, and how do we know if they did what we wanted them to do?”
AI evaluation’s “Ouroboros problem”
Judge Builder addresses what Pallavi Koppol, a Databricks research scientist who led its development, calls the “Ouroboros problem,” after the ancient symbol of a snake eating its own tail. Using AI systems to evaluate other AI systems creates a circular validation challenge.
“You want a judge to check if your AI system is good, but then your judge is also an AI system,” Koppol explained. “And now you say, well, how do I know this judge is good?”
The solution is to measure “distance to expert truth” as the primary scoring function. By minimizing the difference between how an AI judge evaluates results and how domain experts would evaluate them, organizations can trust these judges as scalable proxies for human evaluation.
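As an illustrative sketch only (the article does not specify the formula), “distance to expert truth” can be as simple as the average gap between a judge’s ratings and expert ratings on the same outputs:

```python
# Illustrative sketch of "distance to expert truth" as a judge-scoring function.
# The averaging scheme is an assumption; Databricks' exact formula is not public.

def distance_to_expert_truth(judge_scores, expert_scores):
    """Mean absolute gap between a judge's ratings and expert ratings."""
    pairs = list(zip(judge_scores, expert_scores))
    return sum(abs(j - e) for j, e in pairs) / len(pairs)

# Expert ratings on a 1-5 scale vs. two candidate judges.
experts = [5, 2, 4, 1, 3]
judge_a = [5, 2, 3, 1, 3]   # tracks the experts closely
judge_b = [3, 4, 3, 3, 3]   # hedges toward the middle of the scale

print(distance_to_expert_truth(judge_a, experts))  # 0.2 -> the better proxy
print(distance_to_expert_truth(judge_b, experts))  # 1.4
```

The judge with the smaller distance is the better stand-in for the expert panel, which is what lets it be trusted at scale.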
This approach differs fundamentally from typical guardrail systems or single-metric evaluations. Instead of asking whether an AI output passes a generic quality check, Judge Builder creates highly specific evaluation criteria tailored to an organization’s expertise and business requirements.
The framework’s technical design also sets it apart. Judge Builder integrates with MLflow and prompt optimization tools and can work with any underlying model. Teams can version-control their judges, track performance over time, and deploy multiple judges simultaneously across different quality dimensions.
Lessons learned: building judges that actually work
Databricks’ work with enterprise clients has revealed three key takeaways that apply to anyone evaluating AI.
Lesson one: Your experts don’t agree as much as you think. When quality is subjective, organizations find that even their experts disagree on what constitutes an acceptable result. A customer service response may be factually correct but delivered in an inappropriate tone. A financial summary may be comprehensive but too technical for its intended audience.
“One of the biggest lessons from this whole process is that all problems become people problems,” Frankle said. “The hardest thing is to translate an idea from the human brain into something unambiguous. And the hardest thing is that companies are not one brain, but many brains.”
The fix is batched annotation with inter-rater reliability checks. Teams annotate examples in small batches, then measure agreement before continuing, which surfaces misalignment early. In one case, three experts gave the same output ratings of 1, 5, and neutral before discussion revealed they were interpreting the rating criteria differently.
Companies using this approach achieve inter-rater reliability scores around 0.6, compared with the typical 0.3 for third-party annotation services. Greater consistency translates directly into better judge performance because the training data contains less noise.
Lesson two: Split vague criteria into specific judges. Instead of one judge assessing whether a response is “relevant, factual and concise,” create three separate judges, each focused on a single aspect of quality. This granularity matters because a low “overall quality” score reveals that something is wrong but not what needs to be fixed.
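A minimal sketch of what that split might look like; the prompts and the stubbed `call_llm` are hypothetical illustrations, not Judge Builder’s actual API:

```python
# Hypothetical sketch: one vague judge split into three specific ones.
# Prompts and the call_llm stub are illustrative inventions for this example.

JUDGE_PROMPTS = {
    "relevance":   "Does the response address the user's question? Score 1-5.",
    "factuality":  "Is every claim supported by the provided context? Score 1-5.",
    "conciseness": "Is the response free of unnecessary content? Score 1-5.",
}

def call_llm(prompt, response):
    # Stand-in for a real model call; returns a fixed low score on one
    # dimension so the demo shows a pinpointed failure.
    return 2 if "unnecessary" in prompt else 5

def evaluate(response):
    """Run every specialized judge and report per-dimension scores."""
    return {name: call_llm(prompt, response) for name, prompt in JUDGE_PROMPTS.items()}

scores = evaluate("Q3 revenue grew 12 percent, driven by new subscriptions.")
print(scores)  # a low score now points at exactly which dimension to fix
```

With three scores instead of one, a failing response immediately identifies the dimension that needs work.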
The best results come from combining top-down requirements, such as regulatory constraints and stakeholder priorities, with bottom-up discovery of observed failure patterns. One customer built a top-down accuracy judge, but through data analysis discovered that correct answers almost always came from the top two retrieved results. That insight became a new, production-ready judge that could gauge accuracy without requiring ground-truth labels.
Lesson three: You need fewer examples than you think. Teams can build robust judges from just 20-30 well-chosen examples. The key is to pick edge cases that reveal disagreement rather than obvious examples everyone agrees on.
“For some teams, we can do this process in as little as three hours, so it doesn’t take that long to get a good judge,” Koppol said.
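One way to operationalize “pick edge cases that reveal disagreement” (a sketch, assuming each candidate example already carries a handful of expert scores) is to rank candidates by the spread of those scores:

```python
# Sketch: surface the most contested examples for the annotation session,
# on the article's premise that edge cases carry more signal than easy ones.
from statistics import pstdev

def pick_edge_cases(examples, k):
    """examples: (text, expert_scores) pairs; return the k with most disagreement."""
    return sorted(examples, key=lambda ex: pstdev(ex[1]), reverse=True)[:k]

candidates = [
    ("obvious pass",        [5, 5, 5]),
    ("borderline tone",     [1, 5, 3]),
    ("maybe too technical", [2, 4, 4]),
    ("obvious fail",        [1, 1, 1]),
]
for text, scores in pick_edge_cases(candidates, k=2):
    print(text, scores)  # the two contested cases, not the unanimous ones
```

Spending the 20-30 example budget on the contested cases is what lets a few hours of expert time calibrate a judge.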
Production results: from pilot projects to seven-figure implementations
Frankle shared three metrics that Databricks uses to measure Judge Builder’s success: whether customers want to employ it again, whether they are increasing their AI spending, and whether they are progressing further on their AI journey.
For the first metric, one customer created a dozen judges after an initial workshop. “This customer created over a dozen judges after we first walked them through a rigorous process using this platform,” Frankle said. “They really went to town on judges, and now they measure everything.”
For the second metric, the business impact is clear. “Many customers attended this workshop and spent seven-figure sums on GenAI at Databricks in a way they hadn’t done before,” Frankle said.
The third metric shows Judge Builder’s strategic value. Customers who were previously hesitant to employ advanced techniques such as reinforcement learning can now implement them with confidence because they can measure whether improvement has actually occurred.
“There are customers who have done very advanced things that they previously weren’t willing to do,” Frankle said. “They went from doing a bit of prompt engineering to reinforcement learning with us. Why spend money and energy on reinforcement learning if you don’t know whether it actually made a difference?”
What companies should do now
Teams that successfully take AI from pilot to production treat judges not as disposable artifacts, but as evolving assets that grow with their systems.
Databricks recommends three practical steps. First, focus on high-impact judges by identifying one critical regulatory requirement and one observed failure mode. These become your initial judge portfolio.
Second, create lightweight workflows with subject matter experts. A few hours reviewing 20-30 edge cases provides sufficient calibration for most judges. Use batched annotation and inter-rater reliability checks to denoise your data.
Third, schedule regular judge reviews against production data. As your system evolves, new failure modes will emerge, and your judge portfolio should evolve with them.
“Evaluation is a way to evaluate the model; it’s also a way to create guardrails, a way to get a metric against which you can do prompt optimization, and a way to get a metric against which you can do reinforcement learning,” Frankle said. “Once you have a judge that you know represents your human taste in an empirical form you can query as often as you want, you can use it in 10,000 different ways to measure or improve your agents.”
