MLCommons, a nonprofit that helps companies measure the performance of their AI systems, is launching a new benchmark that will also assess the downsides of AI.
The new benchmark, called AILuminate, assesses the responses of large language models to more than 12,000 test prompts across 12 categories, including incitement to violent crime, child sexual abuse, hate speech, promotion of self-harm, and intellectual property infringement.
Models are rated “poor,” “fair,” “good,” “very good,” or “excellent,” depending on their performance. The prompts used to test the models are kept secret to prevent them from ending up in training data that would let a model ace the test.
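AILuminate’s exact scoring methodology is not detailed here, but the basic idea of grading a model by category can be illustrated with a minimal sketch. The code below is hypothetical: the category names, violation-rate thresholds, and grading function are assumptions for illustration, not MLCommons’ actual implementation.

```python
# Illustrative sketch only: NOT the AILuminate implementation.
# Thresholds, category names, and tallies below are hypothetical.
from collections import defaultdict

def grade_from_violation_rate(rate: float) -> str:
    """Map the share of unsafe responses in a category to a five-level grade."""
    if rate > 0.20:
        return "poor"
    if rate > 0.10:
        return "fair"
    if rate > 0.05:
        return "good"
    if rate > 0.01:
        return "very good"
    return "excellent"

def score_model(results):
    """results: iterable of (hazard_category, response_was_unsafe) pairs."""
    unsafe = defaultdict(int)
    total = defaultdict(int)
    for category, was_unsafe in results:
        total[category] += 1
        unsafe[category] += int(was_unsafe)
    return {c: grade_from_violation_rate(unsafe[c] / total[c]) for c in total}

# Example with made-up tallies for two hazard categories.
demo = (
    [("hate_speech", False)] * 94 + [("hate_speech", True)] * 6
    + [("self_harm", False)] * 99 + [("self_harm", True)] * 1
)
print(score_model(demo))  # {'hate_speech': 'good', 'self_harm': 'excellent'}
```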
Peter Mattson, founder and president of MLCommons and a senior engineer at Google, says measuring the potential harm of AI models is technically hard, leading to inconsistencies across the industry. “Artificial intelligence is a really young technology, and artificial intelligence testing is a really young discipline,” he says. “Improving safety benefits society; it also benefits the market.”
Reliable, independent ways to measure AI risk may become more essential under the next U.S. administration. Donald Trump has promised to scrap President Biden’s AI executive order, which introduced measures to ensure companies use AI responsibly, as well as a new AI Safety Institute to test powerful models.
These efforts could also provide a more international perspective on the harms caused by AI. MLCommons counts many international companies among its member organizations, including the Chinese firms Huawei and Alibaba. If all of these companies used the new benchmark, it would make it possible to compare AI safety in the U.S., China, and other countries.
Several large U.S. AI vendors have already used AILuminate to test their models. Anthropic’s Claude model, Google’s smaller Gemma model, and Microsoft’s Phi model scored “very good” in testing. OpenAI’s GPT-4o model and Meta’s largest Llama model both received “good” ratings. The only model to receive a “poor” rating was OLMo from the Allen Institute for AI, although Mattson notes that it is a research offering that was not designed with safety in mind.
“Overall, it’s good to see scientific rigor in AI evaluation processes,” says Rumman Chowdhury, CEO of Humane Intelligence, a nonprofit that specializes in testing, or red-teaming, AI models for misbehavior. “We need best practices and inclusive measurement methods to determine whether AI models perform as we expect.”