Our comprehensive benchmark and online leaderboard offer a much-needed measure of how accurately LLMs ground their responses in the source material provided and avoid hallucinations
Large language models (LLMs) are transforming how we access information, yet their factual accuracy remains imperfect. They can “hallucinate” false information, particularly when given complex inputs. In turn, this can erode trust in LLMs and limit their real-world applications.
Today we are introducing FACTS Grounding, a comprehensive benchmark for evaluating the ability of LLMs to generate responses that are not only factually accurate with respect to the input data provided, but also sufficiently detailed to provide satisfactory answers to user queries.
We hope our benchmark will spur industry-wide progress on factuality and grounding. To track progress, we are also launching the FACTS leaderboard on Kaggle. We have already tested leading LLMs using FACTS Grounding and have populated the initial leaderboard with their grounding scores. We will maintain and update the leaderboard as the field advances.
FACTS Grounding Dataset
To accurately evaluate the factuality and grounding of any given LLM, the FACTS Grounding dataset contains 1,719 examples, each carefully crafted to require a long-form response grounded in the context document provided. Each example comprises a document, a system instruction requiring the LLM to exclusively reference the provided document, and an accompanying user request.
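For illustration, a single example could be represented roughly as follows. This is only a sketch of the structure described above; the field names are hypothetical and not the actual schema of the released dataset files.

```python
# Hypothetical sketch of one FACTS Grounding example; the real dataset
# schema and field names may differ.
example = {
    "context_document": "Full text of the source document (up to ~32,000 tokens)...",
    "system_instruction": (
        "Answer the user's request using only information contained in the "
        "provided document. Do not draw on outside knowledge."
    ),
    "user_request": "Summarize the key findings of the attached report.",
}
```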
All examples are divided into a “public” set (860) and a “private” held-out set (859). We are releasing the public set today so that anyone can use it to evaluate an LLM. Of course, we know that benchmark contamination and leaderboard hacking are important to protect against, so, following standard industry practice, we are keeping the private evaluation set held out. The FACTS leaderboard scores are the average performance across both the public and private sets.
To ensure a diversity of inputs, the FACTS Grounding examples include documents of varying lengths, up to a maximum of 32,000 tokens (roughly 20,000 words), covering domains such as finance, technology, retail, medicine, and law. The user requests are similarly wide-ranging, including requests for summarization, Q&A generation, and rewriting tasks. We did not include any examples that could require creativity, mathematics, or complex reasoning – capabilities that might require the model to apply more advanced reasoning in addition to grounding.
Collective assessment by leading LLMs
To succeed on a given example, an LLM must synthesize the complex information in the document and generate a long-form response that is both a comprehensive answer to the user request and fully attributable to that document.
FACTS Grounding evaluates model responses automatically using three frontier LLM judges – namely Gemini 1.5 Pro, GPT-4o, and Claude 3.5 Sonnet. We selected a combination of different judges to mitigate any potential bias of a judge giving higher scores to responses produced by a member of its own model family. The automated judge models were comprehensively evaluated against a held-out test set to find the best-performing judge prompt templates and to verify agreement with human raters.
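To make the setup concrete, a judge prompt for the grounding phase might look like the sketch below. This template is purely illustrative and is not the actual template used in the benchmark (those are documented in the paper); the names `GROUNDING_JUDGE_PROMPT` and `build_judge_prompt` are assumptions for this example.

```python
# A minimal, hypothetical judge prompt template for the grounding phase.
# The actual prompt templates used by FACTS Grounding are described in the
# paper; this sketch only illustrates the kind of inputs a judge model sees.
GROUNDING_JUDGE_PROMPT = """\
You are given a source document, a user request, and a model response.
Decide whether every claim in the response is fully supported by the
document. Answer with a single word: "grounded" or "ungrounded".

Document:
{document}

User request:
{user_request}

Model response:
{response}
"""


def build_judge_prompt(document: str, user_request: str, response: str) -> str:
    """Fill the template for one (example, response) pair."""
    return GROUNDING_JUDGE_PROMPT.format(
        document=document, user_request=user_request, response=response
    )
```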
Each FACTS Grounding example is judged in two phases. First, responses are evaluated for eligibility and disqualified if they don’t sufficiently address the user’s request. Second, responses are judged as factually accurate if they are fully grounded in information contained in the provided document and contain no hallucinations.
With the eligibility and grounding accuracy of a given LLM response evaluated separately by multiple AI judge models, the results are then aggregated to determine whether the LLM handled the example successfully. The final score for the overall grounding task is the average of all judge models’ scores across all examples. Find more details of our FACTS Grounding evaluation methodology in our paper.
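As a rough sketch of how such scores could be aggregated, the snippet below averages per-judge results in which a response only counts as correct when it is both eligible and fully grounded. This is an assumption-laden illustration, not the exact aggregation procedure from the paper; the data layout and field names are made up.

```python
from statistics import mean


def aggregate_facts_score(per_judge_results):
    """Average grounding scores over judges and examples (illustrative only).

    `per_judge_results` maps a judge model's name to a list of per-example
    dicts with two hypothetical boolean fields:
      - "eligible": the response sufficiently addresses the user request
      - "grounded": every claim is supported by the provided document
    A response counts as correct only if it is both eligible and grounded.
    """
    judge_scores = []
    for examples in per_judge_results.values():
        correct = [1.0 if ex["eligible"] and ex["grounded"] else 0.0
                   for ex in examples]
        judge_scores.append(mean(correct))
    # Final score: the average of each judge model's score across all examples.
    return mean(judge_scores)


# Example usage with made-up results from two judges on two examples:
results = {
    "judge_a": [{"eligible": True, "grounded": True},
                {"eligible": True, "grounded": False}],
    "judge_b": [{"eligible": True, "grounded": True},
                {"eligible": False, "grounded": True}],
}
print(aggregate_facts_score(results))  # -> 0.5
```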
FACTS Grounding will continue to evolve
We recognize that progress can quickly outpace benchmarks, so the launch of our FACTS Grounding benchmark and leaderboard is just the beginning. Factuality and grounding are among the key factors that will shape the future success and usefulness of LLMs and broader AI systems, and our aim is to grow and refine FACTS Grounding as the field progresses, continually raising the bar.
We encourage the AI community to engage with FACTS Grounding, evaluate their models on the open set of examples, or submit their models for evaluation. We believe that comprehensive benchmarking methods, coupled with continuous research and development, will continue to improve AI systems.
Acknowledgements
FACTS Grounding was led by: Alon Jacovi, Andrew Wang, Chris Alberti, Connie Tao, Dipanjan Das, Jon Lipovetz, Kate Olszewska, Lukas Haas, Michelle Liu and Nate Keating.
We also greatly appreciate contributions from: Adam Bloniarz, Carl Saroufim, Corey Fry, Dror Marcus, Doron Kukliansky, Gaurav Singh Tomar, James Swirhun, Jinwei Xing, Lily Wang, Madhu Gurumurthy, Michael Aaron, Moran Ambar, Rachana Fellinger, Rui Wang, Zizhao Zhang and Sasha Goldshtein.
We would also like to thank Avinatan Hassidim, D. Sculley, Fernando Pereira, Koray Kavukcuoglu, Slav Petrov, Ya Xu and Yossi Matias for their continuous support.