Despite their impressive capabilities, large language models are far from perfect. These AI models sometimes “hallucinate,” generating incorrect or unsupported information in response to a query.
Because of this hallucination problem, an LLM’s responses are often checked by human fact-checkers, especially when the model is deployed in high-stakes areas such as health care or finance. However, validation typically requires people to read through the long documents the model cites, a task so tedious and error-prone that it may deter some users from deploying generative AI models in the first place.
To help human reviewers, MIT researchers created a user-friendly system that lets people verify an LLM’s responses much more quickly. With this tool, called SymGen, an LLM generates responses with citations that point directly to a location in a source document, such as a specific cell in a database.
Users hover over highlighted portions of a text response to see the data the model used to generate a given word or phrase. At the same time, the unhighlighted portions show which phrases need extra attention to check and verify.
“We give users the ability to selectively focus on parts of the text they should be more concerned about. Ultimately, SymGen can boost people’s confidence in a model’s responses because they can easily take a closer look and make sure the information has been verified,” says Shannon Shen, an electrical engineering and computer science (EECS) graduate student and co-author of a paper on SymGen.
In a user study, Shen and his collaborators found that SymGen sped up verification time by about 20 percent compared with manual procedures. By making it faster and easier for people to validate model outputs, SymGen could help identify errors in LLMs deployed in a variety of real-world settings, from generating clinical notes to summarizing financial market reports.
Shen is joined on the paper by co-author and fellow EECS graduate student Lucas Torroba Hennigen; EECS graduate student Aniruddha “Ani” Nrusimha; Bernhard Gapp, president of the Good Data Initiative; and senior authors David Sontag, a professor of EECS, a member of the MIT Jameel Clinic, and leader of the Clinical Machine Learning Group in the Computer Science and Artificial Intelligence Laboratory (CSAIL); and Yoon Kim, an assistant professor of EECS and a member of CSAIL. The research was recently presented at a conference on language modeling.
Symbolic references
To aid in validation, many LLMs are designed to generate citations that point to external documents, alongside their language-based responses, so users can check them. But these verification systems are typically designed as an afterthought, without considering the effort it takes to sift through numerous citations, Shen says.
“Generative AI is intended to reduce the user’s time to complete a task. If you have to spend hours reading through all these documents to verify that the model is saying something reasonable, then having the generations is less helpful in practice,” Shen says.
The researchers approached the validation problem with the humans who will do the work in mind.
A SymGen user first provides the LLM with data it can refer to in its response, such as a table containing statistics from a basketball game. Then, rather than immediately asking the model to complete a task, like generating a game summary from those data, the researchers insert an intermediate step: they prompt the model to generate its response in a symbolic form.
With this prompt, every time the model wants to cite words in its response, it must write the name of the specific cell in the data table that contains the information it is referencing. For instance, if the model wants to cite the phrase “Portland Trailblazers” in its response, it replaces that text with the name of the cell in the data table that contains those words.
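As an illustration, here is a minimal sketch of what such a symbolic response might look like; the cell names and placeholder syntax below are hypothetical, not SymGen’s actual format:

```python
# Hypothetical source data: each cell has a name the model can cite.
data_table = {
    "team_home": "Portland Trailblazers",
    "team_away": "Utah Jazz",
    "score_home": "126",
    "score_away": "112",
}

# Hypothetical symbolic response from the LLM: instead of repeating the
# values, it writes the names of the cells it is drawing on.
symbolic_response = "The {team_home} defeated the {team_away} {score_home}-{score_away}."
```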
“Because we have this intermediate step that has the text in a symbolic format, we can get really fine-grained references. We can say that every span of text in the output corresponds exactly to this place in the data,” says Torroba Hennigen.
SymGen then resolves each reference using a rule-based tool that copies the corresponding text from the data table into the model’s response.
“This way we know it’s an exact copy, so there won’t be any errors in the part of the text that corresponds to the actual data variable,” Shen adds.
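A minimal sketch of such a rule-based resolver, continuing the hypothetical format above (this is an illustration, not SymGen’s implementation):

```python
import re

# Hypothetical table and symbolic response, as in the sketch above.
data_table = {"team_home": "Portland Trailblazers", "score_home": "126"}
symbolic_response = "The {team_home} scored {score_home} points."

def resolve(response: str, table: dict) -> str:
    # Replace each {cell_name} placeholder with a verbatim copy of that cell's
    # value, so the resolved text matches the source data exactly.
    return re.sub(r"\{(\w+)\}", lambda m: table[m.group(1)], response)

print(resolve(symbolic_response, data_table))
# Output: The Portland Trailblazers scored 126 points.
```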
Improved validation
The model can produce symbolic responses because of how it is trained. Large language models are fed huge amounts of data from the internet, and some of those data are recorded in a “placeholder format” in which codes stand in for actual values.
When SymGen asks the model to generate a symbolic response, it uses a similar structure.
“We design prompts in specific ways to leverage the capabilities of the LLM,” adds Shen.
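For instance, a prompt in this spirit might look like the following; the wording is purely illustrative and is not the prompt the researchers use:

```python
# Purely illustrative prompt, not the researchers' actual prompt wording.
prompt = (
    "You are given a table of named cells describing a basketball game.\n"
    "Write a game summary, but whenever you state a fact that comes from the table,\n"
    "write the cell's name in curly braces (e.g., {team_home}) instead of its value.\n\n"
    "Table:\n"
    "team_home: Portland Trailblazers\n"
    "score_home: 126\n"
)
```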
During a user study, most participants said SymGen made it easier to verify LLM-generated text. They could validate the model’s responses about 20 percent faster than with standard methods.
However, SymGen is limited by the quality of the source data. The LLM could cite an incorrect variable, and a human verifier may be none the wiser.
Additionally, the user must have source data in a structured format, such as a table, to feed into SymGen. Right now, the system works only with tabular data.
Moving forward, the researchers are enhancing SymGen so it can handle arbitrary text and other forms of data. With that capability, it could, for example, help verify portions of AI-generated legal document summaries. They also plan to test SymGen with physicians to study how it identifies errors in AI-generated clinical summaries.
This work is funded in part by Liberty Mutual and the MIT Quest for Intelligence Initiative.