Thursday, December 26, 2024

Large language models do not behave like people, even though we might expect them to.

One of the things that makes large language models (LLMs) so powerful is the variety of tasks they can be applied to. The same machine-learning model that can help a graduate student draft an email can also help a clinician diagnose cancer.

However, the wide applicability of these models also makes them difficult to evaluate systematically. It would be impossible to create a benchmark dataset to test a model on every type of question it could be asked.

In a new paper, MIT researchers take a different approach. They argue that because humans decide when to deploy large language models, evaluating a model requires understanding how people form beliefs about its capabilities.

For example, a graduate student must decide whether the model will be helpful in drafting a particular email, and a clinician must determine in which cases the model would be best to use.

Building on this idea, the researchers created a framework for evaluating an LLM based on how well it aligns with a person's beliefs about how it will perform on a given task.

They introduce the human generalization function, a model of how people update their beliefs about an LLM's capabilities after interacting with it. They then assess how well LLMs align with this human generalization function.

Their results indicate that when models are misaligned with the human generalization function, a user may be over- or under-confident about where to deploy the model, which can cause it to fail unexpectedly. Moreover, because of this misalignment, more capable models tend to perform worse than smaller models in high-stakes situations.

“These tools are exciting because they’re universal, but because they’re universal, they’re going to work with people, so we have to take the human factor into account,” says study co-author Ashesh Rambachan, assistant professor of economics and principal investigator in the Laboratory for Information and Decision Systems (LIDS).

Rambachan was joined on the paper by lead author Keyon Vafa, a postdoc at Harvard University; and Sendhil Mullainathan, an MIT professor in the departments of electrical engineering and computer science and economics, and a LIDS member. The research will be presented at the International Conference on Machine Learning.

Human generalization

When we interact with other people, we form beliefs about what we think they do and don't know. For example, if your friend is quick to correct other people's grammar, you might generalize and think they would also excel at constructing sentences, even though you've never asked them a question about sentence structure.

“Language models often seem so human. We wanted to show that this power of human generalization is also present in the way people form beliefs about language models,” Rambachan says.

As a starting point, the researchers formally defined the human generalization function as a process of asking questions, observing how a person or an LLM responds, and then making inferences about how that person or model would answer related questions.

If someone sees that an LLM can correctly answer questions about matrix inversion, they might also assume it can handle questions about simple arithmetic. A model that is misaligned with this function (one that does not perform well on questions a human expects it to answer correctly) is likely to fail when deployed.
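
The following is a minimal sketch, in Python, of the kind of mapping the human generalization function describes: observe one question and its outcome, then form a belief about a related question. The Observation class, the word-overlap similarity, and the weighted update rule are illustrative assumptions, not the authors' formalization.

```python
# Illustrative sketch of the human generalization idea: after observing whether
# a responder (person or LLM) answered one question correctly, a human forms a
# belief about whether it will answer a related question correctly.
from dataclasses import dataclass


@dataclass
class Observation:
    question: str   # the question that was asked
    correct: bool   # whether the responder answered it correctly


def similarity(q1: str, q2: str) -> float:
    """Crude stand-in for how related two questions seem to a person (0 to 1)."""
    a, b = set(q1.lower().split()), set(q2.lower().split())
    return len(a & b) / max(len(a | b), 1)


def human_generalization(obs: Observation, new_question: str,
                         prior: float = 0.5) -> float:
    """Belief that the responder answers new_question correctly, after seeing obs.
    The more related the questions appear, the more the observed outcome moves
    the belief away from the prior."""
    w = similarity(obs.question, new_question)
    outcome = 1.0 if obs.correct else 0.0
    return (1 - w) * prior + w * outcome


# Seeing a correct answer about matrix inversion raises the belief that a
# related question will also be answered correctly.
obs = Observation("How do you invert a 2x2 matrix?", correct=True)
print(human_generalization(obs, "What is the inverse of this 2x2 matrix?"))
```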

With this formal definition in hand, the researchers designed a survey that aimed to measure how people generalized when interacting with LLMs and other people.

They showed survey participants questions that a given person or LLM answered correctly or incorrectly, and then asked whether they thought that person or LLM would answer a related question correctly. Through the survey, they generated a dataset of nearly 19,000 examples of how people generalize about LLM performance across 79 diverse tasks.
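
One hypothetical way to represent a single entry in such a dataset is sketched below; the field names and structure are assumptions made for illustration, not the authors' published schema.

```python
# Hypothetical record for one survey example: a participant sees how a responder
# did on one question and predicts how it will do on a related target question.
from dataclasses import dataclass


@dataclass
class GeneralizationExample:
    task: str                  # one of the 79 tasks
    responder: str             # "human" or the name of the LLM being judged
    observed_question: str     # question shown together with its outcome
    observed_correct: bool     # whether the responder answered it correctly
    target_question: str       # related question the participant judges
    predicted_correct: bool    # participant's prediction for the target question
    actual_correct: bool       # how the responder actually did on the target
```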

Measuring misalignment

Participants did quite well when asked whether a human who answered one question correctly would also answer a related question correctly, but they were significantly worse at generalizing about the performance of LLMs.

“Human generalization is applied to language models, but it doesn’t work because these language models don’t actually show patterns of competence like humans do,” Rambachan says.

People were also more likely to update their beliefs about an LLM when it answered questions incorrectly than when it answered questions correctly. They also tended to believe that an LLM's performance on simple questions would have little bearing on its performance on more complex questions.

In situations where people put more weight on incorrect answers, simpler models outperformed very large models such as GPT-4.
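
The asymmetry described above can be illustrated with a small sketch in which a belief moves toward zero faster after an incorrect answer than it moves toward one after a correct answer; the specific weights are assumptions, not the paper's model.

```python
# Illustrative asymmetric belief update: errors are weighted more heavily than
# correct answers, so one mistake can undo several successes.
def update_belief(prior: float, correct: bool,
                  up_weight: float = 0.2, down_weight: float = 0.5) -> float:
    """Move the belief toward 1 after a correct answer and toward 0 after an
    incorrect one, taking a larger step for errors (down_weight > up_weight)."""
    if correct:
        return prior + up_weight * (1.0 - prior)
    return prior - down_weight * prior


belief = 0.5
for outcome in [True, True, False]:
    belief = update_belief(belief, outcome)
    print(round(belief, 3))   # 0.6, 0.68, 0.34: one error outweighs two correct answers
```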

“Improving language models can actually trick people into thinking they can do well on related questions when in fact they can’t,” he says.

One possible explanation for why people are worse at generalizing about LLMs could be their novelty: people have far less experience interacting with LLMs than with other people.

“Going forward, it’s possible that we’ll get better simply by interacting with language models more often,” he says.

To that end, the researchers want to conduct further studies of how people's beliefs about LLMs evolve over time as they interact with a model. They also want to explore how human generalization could be incorporated into the development of LLMs.

“When we first train these algorithms or try to update them based on human feedback, we need to incorporate human generalization into how we think about measuring performance,” he says.

In the meantime, the researchers hope their dataset can serve as a benchmark for comparing LLM performance against the human generalization function, which could help improve the performance of models deployed in real-world situations.
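
A hedged sketch of such a benchmark comparison might simply score how often an LLM's actual performance matches what a participant predicted for it, reusing the hypothetical record fields from the earlier sketch; this is not the authors' published code.

```python
# Illustrative alignment metric: the fraction of survey examples where the
# participant's prediction about the target question matched the responder's
# actual performance (fields as in the hypothetical GeneralizationExample).
from typing import Iterable


def alignment_score(examples: Iterable) -> float:
    """Share of examples where predicted_correct equals actual_correct."""
    examples = list(examples)
    if not examples:
        return 0.0
    matches = sum(e.predicted_correct == e.actual_correct for e in examples)
    return matches / len(examples)
```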

“I think the contribution of the paper is twofold. The first is practical: The paper exposes a critical problem with implementing LLMs for general consumer use. If people don’t have a good understanding of when LLMs will be accurate and when they will fail, they are more likely to see errors and perhaps be discouraged from continuing to use them. This highlights the problem of matching models to people’s understanding of generalization,” says Alex Imas, a professor of behavioral science and economics at the University of Chicago Booth School of Business, who was not involved in the work. “The second contribution is more fundamental: the lack of generalization to expected problems and domains helps us get a better picture of what models do when they solve a problem ‘correctly.’ This provides a test of whether LLMs ‘understand’ the problem they are solving.”

This research was funded in part by the Harvard Data Science Initiative and the Center for Applied Artificial Intelligence at the University of Chicago Booth School of Business.
