In his 1927 paper “A Law of Comparative Judgment,” American psychologist L.L. Thurstone proposed that when people choose one option from among many alternatives, they pick the one that has the highest value to them, even if they cannot assign a specific number to that choice.
Thurstone was a pioneer of “psychometrics,” a field built on the idea that mental processes we cannot see can nevertheless be measured and quantified. His 1927 paper laid the foundation for so-called random utility models, which provide a mathematical framework for describing human preferences – information that, in turn, can be relied on to predict what people will choose in various hypothetical situations.
Random utility models (RUMs) are so named because they evaluate the “utility” or benefit that can be gained from a given choice – for example, deciding which book to read first among a stack of novels you brought home from the library. “These models are inherently random,” explains Gabriele Farina, assistant professor in MIT’s Department of Electrical Engineering and Computer Science (EECS) and principal investigator at the Laboratory for Information and Decision Systems (LIDS), “because people are different. Everyone has their own preferences, and even those preferences can change from time to time.” For example, someone who usually chooses coffee over tea in the morning and prefers tea after lunch may occasionally reverse that order.
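The idea of a “random” utility can be made concrete with a small simulation. The sketch below is only an illustration: the mean utilities, the Gaussian noise and the function names are all assumptions chosen for the example, not anything taken from the researchers’ work. Each option has a fixed average appeal plus a random fluctuation on each occasion, and the person picks whichever option comes out highest that time.

```python
import random

# Hypothetical average utilities for one person in the morning (illustrative values).
MEAN_UTILITY = {"coffee": 1.0, "tea": 0.0}


def choose(mean_utility, rng, noise_sd=1.0):
    """Return the option whose noisy utility is highest on this occasion."""
    realized = {
        option: mean + rng.gauss(0.0, noise_sd)  # average appeal + random fluctuation
        for option, mean in mean_utility.items()
    }
    return max(realized, key=realized.get)


def choice_shares(mean_utility, n=10_000, seed=0):
    """Estimate how often each option is chosen over n simulated occasions."""
    rng = random.Random(seed)
    counts = {option: 0 for option in mean_utility}
    for _ in range(n):
        counts[choose(mean_utility, rng)] += 1
    return {option: count / n for option, count in counts.items()}


shares = choice_shares(MEAN_UTILITY)
print(shares)
```

Under these assumed numbers, coffee wins most mornings, but tea is still chosen a meaningful fraction of the time, which is the kind of occasional reversal described above.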
To be sure, RUMs are often used in government and industry in situations with far weightier consequences than the choice of a hot (or iced) beverage. The models routinely help predict what people will choose in so-called counterfactual (“what if”) scenarios, such as: How will they get to work or school if a major thoroughfare is closed for construction? What routes and modes of transportation will they take? Or, if the city suddenly receives a windfall of $20 million in revenue, how should those funds be spent to maximize the common good?
Considering that RUMs have been with us for almost 100 years and have become more and more sophisticated over time, one might imagine that there is little room for improvement at this stage. However, that is not the case.
A paper presented in April at the International Conference on Learning Representations in Rio de Janeiro, Brazil, established fundamental results showing that much more can be learned from these models than has traditionally been believed. The paper’s authors are Yeshwanth Cherapanamjeri, a former MIT postdoc currently at Nanyang Technological University in Singapore; Farina, also a principal faculty member at MIT’s Operations Research Center (ORC); Constantinos Daskalakis, Avanessians Professor of Computer Science at MIT and member of the MIT Computer Science and Artificial Intelligence Laboratory; and Sobhan Mohammadpour, an MIT PhD student in computer science working in LIDS and EECS.
The group’s findings stem, in part, from flaws in the way RUMs are commonly estimated in practice, flaws that have persisted since Thurstone’s time. The data from which the models are estimated have largely been drawn from so-called pairwise comparisons: given A and B – whether movies on Netflix, competing products on Amazon.com, news on Google, etc. – which one would you choose? One reason this approach has become so common, Daskalakis explains, is that “assigning an exact numerical rating such as 4.37 to the benefits of a single item is very difficult. Whereas comparing two things and deciding which one you like more is cognitively much easier.” But that’s the problem, he adds. “With this way of assessing people’s preferences, taking into account only two things at a time, it is impossible to find correlations between multiple choices.”
The standard way of using RUMs assumes that the utilities derived from A and B are independent, but in fact they may be correlated, and that is worth knowing. If someone running for elected office discovers that a potential voter supports gun control, for example, there is a reasonable chance that the same person also supports government-sponsored child care. Similarly, an indie film fan may be partial to foreign films but less enthusiastic about Hollywood action blockbusters. “If a digital platform turns a blind eye to the existence of such correlations, it will not be able to estimate preferences very accurately,” notes Daskalakis. “And if Netflix regularly shows you an assortment of movies you’re not interested in, you may cancel your subscription.”
The MIT team proved that it is impossible to obtain information about correlations from pairwise comparisons alone. However, correlations can be detected when large numbers of people rank three alternatives in order of preference. The same information can also be obtained by combining best-of-three and best-of-two choices. In practice, explains Mohammadpour, “you can ask a group of people to rank three items. Then you can use a method we have developed to combine these individual rankings into one large model that can give us the bigger picture.”
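A toy simulation can show why three-way choices carry information that pairwise comparisons miss. This is a hedged sketch under simplifying assumptions, not the team’s algorithm: it assumes three items A, B and C with equal average appeal and Gaussian random utilities, where A and B share a common taste factor with an assumed correlation `rho` and C is independent. All names and values are hypothetical.

```python
import math
import random


def sample_utilities(rho, rng):
    """Draw one person's utilities for A, B, C; A and B have correlation rho."""
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must be between 0 and 1 in this sketch")
    shared = rng.gauss(0.0, 1.0)  # taste factor common to A and B
    a = math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
    b = math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
    c = rng.gauss(0.0, 1.0)  # independent of A and B
    return a, b, c


def choice_frequencies(rho, n=50_000, seed=0):
    """Estimate pairwise win rates and how often C is the best of three."""
    rng = random.Random(seed)
    a_beats_b = a_beats_c = b_beats_c = c_best = 0
    for _ in range(n):
        a, b, c = sample_utilities(rho, rng)
        a_beats_b += a > b
        a_beats_c += a > c
        b_beats_c += b > c
        c_best += c > max(a, b)
    return {
        "A>B": a_beats_b / n,
        "A>C": a_beats_c / n,
        "B>C": b_beats_c / n,
        "C best of three": c_best / n,
    }


for rho in (0.0, 0.9):
    print(rho, choice_frequencies(rho))
```

In this setup every pairwise win rate stays near one half whatever the value of `rho`, so pairwise data cannot tell the two worlds apart. The best-of-three share for C does move: it is about one third when A and B are independent and roughly 0.45 when `rho` is 0.9, because two strongly correlated items behave more like a single competitor.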
According to Farina, their research efforts are focused on the computational side of RUMs: developing algorithms that can extract preference information and determining how much data is needed to do so or, equivalently, how many experiments need to be performed. The good news, he says, is that efficient algorithms can actually be used for this purpose. The number of experiments required does not grow exponentially with the number of entries in the catalog or database under review.
“This paper represents a significant breakthrough,” comments Emma Frejinger, a computer scientist at the University of Montreal. “This mathematically proves why traditional data collection fails and shows that simply asking users for the best of three [choices] unlocks the ability to accurately train these powerful models. This discovery provides a very practical roadmap for collecting better data to drive more accurate optimizations.”
“Building utility models will remain a very active area,” emphasizes Daskalakis. “Just as RUMs have played a key role in the internet economy since the late 1990s, they have been and will remain key to adapting AI models in the future.” More importantly, he adds, “RUMs play a key role in the commercial viability and usability of large language models [LLMs].” During the training period, people are typically asked to rank various candidate LLM outputs, from which the models can better assess what type of text – in terms of tone, style and content – is preferred.
Given that we are constantly “besieged by a huge sea of options in so many different fields,” Daskalakis says, “it is impossible to even ask people to provide all of their personal preferences for all possible scenarios. Instead, you can build a model that predicts what people think about various possible outcomes. And you have to constantly improve and update your model in an iterative process until, hopefully, you get good predictions.”
