Large language models, such as those that power ChatGPT, have demonstrated impressive performance on tasks like drafting legal documents, analyzing the sentiment of customer reviews, or translating documents into different languages.
These machine learning models typically employ only natural language to process information and answer queries, which can make it hard for them to perform tasks requiring numerical or symbolic reasoning.
For example, a large language model might be able to memorize and recite a list of recent U.S. presidents and their birth dates, but that same model could fail if asked the question “Which U.S. presidents elected after 1950 were born on a Wednesday?” (The answer is Jimmy Carter.)
Researchers from MIT and elsewhere have proposed a new technique that enables large language models to solve natural language, math and data analysis, and symbolic reasoning tasks by generating programs.
Their approach, called natural language embedded programs (NLEPs), involves prompting a language model to create and execute a Python program that solves a user’s query, and then to output the solution as natural language.
They found that NLEPs enabled large language models to achieve higher accuracy on a wide range of reasoning tasks. The approach is also generalizable, meaning that one NLEP prompt can be reused for multiple tasks.
NLEPs also improve transparency, since a user can inspect the program to see exactly how the model reasoned about the query and fix the program if the model gave an incorrect answer.
“We want AI to perform complex reasoning in a way that is transparent and trustworthy. There is still a long way to go, but we have shown that combining the capabilities of programming and natural language in large language models is a very good potential first step toward a future where humans can fully understand and trust what is happening inside their AI model,” says Hongyin Luo PhD ’22, an MIT postdoc and co-lead author of a paper on NLEPs.
Luo is joined on the paper by co-lead authors Tianhua Zhang, a graduate student at the Chinese University of Hong Kong, and Jiaxin Ge, a student at Peking University; Yoon Kim, an assistant professor in MIT’s Department of Electrical Engineering and Computer Science and a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL); senior author James Glass, senior research scientist and head of the Spoken Language Systems Group at CSAIL; and others. The research will be presented at the Annual Conference of the North American Chapter of the Association for Computational Linguistics.
Solving problems using programs
Many popular large language models work by predicting the next word or token given some natural language input. While models such as GPT-4 can be used to write programs, they embed those programs within natural language, which can lead to errors in the program’s reasoning or results.
With NLEPs, the MIT researchers took the opposite approach: they prompt the model to generate a step-by-step program entirely in Python code, and then embed the necessary natural language inside the program.
An NLEP is a problem-solving template with four steps. First, the model calls the packages, or functions, it will need to solve the task. Step two involves importing a natural language representation of the knowledge the task requires (such as a list of U.S. presidents’ birthdays). For step three, the model implements a function that calculates the answer. And in the final step, the model outputs the result as a line of natural language, with an automatic data visualization if needed.
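To make the template concrete, here is a minimal sketch of what a generated NLEP for the presidents question above might look like. The function name and the abbreviated birthday data are illustrative assumptions, not the paper’s actual prompt or output:

```python
# Minimal illustrative NLEP; the data and names are assumptions, not the paper's output.

# Step 1: call the packages the task will need.
from datetime import date

# Step 2: import a natural language representation of the required knowledge
# (abbreviated here to four presidents for readability).
presidents = {
    # name: (first year elected, birth date)
    "Dwight D. Eisenhower": (1952, date(1890, 10, 14)),
    "John F. Kennedy": (1960, date(1917, 5, 29)),
    "Jimmy Carter": (1976, date(1924, 10, 1)),
    "Ronald Reagan": (1980, date(1911, 2, 6)),
}

# Step 3: implement a function that calculates the answer.
def born_on_weekday_after(year, weekday):
    # weekday follows datetime.weekday(): Monday == 0, ..., Sunday == 6.
    return [name for name, (elected, born) in presidents.items()
            if elected > year and born.weekday() == weekday]

# Step 4: output the result as a line of natural language.
WEDNESDAY = 2
print("U.S. presidents elected after 1950 who were born on a Wednesday:",
      ", ".join(born_on_weekday_after(1950, WEDNESDAY)))
```

Run as written, this sketch prints “Jimmy Carter,” matching the answer above.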
“It’s like a digital calculator that always gives you the correct result as long as the program is correct,” Luo says.
A user can easily inspect the program and fix any errors in the code directly, without needing to rerun the entire model to troubleshoot the problem.
The approach is also more efficient than some other methods: if a user has many similar questions, they can generate one core program and then replace certain variables without having to run the model repeatedly.
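With the sketch above, for instance, a hypothetical follow-up question like “Which presidents elected after 1900 were born on a Tuesday?” reuses the same program with new values, and no new model run:

```python
# Same base program, different variables; the model does not run again.
TUESDAY = 1
print("U.S. presidents elected after 1900 who were born on a Tuesday:",
      ", ".join(born_on_weekday_after(1900, TUESDAY)))
```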
To prompt the model to generate an NLEP, the researchers give it an overall instruction to write a Python program, provide two NLEP examples (one involving math and one involving natural language), and include one test question.
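In code, assembling such a prompt might look like the schematic below. The instruction wording and the placeholder example strings are assumptions for illustration, not the paper’s actual prompt:

```python
# Schematic only: the instruction text and the example bodies are placeholders.
INSTRUCTION = "Write and run a step-by-step Python program that answers the question."
MATH_EXAMPLE = "..."      # a full worked NLEP for a math question
LANGUAGE_EXAMPLE = "..."  # a full worked NLEP for a natural language question

def build_nlep_prompt(test_question):
    # One general instruction + two fixed examples + the new question.
    return "\n\n".join(
        [INSTRUCTION, MATH_EXAMPLE, LANGUAGE_EXAMPLE, "Question: " + test_question]
    )
```

Because the two examples stay fixed, the same prompt can serve many tasks, which is the point Luo makes below.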
“Usually, when people do this kind of few-shot prompting, they still need to design prompts for every task. We found that we can have one prompt for many tasks, because it is not a prompt that teaches an LLM to solve one problem, but a prompt that teaches an LLM to solve many problems by writing a program,” Luo says.
“Having language models reason with code opens up many opportunities for tool use, output validation, a more structured understanding of a model’s capabilities and way of thinking, and more,” says Leonid Karlinsky, principal scientist at the MIT-IBM Watson AI Lab.
“There’s no magic here”
NLEPs achieved greater than 90 percent accuracy when prompting GPT-4 to solve a range of symbolic reasoning tasks, like tracking shuffled objects or playing the game of 24, as well as instruction-following and text classification tasks. The researchers found that NLEPs exhibited up to 30 percent greater accuracy than task-specific prompting methods. The method also showed improvements over open-source LLMs.
In addition to boosting the accuracy of large language models, NLEPs can also improve data privacy. Because NLEP programs are run locally, sensitive user data does not need to be sent to a company like OpenAI or Google for processing by a model.
Additionally, NLEPs can boost the performance of small language models without the need to retrain a model for a given task, which can be a costly process.
“There is no magic here. We don’t have a more expensive or fancy language model. All we do is generate programs instead of generating natural language, and we can make models perform much better,” Luo says.
However, NLEPs rely on the program generation capability of the model, so the technique does not work as well for smaller models that have been trained on limited datasets. In the future, the researchers plan to study methods that could enable smaller language models to generate more effective NLEPs. In addition, they want to investigate the impact of prompt variations on NLEPs to improve the robustness of the model’s reasoning processes.
This research was supported, in part, by the Center for Perceptual and Interactive Intelligence of Hong Kong.