Teach AI agents to ask better questions by playing ‘Battleship’

In 2026, the hype around AI agents will be louder than ever before. These semi-autonomous programs can “think” and perform well-defined tasks in areas such as customer service and software development, typically using language models (LMs). However, fields such as medical diagnostics and scientific discovery require agents to search a wide range of possible solutions in uncertain environments.

Researchers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) and Harvard University’s School of Engineering and Applied Sciences (SEAS) took a closer look at LMs to understand their main problems in high-stakes situations. Their testbed: “Battleship,” a classic guessing game that has helped cognitive scientists study how people search for information.

Scientists from CSAIL and SEAS added a twist by changing the game so that players ask and answer questions in natural language. In “Collaborative Battleship,” one participant is the “captain,” who asks questions to work out where the hidden ships are, while a teammate acts as the “observer,” answering those questions in real time.

First, the researchers asked more than 40 people to play together, collecting their yes-or-no questions and answers to build the “BattleshipQA” dataset. These results provided a helpful point of comparison as the team tested state-of-the-art LMs (such as GPT-5) and smaller models (such as Llama 4 Scout) on the game. Without training the models in advance, they found that the best LMs could “beat” humans at “Battleship” (that is, complete the game in fewer turns), but smaller systems were much less rational.

The main problem was that many models are simply not adept at asking useful questions. To get LMs to ask questions that reveal more information about the hidden ships, the researchers gave each model a Monte Carlo inference strategy that estimates the probability that different options are correct given each answer. The result: AI models of any size that can beat typical “Battleship” players.

Perhaps the most striking result was the gain made by Llama 4 Scout. As a relatively small LM, it defeated humans only 8 percent of the time. With the improved inference strategy, however, the model achieved an 82 percent win rate against humans at “Battleship.” This careful and proficient style of questioning also enabled the model to outperform the frontier model (GPT-5) at roughly 1 percent of its cost.

In addition to this improvement, the researchers narrowed the gap between humans and LMs in answering questions. While GPT-5 was a reliable observer, helping models complete games faster, smaller systems had a bad habit of giving incorrect answers about where ships were hidden. The models saw an average 15 percent increase in accuracy when they started turning questions into code that explicitly told them how to verify their answers (for example, allowing the model to quickly scan an area when asked if there was a ship).

“Modern language models are primarily optimized for answering complex queries, but it is less clear whether they learn to ask good questions,” says MIT graduate student and CSAIL researcher Gabriel Grand SM ’23, who is the lead author of a paper about the work. “Our work shows that asking informative questions depends on the ability to predict and simulate the world. We found that when we give agents access to a ‘world model,’ they ask better questions and make discoveries more effectively.”

A major upgrade for asking and answering questions

The team focused primarily on getting LMs to ask better questions. When implementing Monte Carlo inference strategies, LMs treat potential guesses as individual particles. Those that are more consistent with each of the observer’s responses gain weight, a bit like balls that inflate or deflate on each turn. With this more calculated and adaptive approach, the captain was able to ask questions that drew much more information from the observer.
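The particle idea can be sketched in a few lines of Python. This is a minimal illustration, not the researchers' implementation: the 4x4 board, the single two-cell ship, the `NOISE` parameter, and the function names `sample_ship`, `reweight`, and `in_top_half` are all assumptions made for the example. Each particle is one guess about where the ship sits, and its weight is scaled up or down depending on whether it agrees with the observer's answer.

```python
import random

SIZE = 4     # hypothetical board size, far smaller than a real game
NOISE = 0.1  # assumed chance that the observer's answer is wrong

def sample_ship(rng):
    """Sample one hypothesis: the cells of a single two-cell ship."""
    if rng.random() < 0.5:  # horizontal placement
        r, c = rng.randrange(SIZE), rng.randrange(SIZE - 1)
        return frozenset({(r, c), (r, c + 1)})
    r, c = rng.randrange(SIZE - 1), rng.randrange(SIZE)  # vertical placement
    return frozenset({(r, c), (r + 1, c)})

def reweight(particles, weights, question, answer, noise=NOISE):
    """Scale each particle by its agreement with the answer, then normalize."""
    scaled = [
        w * ((1 - noise) if question(p) == answer else noise)
        for p, w in zip(particles, weights)
    ]
    total = sum(scaled)
    return [w / total for w in scaled]

def in_top_half(ship):
    """Hypothetical question: 'Is any part of the ship in the top two rows?'"""
    return any(r < 2 for r, _ in ship)

rng = random.Random(0)
particles = [sample_ship(rng) for _ in range(500)]
weights = [1 / len(particles)] * len(particles)

# The observer answers "no", so particles in the top half deflate.
weights = reweight(particles, weights, in_top_half, answer=False)
belief_top = sum(w for p, w in zip(particles, weights) if in_top_half(p))
print(f"Belief that the ship touches the top half: {belief_top:.3f}")
```

Keeping `NOISE` above zero means that a single wrong answer shrinks a particle's weight rather than eliminating it, which is one simple way to tolerate an unreliable observer.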

The researchers then turned to the widely used Python programming language to support AI observers. Each question asked by the captain was automatically converted into code. For example, a question like “Is there a ship in the first column that spans two rows?” turns into instructions for the LM observer to search that area and check how far the ship extends. By giving the model clear directions in a language it understood particularly well, each system was much more likely to provide correct answers. For example, the lightweight GPT-4o-mini system saw a nearly 30 percent increase in performance, and even the large Claude 4 Opus model jumped by about eight points.
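To make this concrete, here is a hedged sketch of what such generated code might look like for the example question. The board encoding (letters for ships, `.` for water), the helper `ship_cells`, and the function name are assumptions for illustration, as is the reading of “spans two rows” as “covers exactly two distinct rows”; the code an LM actually produces in the study may look quite different.

```python
# Hypothetical 4x4 board: letters are ship IDs, '.' is water.
BOARD = [
    "A..B",
    "A..B",
    "...B",
    "CC..",
]

def ship_cells(board):
    """Map each ship ID to the list of (row, column) cells it occupies."""
    cells = {}
    for r, row in enumerate(board):
        for c, ch in enumerate(row):
            if ch != ".":
                cells.setdefault(ch, []).append((r, c))
    return cells

def first_column_ship_spanning_two_rows(board):
    """Answer: 'Is there a ship in the first column that spans two rows?'"""
    for cells in ship_cells(board).values():
        touches_first_column = any(c == 0 for _, c in cells)
        rows_covered = {r for r, _ in cells}
        # Assumed reading: the ship covers exactly two distinct rows.
        if touches_first_column and len(rows_covered) == 2:
            return True
    return False

print("yes" if first_column_ship_spanning_two_rows(BOARD) else "no")
```

Because the answer comes from running the function on the true board rather than from the model's free-form guess, the observer's reply is only as ambiguous as the translation from question to code.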

“In this field, ‘automatic formalization’ strategies, in which LMs generate code to validate their solutions, have been very successful,” says senior author Jacob Andreas, associate professor of electrical engineering and computer science at MIT and a CSAIL principal investigator. “What’s most exciting about this work is that it opens the possibility of using these techniques to generate better solutions, primarily by improving the information exploration and gathering capabilities of LMs. We’re excited to extend this work from scientific fields to applications such as coding and math problem solving.”

Let’s play something else

But how would this approach work in other board games? The team tested their newly equipped LMs on “Guess Who?”, where large and small models skillfully whittled down 100 options to correctly guess which hidden character was chosen. Llama 4 Scout was successful 30 percent of the time, but after adjustments by Grand and his colleagues, it completed the task in more than 72 percent of runs. Meanwhile, GPT-4o increased from 62 percent to 90 percent. GPT-5 served as the observer in every game so that questions were answered as accurately as possible.
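One way to see why some questions whittle down the options faster than others is to score each question by its expected information gain. The sketch below is a simplification under stated assumptions: every one of the 100 candidates is equally likely, answers are never wrong, and the attributes and their frequencies (`PERIODS`) are invented for illustration. Under those assumptions, the expected gain of a yes-or-no question is the binary entropy of the fraction of candidates that would answer “yes.”

```python
import math

def binary_entropy(p):
    """Entropy in bits of a yes/no outcome with P(yes) = p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def expected_information_gain(candidates, attribute):
    """Average bits learned from asking 'Does the character have <attribute>?'"""
    p_yes = sum(attribute in c for c in candidates) / len(candidates)
    return binary_entropy(p_yes)

# Hypothetical attributes: character i has the attribute when i % period == 0.
PERIODS = {"glasses": 2, "beard": 4, "hat": 10}
candidates = [
    frozenset(name for name, period in PERIODS.items() if i % period == 0)
    for i in range(100)
]

gains = {a: expected_information_gain(candidates, a) for a in PERIODS}
best = max(gains, key=gains.get)

# Suppose the observer answers "yes" to the best question; keep matching candidates.
remaining = [c for c in candidates if best in c]
for attribute, gain in gains.items():
    print(f"{attribute}: {gain:.3f} bits")
print(f"Ask about {best}; {len(remaining)} candidates remain after a 'yes'")
```

A question that splits the candidates evenly is worth a full bit, while one that applies to only a handful of characters usually tells the player very little, which is the intuition behind asking “better” questions.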

While LMs made promising progress in both games, there is still a lot of work to be done. For example, models still have more difficulty than humans in answering complex questions. OpenAI researcher, recent Harvard graduate, and co-author Valerio Pepe adds that “GPT-5 can beat the average ‘Battleship’ player, and with our methods it does a hair better. However, for all models it is still difficult to beat experienced players, unlike chess, where even the best players cannot cope with artificial intelligence systems.”

The researchers’ findings show that AI agents have untapped potential for needle-in-a-haystack discoveries: navigating a vast space of options to find a rare solution to a scientific challenge. While improved information-gathering skills would make them excellent research assistants, for example in identifying the molecular structure of a compound, the scientists caution that “Collaborative Battleship” is a fairly basic testbed. They would like to test LMs in more complex settings where systems need to consider many more options.

Grand also plans to pair humans with AI models to see whether they work better together. The models could also benefit from more refined game simulations, and with greater computational power, LMs would have more advanced inference capabilities for predicting how a game will unfold.

“As AI systems become increasingly agentic, the most difficult problems turn out to be social: tracking common ground, resolving misunderstandings, and adapting to different partners over time,” says Robert Hawkins, an assistant professor of linguistics at Stanford University, who was not involved in the paper. “This work elegantly captures these phenomena in a controlled, collaborative environment and provides compelling evidence that the real bottleneck for AI agents is not just the computation of optimal questions, but the pragmatic reasoning needed to get the most out of the answers.”

Grand and Pepe co-wrote the paper with two CSAIL principal investigators: MIT associate professor Jacob Andreas and MIT professor Joshua Tenenbaum. Their work was supported in part by the MIT Siegel Family Quest for Intelligence, the MIT-IBM Watson AI Lab, the FinTechAI@CSAIL initiative, a Sloan research fellowship, Intel, the Air Force Office of Scientific Research, the Defense Advanced Research Projects Agency, the Office of Naval Research, and the National Science Foundation. They presented their paper as an oral presentation at the International Conference on Learning Representations (ICLR) in April.
