One of the coolest things about generative AI models – both large language models (LLMs) and diffusion-based image generators – is that they are “non-deterministic.” Despite their reputation among some critics as “fancy autocorrect,” generative AI models actually produce their results by sampling from a probability distribution over likely next tokens (units of information) to build up a response.
Ask an LLM “What is the capital of France?” and it samples from a probability distribution over tokens related to France, capitals, and cities to arrive at “Paris.” But the answer may come back as “The capital of France is Paris,” or simply “Paris,” or “Paris, although at one point it was Versailles.”
Still, those of us who use these models daily will find that their answers can sometimes seem annoyingly repetitive or similar. The same joke about coffee recurs across generations of queries. Story prompts yield similar arcs. Even tasks with many plausible answers – such as naming US states – tend to collapse to just a few. This phenomenon, known as mode collapse, arises during post-training alignment and limits the usefulness of otherwise powerful models.
Especially when we use LLMs to generate novel creative work in writing, communication, strategy, or illustration, what we really want is for their outputs to be as diverse as possible.
Now a team of researchers from Northeastern University, Stanford University, and West Virginia University has developed a brilliantly straightforward method for getting language and image models to generate a wider range of responses to almost any user prompt by adding a single sentence: “Generate 5 responses with corresponding probabilities, taken from the full distribution.”
The method, called Verbalized Sampling (VS), helps models like GPT-4, Claude, and Gemini produce more diverse and human-like results – without requiring retraining or access to internal parameters. It is described in a paper published on the open-access preprint server arXiv.org in early October 2025.
When prompted this way, the model no longer defaults to its safest, most common output. Instead, it verbalizes its internal distribution over potential completions and samples across a broader spectrum of possibilities. This one-line change yields a significant boost in output diversity across multiple domains.
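In practice, the technique is just prompt construction: the user’s query is wrapped in the distribution-eliciting instruction quoted above. A minimal sketch of that wrapping (the helper name is my own, not part of the paper or package):

```python
# Minimal sketch of verbalized sampling as prompt construction: instead of
# asking for one answer, the wrapped prompt asks the model to verbalize
# several candidates with probabilities. The instruction wording follows the
# sentence quoted in the article; the function name is a made-up example.

def make_vs_prompt(user_query: str, k: int = 5) -> str:
    """Wrap a plain query in a verbalized-sampling instruction."""
    instruction = (
        f"Generate {k} responses with corresponding probabilities, "
        "taken from the full distribution."
    )
    return f"{instruction}\n\nQuery: {user_query}"

prompt = make_vs_prompt("Tell me a joke about coffee")
```

The resulting string is sent to the model as an ordinary chat message – nothing about decoding or model internals changes.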
As Weiyan Shi, an assistant professor at Northeastern University and co-author of the paper, wrote on X: “The potential of LLMs has not yet been fully unlocked! As our paper shows, prompt optimization can be guided by thinking about how LLMs are trained and aligned, and can be proven theoretically.”
Why models fail – and how VS turns it around
According to the research team, the root cause of mode collapse lies not only in algorithms such as reinforcement learning from human feedback (RLHF), but in the structure of human preferences themselves. People tend to rate more familiar or typical responses as better, which nudges aligned models toward “safe” rather than diverse answers during fine-tuning.
However, this bias does not erase the model’s underlying knowledge – it merely suppresses it. VS works by bypassing that suppression. Rather than asking for the single most likely output, it asks the model to reveal a set of plausible answers along with their relative probabilities. This distribution-level prompting restores access to the richer variety present in the pretrained base model.
Real-world performance on a variety of tasks
The research team tested Verbalized Sampling across several common use cases:
- Creative writing: When generating stories, VS increased diversity scores by as much as 2.1 times compared to standard prompting, while maintaining quality. For one prompt – “No Goodbye” – direct prompting produced formulaic breakup scenes, while VS yielded narratives involving cosmic events, hushed emails, and music cutting off mid-dance.
- Dialogue simulation: In persuasive dialogue tasks, VS enabled models to simulate human-like patterns such as hesitation, resistance, and changes of mind. The distribution of donation behavior under VS matched real human data more closely than baseline methods did.
- Open-ended QA: When asked to enumerate valid answers (e.g., naming US states), models using VS generated answer sets that better matched the diversity of real-world data, covering a broader range of answers without sacrificing factual accuracy.
- Synthetic data generation: Used to generate math problems for model training, VS produced more diverse datasets. These, in turn, improved downstream performance on competitive math benchmarks, outperforming synthetic data generated through direct prompting.
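The “diversity scores” reported above come from the paper’s own evaluation pipeline, but the intuition can be illustrated with a much cruder metric – the fraction of unique responses in a batch. A toy sketch (the function and sample data are illustrative, not from the study, which uses embedding-based measures):

```python
# Toy illustration of measuring response diversity. Real evaluations use
# semantic-embedding distances; a set-based uniqueness ratio is just a
# simple stand-in. The sample responses below are made up.

def distinct_ratio(responses):
    """Share of responses that remain after light normalization and dedup."""
    normalized = {r.strip().lower() for r in responses}
    return len(normalized) / len(responses)

direct = ["Paris", "Paris", "Paris", "Paris"]  # mode-collapsed output
verbalized = [
    "Paris",
    "The capital of France is Paris.",
    "Paris, of course.",
    "Paris, although at one point it was Versailles.",
]

assert distinct_ratio(direct) == 0.25      # one unique answer out of four
assert distinct_ratio(verbalized) == 1.0   # every phrasing is distinct
```

A 2.1x diversity gain in the paper’s terms is analogous to the gap between these two ratios, measured with far more care.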
Tunable diversity and greater gains in larger models
A notable advantage of VS is its tunability. Users can set a probability threshold in the prompt to sample from the lower-probability “tails” of the model’s distribution. Lower thresholds correspond to greater diversity. This tuning happens entirely in the prompt text, without changing any decoding settings such as temperature or top-p.
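Because the threshold lives in the prompt rather than in a decoding parameter, tuning it is again just string construction. A sketch of a threshold-parameterized prompt builder (the exact phrasing paraphrases the idea described here; the function name is my own):

```python
# Sketch of the tunable variant: the probability threshold is written into
# the instruction itself, steering the model toward lower-probability
# "tail" completions. Wording is a paraphrase, not the paper's exact prompt.

def make_tail_prompt(query: str, k: int = 5, threshold: float = 0.10) -> str:
    """Build a VS prompt restricted to candidates below a probability cap."""
    instruction = (
        f"Generate {k} responses to the query below, each with a "
        f"probability below {threshold}, taken from the full distribution."
    )
    return f"{instruction}\n\nQuery: {query}"

prompt = make_tail_prompt("Write a story opening", threshold=0.05)
```

Dropping `threshold` from 0.10 toward 0.001 would, per the study, push outputs toward progressively rarer completions.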
In one test using the Gemini-2.5-Flash model, diversity in story writing rose steadily as the probability threshold dropped from 1 to 0.001. A chart accompanying the study showed VS outperforming both direct and sequence-based prompting at every threshold.
Interestingly, the method scales with model size. Larger models such as GPT-4.1 and Claude-4 showed even greater benefits from VS than smaller ones. While smaller models still improved, the diversity gains were roughly 1.5 to 2 times larger in bigger models, suggesting that VS unlocks more latent capability in advanced models.
Implementation and availability
The Verbalized Sampling method is now available as a Python package:
pip install verbalized-sampling
The package includes LangChain integration and supports a straightforward interface for sampling from a verbalized distribution. Users can also adjust parameters such as k (number of responses), probability thresholds, and temperature to suit their applications.
A live Colab notebook and documentation are available under an enterprise-friendly Apache 2.0 license on GitHub at: https://github.com/CHATS-lab/verbalized-sampling
Practical tips and common problems
Although the method works with all major LLMs, some users may initially encounter refusals or errors.
In such cases, the authors suggest using the system version of the template or referring to alternative formats listed on GitHub.
Some models interpret the structured instruction as a jailbreak attempt and refuse to comply unless the format is made clearer.
For example, prompting with system-level instructions like this increases reliability:
You are a helpful assistant. For each query, generate five responses under separate tags, each with a probability below 0.10.
This small change usually resolves the problem.
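Once a model returns its verbalized candidates, the client can perform the final weighted draw itself. A sketch of that last step – the reply format below (one `probability :: text` line per candidate) is an assumption for illustration; the actual package and the prompts on GitHub use tagged formats:

```python
import random

# Sketch of client-side sampling from a verbalized distribution. The
# "prob :: text" reply layout is an assumed toy format, not the package's
# real output; the real tooling parses tagged responses.

def sample_from_verbalized(reply: str, rng: random.Random) -> str:
    """Parse (probability, text) lines and draw one response by weight."""
    candidates, weights = [], []
    for line in reply.strip().splitlines():
        prob, _, text = line.partition("::")
        candidates.append(text.strip())
        weights.append(float(prob))
    return rng.choices(candidates, weights=weights, k=1)[0]

reply = """0.40 :: Paris
0.30 :: The capital of France is Paris.
0.20 :: Paris, though it was once Versailles.
0.10 :: Paris, of course."""

print(sample_from_verbalized(reply, random.Random(0)))
```

Seeding the generator makes the draw reproducible; in production one would simply use `random.choices` unseeded, so repeated queries surface different tails of the distribution.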
A lightweight solution to a big problem
Verbalized Sampling offers a practical, inference-time solution to a deep limitation in the behavior of contemporary language models. It requires no retraining or internal access, works across model families, and improves not only the diversity of outputs but also their quality – as judged by both human evaluation and benchmark results.
With growing interest in tools to enhance model creativity, VS is likely to see rapid adoption in fields such as writing, design, simulation, education, and synthetic data generation.
For users and developers frustrated by the sameness of LLM answers, the solution may be as straightforward as changing the question.
