Wednesday, March 11, 2026

A new “Markovian Thinking” technique opens the way to AI reasoning over millions of tokens


Mila researchers have proposed a new technique that makes large language models (LLMs) far more efficient at complex reasoning. Called Markovian Thinking, the approach allows an LLM to engage in very long reasoning without incurring the prohibitive computational costs that currently limit such tasks.

The team’s implementation, a framework called Delethink, splits the reasoning chain into fixed-size chunks, sidestepping the scaling problem that plagues very long LLM responses. Preliminary estimates suggest that for a 1.5B-parameter model, the method can cut training costs by more than two-thirds compared with standard approaches.

The quadratic curse of long-chain reasoning

For an LLM to solve a complex problem, it often needs to generate a long series of intermediate “thinking” tokens, known as a chain of thought (CoT). In recent years, researchers have found that using reinforcement learning (RL) to train models to produce longer CoTs (sometimes called LongCoT) significantly improves their reasoning capabilities.

However, the standard method has a critical flaw: the AI’s “state” (the prompt plus all the reasoning tokens generated so far) grows with every new reasoning token. For modern transformer-based models, this means the cost of computation grows quadratically as the reasoning chain lengthens, making it prohibitively expensive to train models for very complex tasks.
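The quadratic-versus-linear gap is easy to see with a toy calculation. The sketch below is purely illustrative (it counts the number of context tokens attended to at each generation step, ignoring carryover overhead and real hardware effects); the 96,000 and 8,000 token figures echo the lengths discussed in the article, not measurements from the paper:

```python
# Back-of-the-envelope illustration (not figures from the paper): count the
# attention "work" at each generation step as the number of context tokens
# attended to, and compare one ever-growing LongCoT trace against the same
# total thinking length split into fixed-size chunks.

def longcot_cost(total_tokens: int) -> int:
    # Context grows by one token per step -> total work is quadratic.
    return sum(range(1, total_tokens + 1))

def chunked_cost(total_tokens: int, chunk: int) -> int:
    # Context resets every `chunk` tokens -> total work grows linearly.
    full, rest = divmod(total_tokens, chunk)
    return full * sum(range(1, chunk + 1)) + sum(range(1, rest + 1))

total, chunk = 96_000, 8_000
ratio = longcot_cost(total) / chunked_cost(total, chunk)
print(f"LongCoT does ~{ratio:.0f}x more attention work at this length")
```

Doubling the thinking length roughly quadruples the LongCoT total but only doubles the chunked total, which is why the gap widens as reasoning gets longer.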

Most current attempts to manage these costs focus on limiting how much the model thinks, implicitly favoring shorter solutions or ending the process sooner. While such methods provide some relief, the Mila researchers note that they still operate within the LongCoT framework and are therefore fundamentally bound by its quadratic nature.

Instead of trying to control the growth of computation, Mila created an RL environment that avoids the quadratic problem entirely. As co-author Amirhossein Kazemnejad explained, the goal is to enable capabilities such as multi-week reasoning and scientific discovery. “This regime (and the RL necessary to enable such capabilities) is not supported by the current LongCoT paradigm due to the quadratic computational cost,” he said.

Thinking in chunks with Delethink

The researchers’ solution is a paradigm they call the “Markovian Thinker,” in which the model reasons while the size of its reasoning context window stays constant. The core idea is to restructure RL so that “how long the model thinks” is decoupled from “how much context it must process.” Done correctly, Markovian Thinking turns the quadratic growth problem into linear compute and constant memory requirements for LLM reasoning.

The researchers put this paradigm into practice with Delethink, which forces the model to reason in a sequence of fixed-size chunks, e.g., 8,000 tokens at a time. Within each chunk, the model reasons normally, using the classic attention mechanism. When it hits the chunk limit, however, the framework resets the context, creating a new prompt that contains the original query plus a compact “carryover” from the previous chunk. This carryover might be, for example, the last few tokens of the preceding stretch of CoT, or a summary of its most important results.
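A minimal sketch of what such a chunked loop could look like, assuming a generic `generate(prompt, max_tokens)` text-completion call; the function names, the `<final>` stop marker, and the carryover size are all illustrative, not the paper’s actual interface:

```python
# Hypothetical sketch of Delethink-style chunked inference. `generate` is any
# LLM completion function; nothing here is the framework's real API.

CHUNK = 8_000      # fixed reasoning budget per chunk, in tokens
CARRYOVER = 512    # rough token budget carried into the next prompt
DONE = "<final>"   # illustrative marker the model emits with its answer

def delethink_reason(generate, query: str, max_chunks: int = 16) -> str:
    carryover = ""
    for _ in range(max_chunks):
        # Each chunk sees only the original query plus a compact carryover,
        # so the attended context never grows with total thinking length.
        prompt = f"{query}\n\n[previous progress]\n{carryover}"
        chunk_text = generate(prompt, max_tokens=CHUNK)
        if DONE in chunk_text:
            return chunk_text.split(DONE, 1)[1]
        # Carry over only the tail of the chunk: the model must learn to
        # pack its task-critical state into these final tokens.
        carryover = chunk_text[-CARRYOVER * 4:]  # ~4 chars/token heuristic
    return carryover
```

The key design choice is that the loop never concatenates chunks: the only channel between one chunk and the next is the carryover, which is exactly what makes the per-step cost constant.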

This restructuring of the problem forces the model to learn to embed a summary of its progress, a “textual Markov state,” into the carryover in order to continue its reasoning in the next chunk. This addresses a common concern: whether the model can remember important details from earlier steps.

According to Kazemnejad, the model learns what to remember. “Through training, the model is forced to learn to maintain a task-critical state,” he explained. He added an important practical clarification: the original input prompt, including any documents and contextual data attached to it, is not modified. “Our approach focuses on the reasoning phase and does not modify the prompts,” he said.

Delethink in action

To test their approach, the researchers trained R1-Distill-1.5B with Delethink on a dataset of competition-level math problems and then evaluated it on several benchmarks. The model was trained to reason for up to 24,000 tokens, but in fixed chunks of 8,000 tokens.

The researchers compared this against models trained with the standard LongCoT-RL method. Their findings show that the Delethink-trained model could reason up to 24,000 tokens and matched or outperformed the LongCoT model trained with the same 24,000-token budget on math benchmarks. On other tasks, such as coding and PhD-level questions, Delethink also matched or slightly outperformed its LongCoT counterpart. “Overall, these results indicate that Delethink uses its thinking tokens as effectively as LongCoT-RL with reduced computational power,” the researchers write.

The benefits become even more apparent when scaling beyond the training budget. While LongCoT-trained models quickly plateaued at their training limit, the Delethink-trained model kept improving. Some math problems, for example, were only solved after the model thought for up to 140,000 tokens, far beyond its 24,000-token training budget. This linear-compute advantage matters for enterprise applications: the researchers estimate that training a model to an average thinking length of 96,000 tokens would require 27 H100-GPU-months with LongCoT, compared to just 7 with Delethink.

This efficiency extends directly to inference, a major operating expense for most enterprises. “Models trained with Markovian Thinking use the same style of reasoning (Delethink traces) at test time, which provides the same linear-compute and constant-memory advantages after training,” Kazemnejad said. He gave a practical example: an AI agent could “debug a large codebase and think for a long time… which obviously significantly reduces costs compared to the conventional LongCoT approach.”

Interestingly, the researchers found that off-the-shelf reasoning models, even without special training, already show some ability to think in a Markovian way. This finding has direct practical implications for developers. “In practice, this means that – without Delethink-RL – these models can already support the Delethink tracing wrapper and perform competitively with LongCoT in our tested workloads,” Kazemnejad said.

Their experiments with larger models, such as GPT-OSS 120B, showed solid performance with Delethink across a range of complex tasks. This latent ability provides a strong starting point for RL training and helps explain why the method is so effective. “Altogether, these results suggest that Delethink is compatible with and scales with state-of-the-art models,” the researchers conclude.

The success of Markovian Thinking shows that “next-generation reasoning models can think with millions of tokens,” the researchers note. This opens the door to fundamentally new AI capabilities that go beyond current limitations.

“Markovian thinking… opens the way for models that can ‘think’ over very long horizons, which we see as a necessary step toward eventual scientific discovery,” Kazemnejad said. “Our approach removes a key bottleneck and can enable training for tasks with much longer time horizons, enabling next-generation capabilities.”
