Ai2's new Olmo 3.1 extends reinforcement learning training for stronger reasoning benchmarks

The Allen Institute for AI (Ai2) recently released what it calls its most powerful model family yet, Olmo 3. The company has since continued to work on the models, extending its reinforcement learning (RL) training to create Olmo 3.1.

The new Olmo 3.1 models focus on performance, transparency and control for enterprises.

Ai2 has updated two of the three versions of Olmo 3: Olmo 3.1 Think 32B, the flagship model optimized for advanced research, and Olmo 3.1 Instruct 32B, designed for instruction following, multi-turn dialogue and tool use.

Olmo 3 has a third version, Olmo 3-Base, for programming, comprehension and math. It is also well suited to further fine-tuning.

Ai2 said that to upgrade the Olmo 3 Think 32B to Olmo 3.1, its researchers extended its best RL run with a longer training schedule.

“Following the original Olmo 3 launch, we have resumed our RL training cycle for the Olmo 3 32B Think, training for an additional 21 days on 224 GPUs with additional epochs on our Dolci-Think-RL dataset,” Ai2 said in a blog post. “This resulted in Olmo 3.1 32B Think, which delivers significant benefits in math, reasoning and instruction tests: improvements of over 5 points in AIME, over 4 points in ZebraLogic, over 4 points in IFEval and over 20 points in IFBench, as well as improved performance in coding and complex multi-step tasks.”
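For a rough sense of scale, the figures in that quote can be turned into GPU-hours. The sketch below is a back-of-the-envelope calculation only: it assumes all 224 GPUs ran around the clock for the full 21 days, which Ai2 does not state, and the function name `gpu_hours` is purely illustrative.

```python
# Back-of-the-envelope compute estimate for the extended RL run.
# Inputs taken from the Ai2 quote: 21 additional days on 224 GPUs.
# Assumption (not stated by Ai2): every GPU was busy 24 hours a day.

HOURS_PER_DAY = 24


def gpu_hours(days: float, num_gpus: int) -> float:
    """Return total GPU-hours for a run lasting `days` on `num_gpus` GPUs."""
    return days * HOURS_PER_DAY * num_gpus


extra_gpu_hours = gpu_hours(days=21, num_gpus=224)
print(f"{extra_gpu_hours:,.0f} GPU-hours")  # prints "112,896 GPU-hours"
```

Under that assumption, the additional training comes to roughly 113,000 GPU-hours. Any downtime or restarts would push the real figure lower.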

To get to Olmo 3.1 Instruct, Ai2 said its researchers applied the recipe behind the smaller 7B Instruct model to the larger model.

Olmo 3.1 Instruct 32B is “optimized for chat, tool use and multi-turn dialogue – making it a significantly more powerful sibling of Olmo 3 Instruct 7B and ready for real-world applications,” Ai2 said in a post on X.

For now, the new checkpoints are available on the Ai2 Playground and Hugging Face, with API access coming soon.

Better performance in benchmarks

The Olmo 3.1 models performed well in benchmark tests, predictably beating out the Olmo 3 models.

The Olmo 3.1 Think outperformed the Qwen 3 32B models in the AIME 2025 benchmark and achieved results similar to the Gemma 27B.

Olmo 3.1 Instruct performed well against its open-source counterparts, even beating models like Gemma 3 on the math benchmark.

“As for Olmo 3.1 32B Instruct, it is a larger-scale, instruction-specific model built for chat, tooling, and multi-turn dialogue. Olmo 3.1 32B Instruct is our most powerful fully open chat model to date and, in our estimation, our most powerful 32B scale fully open instruction model,” the company said.

Ai2 has also upgraded its RL-Zero 7B models for math and coding. The company reported on X that both models benefited from longer and more stable training runs.

Commitment to transparency and open source

Ai2 previously told VentureBeat that it designed the Olmo 3 family of models to give enterprises and research labs greater control and understanding of the data and training that went into the model.

Organizations can add to the model’s data mix and retrain it so that it also learns from what has been added.

This has long been a commitment of Ai2, which also offers a tool called OlmoTrace that tracks how LLM outputs correspond to training data.

“Together, Olmo 3.1 Think 32B and Olmo 3.1 Instruct 32B show that openness and performance can evolve together. By extending the flow of the same model, we continue to improve capabilities while maintaining end-to-end transparency of data, code, and training decisions,” said Ai2.
