Tuesday, March 10, 2026

New training method improves AI's multimodal reasoning with smaller, smarter data sets


Scientists from MiroMind AI and several Chinese universities have published OpenMMReasoner, a new training framework that improves the multimodal reasoning capabilities of language models.

The framework uses a two-stage process. First, it refines the base model with a curated dataset in a supervised fine-tuning (SFT) stage. Then, a reinforcement learning (RL) stage guides the model to reason more effectively on tasks involving both text and visual data.

Experiments show that models trained with OpenMMReasoner outperform other leading visual reasoning models, even when trained on a smaller, higher-quality dataset. The framework and all its resources, including the trained 7B model, are fully open source, providing a reliable foundation for building applications that require traceability and reliability.

According to Kaichen Zhang, co-author of a research paper describing the new method, OpenMMReasoner offers significant benefits to companies compared with massive, closed systems. "The smaller open-source reasoning model has practical advantages: enterprises can deploy it locally, reduce latency, reduce the token costs associated with long chains of thought, maintain full control over their data, and [it] can be fine-tuned to suit specific downstream tasks," he told VentureBeat.

The challenge of transparent multimodal reasoning

Recent advances in reinforcement learning with verifiable rewards (RLVR) have greatly improved the reasoning abilities of large language models (LLMs). RLVR trains LLMs to generate chain-of-thought (CoT) tokens, which mimic the reasoning processes used by humans, before producing the final answer. This improves the model's ability to solve complex reasoning tasks such as math and coding.

Motivated by this success, researchers have applied similar RL-based methods to large multimodal models (LMMs), showing that the benefits can extend beyond text and improve visual comprehension and problem-solving across modalities.

However, a major obstacle has been the lack of transparency in training recipes. Many studies on multimodal reasoning do not provide detailed information about their data collection and training processes, making it difficult to reproduce their results or understand what makes these models work.

“This lack of openness limits reproducibility and hinders a deeper understanding of how reasoning LMMs are actually constructed and how their learning dynamics evolve,” the researchers note.

The OpenMMReasoner recipe

OpenMMReasoner fills this gap with a fully transparent and scalable training recipe built on open-source LMM components. The researchers found that creating high-quality datasets depends on scaling data diversity. While using a variety of data sources is important, the key improvement was increasing the variety of correct answers to the same question.

The first stage of the recipe is a three-step supervised fine-tuning (SFT) pipeline. It begins with data ingestion, during which the team collected approximately 103,000 raw question-answer pairs from public datasets covering general visual Q&A and reasoning tasks. Next comes a data distillation step, which uses a powerful model (Qwen3-VL-235B-Instruct) to generate new, high-quality reasoning traces for selected questions. (These data are then used to train a smaller model.)

To increase the diversity of responses, the team generated multiple validated reasoning traces for each question, expanding the dataset to 583,000 samples. Finally, a "domain mixing" step added data from mathematical reasoning domains to further generalize the model's capabilities, resulting in a final SFT dataset of 874,000 examples.
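The distillation-and-validation idea can be sketched in a few lines: sample several candidate reasoning traces per question from a teacher model, and keep only those whose final answer matches the ground truth. The function names and the stub teacher below are illustrative assumptions, not code from the OpenMMReasoner release.

```python
import random

def stub_teacher(question: str, seed: int) -> tuple[str, str]:
    """Stand-in for a large teacher model (e.g. Qwen3-VL-235B-Instruct).
    Returns (reasoning_trace, final_answer); deliberately wrong sometimes."""
    rng = random.Random(seed)
    answer = "4" if rng.random() < 0.7 else "5"
    return f"step-by-step trace #{seed} for: {question}", answer

def distill_traces(question: str, gold_answer: str, n_samples: int = 8):
    """Sample several traces per question; keep only verifiably correct ones."""
    kept = []
    for seed in range(n_samples):
        trace, answer = stub_teacher(question, seed)
        if answer == gold_answer:  # rule-based answer verification
            kept.append({"question": question, "trace": trace, "answer": answer})
    return kept

# Multiple validated traces per question is what expands the SFT dataset.
samples = distill_traces("What is 2 + 2?", gold_answer="4")
print(len(samples), "validated traces kept")
```

In practice the teacher would be a real LMM and the verifier would parse the model's boxed or tagged final answer, but the filtering logic is the same.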

The second stage is the RL recipe, which uses a smaller dataset of 74,000 samples drawn from domains such as science, math and puzzles. The model is trained with a composite reward function that accounts for both the correctness of the final answer and the consistency of the output format. To improve efficiency, the process includes an overthinking penalty that discourages the model from generating overly long responses, a common problem with RL-trained reasoning models, which can learn to produce needlessly long reasoning sequences that add overhead and slow down responses.
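A minimal sketch of this kind of composite reward follows: a correctness term, a format-consistency term, and a length penalty that kicks in past a token budget. The weights, the `<answer>` tag format, and the budget are illustrative assumptions, not values from the paper.

```python
import re

def reward(response: str, gold_answer: str, max_tokens: int = 512) -> float:
    """Composite reward: correctness + format consistency - overthinking."""
    tokens = response.split()  # crude whitespace token count, for illustration
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    format_ok = match is not None
    correct = format_ok and match.group(1).strip() == gold_answer

    r = 1.0 if correct else 0.0
    r += 0.2 if format_ok else -0.2          # consistent output format
    if len(tokens) > max_tokens:             # overthinking penalty
        r -= 0.001 * (len(tokens) - max_tokens)
    return r

print(reward("brief reasoning <answer>4</answer>", "4"))  # correct and concise
```

A correct, well-formatted, concise response scores highest; the same correct answer buried in thousands of tokens is penalized, which is what steers the policy away from runaway reasoning chains.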

This recipe can serve as a model for companies training their own models. "For companies with limited domain-specific data, a viable strategy is to first increase the diversity of responses in the existing dataset, and then use domain mixing to integrate this domain data into a general reasoning recipe like ours," Zhang explained. "This allows the model to acquire strong general reasoning skills while adapting to industry-specific tasks without needing millions of samples."

A more efficient and capable reasoning model

According to Zhang, the step-by-step process fundamentally changes the reliability of the model's outputs. "Traditional models often 'jump' directly to the answer, which means they only examine a narrow part of the reasoning space," he said. "On the other hand, a reasoning approach forces the model to explicitly explore many intermediate steps… [allowing it to] traverse much deeper paths and arrive at answers with much greater internal coherence."

The researchers used the OpenMMReasoner recipe to generate data for fine-tuning the open-source Qwen2.5-VL-7B-Instruct vision language model. The result is a high-performance LMM that consistently outperforms state-of-the-art methods such as Open Vision Reasoner (OVR) on a wide range of multimodal reasoning benchmarks. The SFT stage alone produces a strong base model that achieves better performance and data efficiency than other SFT approaches, despite using a much smaller training dataset.

The next phase of RL further sharpens and stabilizes these abilities, leading to more consistent and improved performance. After RL, the final model achieves state-of-the-art performance on several benchmarks, including WeMath, MathVerse, and MathVista.

One key finding was that as the model improved on multimodal reasoning, it also showed “a gradual emergence of textual reasoning behaviors, suggesting a transfer of reasoning competence from multimodal to purely linguistic domains,” the researchers note. This means that skills learned in one mode can improve performance in another.

“Our results show that strengthening multimodal reasoning can even improve text-only math skills — evidence that basic logical abilities are transferable between modalities,” Zhang said. “Looking ahead, we expect these methods to expand to video and audio.”

The researchers also found that token efficiency is key. Although allowing the model to generate longer reasoning chains can improve performance, excessive token use hurts efficiency. The results show that setting a smaller "computation budget" can achieve comparable or even better accuracy, a critical factor for cost-effective enterprise deployments.
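A back-of-the-envelope calculation shows why the computation budget matters for serving costs. The request volume, token counts, and per-token price below are made-up illustrative assumptions, not figures from the paper.

```python
def monthly_cost(requests: int, avg_output_tokens: int,
                 price_per_1k_tokens: float) -> float:
    """Output-token serving cost for a month of traffic."""
    return requests * avg_output_tokens / 1000 * price_per_1k_tokens

# Hypothetical: 1M requests/month at $0.002 per 1K output tokens.
baseline = monthly_cost(1_000_000, 2048, 0.002)  # verbose reasoning traces
budgeted = monthly_cost(1_000_000, 768, 0.002)   # capped reasoning budget
print(f"baseline ${baseline:,.0f}/mo vs budgeted ${budgeted:,.0f}/mo")
```

Under these assumed numbers, capping average reasoning length cuts output-token spend by well over half, which is the cost lever the overthinking penalty is meant to pull.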

By open-sourcing all components of their work, the researchers provide a reproducible blueprint of the entire process. For enterprise teams, this transparency is invaluable. "For business leaders concerned about vendor lock-in, implicit bias, or opaque data sources, this level of transparency is essential," Zhang said. "It enables teams to validate data, adapt the pipeline for new domains, and maintain long-term independence from any vendor."
