The Allen Institute for Artificial Intelligence (Ai2) today unveiled Molmo, a family of state-of-the-art, open-source, multimodal AI models that outperform leading competitors on multiple third-party benchmarks, including OpenAI’s GPT-4o, Anthropic’s Claude 3.5 Sonnet, and Google’s Gemini 1.5.
This allows the models to ingest and analyze images sent to them by users, much like leading proprietary foundation models do.
However, Ai2 also noted in a post on X that Molmo uses “1,000x less data” than its proprietary competitors, thanks to some clever new training techniques described in more detail below and in a technical report published by the company, which was founded by Paul Allen and is led by Ali Farhadi.
Ai2 says the release underscores its commitment to open research by offering powerful models, complete with open weights and data, to the broader community – and of course, to companies looking for solutions they can fully own, control and customize.
It comes just two weeks after Ai2 released another open model, OLMoE, a “mixture of experts,” or combination of smaller models, designed for cost-effectiveness.
Closing the Gap Between Open and Proprietary AI
Molmo consists of four main models with different sizes, parameters and capabilities:
- Molmo-72B (72 billion parameters, or settings; the flagship model, based on Alibaba Cloud’s open-source Qwen2-72B)
- Molmo-7B-D (“demo model” based on Alibaba’s Qwen2-7B)
- Molmo-7B-O (based on Ai2’s OLMo-7B model)
- MolmoE-1B (based on Ai2’s OLMoE-1B-7B mixture-of-experts model, which Ai2 says “nearly matches GPT-4V’s performance in both academic and user-preference tests”).
These models achieve high performance on a range of third-party benchmarks, outperforming many proprietary alternatives. And all are available under the Apache 2.0 license, enabling virtually any type of research and commercial (e.g., enterprise-grade) use.
It is worth noting that Molmo-72B leads the academic evaluations, achieving the highest score on 11 key benchmarks and ranking second in user preference, just behind GPT-4o.
Vaibhav Srivastav, a machine learning engineer at AI code repository Hugging Face, commented on the release on X, emphasizing that Molmo offers an excellent alternative to closed systems and sets a new standard for open, multimodal AI.
In addition, Google DeepMind robotics researcher Ted Xiao took to X to praise the inclusion of pointing data in Molmo, which he says represents a breakthrough for visual grounding in robotics.
With this feature, Molmo can provide visual explanations and interact more effectively with the physical environment, something that most other multimodal models currently lack.
These models are not only capable, but also completely open, allowing researchers and developers to access cutting-edge technologies and build solutions based on them.
Advanced Model Architecture and Training Approach
Molmo’s architecture is designed to maximize performance and efficiency. All models utilize the OpenAI ViT-L/14 336px CLIP model as a vision encoder, which processes multi-scale, multi-cropped images into vision tokens.
These tokens are then pooled to reduce how many of them the language model has to process and projected into the language model’s input space via a multilayer perceptron (MLP) connector.
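For a concrete picture of that flow, here is a minimal, hypothetical PyTorch sketch (not Ai2’s actual code): patch features from a CLIP ViT-L/14 encoder are grouped to shorten the sequence and pushed through an MLP connector into the language model’s embedding space. The dimensions, pooling factor, and class name are illustrative assumptions.

```python
# Illustrative sketch only; dimensions and pooling are assumptions, not Ai2's implementation.
import torch
import torch.nn as nn

class VisionToLLMConnector(nn.Module):
    """Groups CLIP vision tokens and projects them into the LLM's embedding space."""
    def __init__(self, vit_dim=1024, llm_dim=3584, pool=2):
        super().__init__()
        self.pool = pool  # number of adjacent vision tokens merged into one
        self.mlp = nn.Sequential(             # the MLP "connector"
            nn.Linear(vit_dim * pool, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_tokens: torch.Tensor) -> torch.Tensor:
        # vision_tokens: (batch, num_patches, vit_dim) from the ViT-L/14 vision encoder
        b, n, d = vision_tokens.shape
        n = n - n % self.pool                 # drop a trailing patch if it doesn't fit a group
        grouped = vision_tokens[:, :n].reshape(b, n // self.pool, d * self.pool)
        # Result: (batch, n / pool, llm_dim) -- ready to prepend to the text embeddings
        return self.mlp(grouped)

# Example: a 336px image yields 24x24 = 576 patch tokens at patch size 14
features = torch.randn(1, 576, 1024)
prefix = VisionToLLMConnector()(features)
print(prefix.shape)  # torch.Size([1, 288, 3584])
```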
Molmo’s training strategy consists of two key stages:
- Initial multimodal training: In this step, models are trained to generate captions using newly collected, detailed image descriptions provided by human annotators. This high-quality dataset, called PixMo, is a key factor in Molmo’s powerful performance.
- Supervised fine-tuning: The models are then refined on a diverse mix of datasets, including standard academic benchmarks and newly created datasets, which enable them to tackle complex real-world tasks such as reading documents, visual reasoning, and even pointing.
Unlike many contemporary models, Molmo does not rely on reinforcement learning from human feedback (RLHF), instead focusing on a carefully tuned training pipeline that updates all model parameters starting from their pre-trained state.
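A schematic of that two-stage recipe, assuming both stages optimize plain next-token cross-entropy with every parameter trainable (and no RLHF stage), might look like the sketch below; the model, data loaders, and hyperparameters are placeholders, not Ai2’s training code.

```python
# Schematic two-stage recipe; model, loaders, and hyperparameters are placeholders.
import torch

def train_stage(model, dataloader, lr, steps):
    # All parameters are updated in both stages -- no frozen towers, no RLHF.
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    for _, batch in zip(range(steps), dataloader):
        loss = model(**batch).loss          # standard next-token cross-entropy
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# Stage 1: caption generation on PixMo's dense, human-written image descriptions.
# train_stage(molmo, pixmo_caption_loader, lr=1e-5, steps=...)

# Stage 2: supervised fine-tuning on a mix of academic benchmarks plus new
# datasets covering document reading, visual reasoning, and pointing.
# train_stage(molmo, sft_mixture_loader, lr=5e-6, steps=...)
```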
Higher Scores on Key Benchmarks
Molmo models have shown impressive results in numerous comparative tests, especially when compared to proprietary models.
For example, Molmo-72B scored 96.3 on DocVQA and 85.5 on TextVQA, outperforming both Gemini 1.5 Pro and Claude 3.5 Sonnet in these categories. It also outperforms GPT-4o on AI2D (Ai2’s own benchmark, short for “A Diagram Is Worth A Dozen Images,” a dataset containing over 5,000 primary school science diagrams and more than 150,000 rich annotations).
The models also perform well on visual grounding tasks, with Molmo-72B achieving the highest performance on the RealWorldQA benchmark, making it particularly promising for applications in robotics and complex multimodal reasoning.
Open Access and Future Releases
Ai2 has made these models and datasets available on Hugging Face, with full compatibility with popular AI frameworks such as Transformers.
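As a rough idea of what that looks like in practice, the snippet below follows the usage pattern shown on the Molmo model cards on Hugging Face at the time of writing; the checkpoint ID and the `processor.process` / `model.generate_from_batch` helpers come from the repositories’ custom (`trust_remote_code`) code, so check the current model card before relying on them.

```python
# Sketch based on the Molmo model-card usage; checkpoint and helper names may change.
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig
from PIL import Image
import requests

repo = "allenai/Molmo-7B-D-0924"  # one of the released checkpoints (assumed ID)
processor = AutoProcessor.from_pretrained(repo, trust_remote_code=True,
                                          torch_dtype="auto", device_map="auto")
model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True,
                                             torch_dtype="auto", device_map="auto")

image = Image.open(requests.get("https://picsum.photos/id/237/536/354", stream=True).raw)

# The repo's processor bundles image preprocessing and prompt tokenization.
inputs = processor.process(images=[image], text="Describe this image.")
inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}

output = model.generate_from_batch(
    inputs,
    GenerationConfig(max_new_tokens=200, stop_strings="<|endoftext|>"),
    tokenizer=processor.tokenizer,
)
generated = output[0, inputs["input_ids"].size(1):]
print(processor.tokenizer.decode(generated, skip_special_tokens=True))
```

Swapping the prompt for something like “Point to the dog” exercises the pointing capability Xiao highlighted above, with the model grounding its answer in image coordinates.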
Open access is part of Ai2’s broader vision to foster innovation and collaboration across the AI community.
Over the next few months, Ai2 plans to release additional models, training code, and an extended version of the technical report, further expanding the resources available to researchers.
Those interested in exploring Molmo’s capabilities can now try the public demo and several model checkpoints available on Molmo’s official website.