The Allen Institute for Artificial Intelligence (Ai2) today unveiled Molmo, a family of state-of-the-art, open-source, multimodal AI models that outperform leading competitors on multiple third-party benchmarks, including OpenAI’s GPT-4o, Anthropic’s Claude 3.5 Sonnet, and Google’s Gemini 1.5.
This allows the models to ingest and analyze images sent to them by users, much like leading proprietary foundation models do.
However, Ai2 also noted in a post on X that Molmo uses “1,000x less data” than its proprietary competitors, thanks to some clever new training techniques described in more detail below and in a technical report published by the company, which was founded by Paul Allen and is led by Ali Farhadi.
Ai2 says the release underscores its commitment to open research by offering powerful models, complete with open weights and data, to the broader community – and of course, to companies looking for solutions they can fully own, control and customize.
It comes just two weeks after Ai2 released another open model, OLMoE, a “mixture of experts,” or combination of smaller models, designed for cost-effectiveness.
Closing the Gap Between Open and Proprietary AI
Molmo consists of four main models with different sizes, parameters and capabilities:
- Molmo-72B (72 billion parameters, or settings; the flagship model, based on Alibaba Cloud’s open-source Qwen2-72B)
- Molmo-7B-D (“demo model” based on Alibaba’s Qwen2-7B)
- Molmo-7B-O (based on Ai2’s OLMo-7B model)
- MolmoE-1B (based on Ai2’s OLMoE-1B-7B mixture-of-experts model, which Ai2 says “nearly matches GPT-4V’s performance in both academic and user-preference tests”).
These models achieve high performance on a range of third-party benchmarks, outperforming many proprietary alternatives. And all are available under the Apache 2.0 license, enabling virtually any type of research and commercial (e.g., enterprise-grade) use.
It is worth noting that Molmo-72B leads the academic evaluations, achieving the highest score on 11 key benchmarks and ranking second in user preference, just behind GPT-4o.
Vaibhav Srivastav, a machine learning engineer at AI code repository Hugging Face, commented on the release on X, emphasizing that Molmo offers an excellent alternative to closed systems and sets a new standard for open, multimodal AI.
In addition, Google DeepMind robotics researcher Ted Xiao took to X to praise the inclusion of pointing data in Molmo, which he says represents a breakthrough for visual grounding in robotics.
With this feature, Molmo can provide visual explanations and interact more effectively with the physical environment, something that most other multimodal models currently lack.
These models are not only capable, but also completely open, allowing researchers and developers to access cutting-edge technologies and build solutions based on them.
Advanced Model Architecture and Training Approach
Molmo’s architecture is designed to maximize performance and efficiency. All models utilize the OpenAI ViT-L/14 336px CLIP model as a vision encoder, which processes multi-scale, multi-cropped images into vision tokens.
These tokens are then pooled to reduce how many of them the language model has to process and projected into the language model’s input space via a multilayer perceptron (MLP) connector.
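For a concrete picture of that flow, here is a minimal, hypothetical PyTorch sketch (not Ai2’s actual code): patch features from a CLIP ViT-L/14 encoder are grouped to shorten the sequence and pushed through an MLP connector into the language model’s embedding space. The dimensions, pooling factor, and class name are illustrative assumptions.

```python
# Illustrative sketch only; dimensions and pooling are assumptions, not Ai2's implementation.
import torch
import torch.nn as nn

class VisionToLLMConnector(nn.Module):
    """Groups CLIP vision tokens and projects them into the LLM's embedding space."""
    def __init__(self, vit_dim=1024, llm_dim=3584, pool=2):
        super().__init__()
        self.pool = pool  # number of adjacent vision tokens merged into one
        self.mlp = nn.Sequential(             # the MLP "connector"
            nn.Linear(vit_dim * pool, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_tokens: torch.Tensor) -> torch.Tensor:
        # vision_tokens: (batch, num_patches, vit_dim) from the ViT-L/14 vision encoder
        b, n, d = vision_tokens.shape
        n = n - n % self.pool                 # drop a trailing patch if it doesn't fit a group
        grouped = vision_tokens[:, :n].reshape(b, n // self.pool, d * self.pool)
        # Result: (batch, n / pool, llm_dim) -- ready to prepend to the text embeddings
        return self.mlp(grouped)

# Example: a 336px image yields 24x24 = 576 patch tokens at patch size 14
features = torch.randn(1, 576, 1024)
prefix = VisionToLLMConnector()(features)
print(prefix.shape)  # torch.Size([1, 288, 3584])
```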
Molmo’s training strategy consists of two key stages:
- Initial multimodal training: In this step, models are trained to generate captions using newly collected, detailed image descriptions provided by human annotators. This high-quality dataset, called PixMo, is a key factor in Molmo’s powerful performance.
- Supervised fine-tuning: The models are then refined on a diverse mix of datasets, including standard academic benchmarks and newly created datasets, which enable them to tackle complex real-world tasks such as reading documents, visual reasoning, and even pointing.
Unlike many contemporary models, Molmo does not rely on reinforcement learning from human feedback (RLHF), instead focusing on a carefully tuned training pipeline that updates all model parameters starting from their pre-trained state.
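A schematic of that two-stage recipe, assuming both stages optimize plain next-token cross-entropy with every parameter trainable (and no RLHF stage), might look like the sketch below; the model, data loaders, and hyperparameters are placeholders, not Ai2’s training code.

```python
# Schematic two-stage recipe; model, loaders, and hyperparameters are placeholders.
import torch

def train_stage(model, dataloader, lr, steps):
    # All parameters are updated in both stages -- no frozen towers, no RLHF.
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    for _, batch in zip(range(steps), dataloader):
        loss = model(**batch).loss          # standard next-token cross-entropy
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# Stage 1: caption generation on PixMo's dense, human-written image descriptions.
# train_stage(molmo, pixmo_caption_loader, lr=1e-5, steps=...)

# Stage 2: supervised fine-tuning on a mix of academic benchmarks plus new
# datasets covering document reading, visual reasoning, and pointing.
# train_stage(molmo, sft_mixture_loader, lr=5e-6, steps=...)
```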
Higher Scores on Key Benchmarks
Molmo models have shown impressive results in numerous comparative tests, especially when compared to proprietary models.
For example, Molmo-72B scored 96.3 on DocVQA and 85.5 on TextVQA, outperforming both Gemini 1.5 Pro and Claude 3.5 Sonnet in these categories. It also outperforms GPT-4o on AI2D (Ai2’s own benchmark, short for “A Diagram Is Worth A Dozen Images,” a dataset containing over 5,000 primary school science diagrams and more than 150,000 rich annotations).
The models also perform well on visual grounding tasks, with Molmo-72B achieving the highest performance on the RealWorldQA benchmark, making it particularly promising for applications in robotics and complex multimodal reasoning.
Open Access and Future Releases
Ai2 has made these models and datasets available on Hugging Face, with full compatibility with popular AI frameworks such as Transformers.
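As a rough idea of what that looks like in practice, the snippet below follows the usage pattern shown on the Molmo model cards on Hugging Face at the time of writing; the checkpoint ID and the `processor.process` / `model.generate_from_batch` helpers come from the repositories’ custom (`trust_remote_code`) code, so check the current model card before relying on them.

```python
# Sketch based on the Molmo model-card usage; checkpoint and helper names may change.
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig
from PIL import Image
import requests

repo = "allenai/Molmo-7B-D-0924"  # one of the released checkpoints (assumed ID)
processor = AutoProcessor.from_pretrained(repo, trust_remote_code=True,
                                          torch_dtype="auto", device_map="auto")
model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True,
                                             torch_dtype="auto", device_map="auto")

image = Image.open(requests.get("https://picsum.photos/id/237/536/354", stream=True).raw)

# The repo's processor bundles image preprocessing and prompt tokenization.
inputs = processor.process(images=[image], text="Describe this image.")
inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}

output = model.generate_from_batch(
    inputs,
    GenerationConfig(max_new_tokens=200, stop_strings="<|endoftext|>"),
    tokenizer=processor.tokenizer,
)
generated = output[0, inputs["input_ids"].size(1):]
print(processor.tokenizer.decode(generated, skip_special_tokens=True))
```

Swapping the prompt for something like “Point to the dog” exercises the pointing capability Xiao highlighted above, with the model grounding its answer in image coordinates.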
Open access is part of Ai2’s broader vision to foster innovation and collaboration across the AI community.
Over the next few months, Ai2 plans to release additional models, training code, and an extended version of the technical report, further expanding the resources available to researchers.
Those interested in exploring Molmo’s capabilities can now try the public demo and several model checkpoints available on Molmo’s official website.