Sunday, March 8, 2026

Bolmo's architecture enables capable byte-level LM training without quality loss


Enterprises that want multilingual models without tokenizers are increasingly turning to byte-level language models to reduce the fragility of noisy or low-resource text. To make this approach practical at scale, the Allen Institute for AI (Ai2) has introduced Bolmo, a new family of models built by “byting” existing Olmo 3 models, reusing their framework and capabilities.

The company has released two versions, Bolmo 7B and Bolmo 1B, which Ai2 calls “the first fully open byte-level language model.” The company stated that both models are competitive with – and in some cases even better than – other byte- and character-level models.

Byte-level language models operate directly on raw UTF-8 bytes, eliminating the need for a predefined vocabulary or tokenizer. This allows them to handle spelling errors, low-resource languages, and unconventional text more reliably – key requirements for moderation, edge deployments, and multilingual applications.
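As a concrete illustration (not Ai2's code), the input pipeline of a byte-level model reduces to UTF-8 encoding – every string maps into the same 256-symbol space, so there is no out-of-vocabulary failure mode:

```python
# A byte-level LM consumes raw UTF-8 bytes instead of subword token IDs,
# so its "vocabulary" is just the 256 possible byte values (plus any
# special tokens). Illustrative sketch; not Ai2's actual pipeline.

def to_byte_ids(text: str) -> list[int]:
    """Map text to the integer byte sequence a byte-level model would consume."""
    return list(text.encode("utf-8"))

# Misspellings, rare scripts, and emoji all map cleanly into 0..255:
print(to_byte_ids("héllo"))  # 'é' becomes the two UTF-8 bytes 0xC3 0xA9
```

This is why byte-level models degrade more gracefully on noisy or low-resource text than subword models, whose fixed vocabularies can shatter unfamiliar strings into many rare tokens.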

For enterprises deploying AI across multiple languages, handling noisy user input, or operating in constrained environments, tokenizer-free models offer a way to reduce operational complexity. Ai2’s Bolmo is an attempt to put this approach into practice at scale – without retraining from scratch.

How Bolmo works and how it was built

Ai2 said it trained the Bolmo models on the Dolma 3 dataset – the same data used to train Olmo’s flagship models – along with some open-source datasets and character-level data.

The company said its goal “is to provide a repeatable, testable blueprint for byting strong subword language models in a way that the community can adopt and extend.” To achieve this goal, Ai2 will share its checkpoints, code and full paper to help other organizations build byte-level models on top of the Olmo ecosystem.

Because training a byte-level model entirely from scratch can be expensive, Ai2 researchers instead “byted” an existing Olmo 3 7B checkpoint in two stages.

In the first stage, Ai2 froze the Olmo 3 transformer so that only specific parts – the local encoder and decoder, the boundary predictor, and the language modeling head – were trained. This stage was intended to be “cheap and fast” and required just 9.8 billion tokens.

The second stage unfreezes the full model and continues training on additional tokens. Ai2 claims that the byte-level approach allows Bolmo to avoid the vocabulary bottlenecks that limit established subword models.
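The two-stage recipe can be sketched as a freeze/unfreeze schedule over parameter groups. This is a minimal illustration in plain Python; the group names (olmo3_transformer, local_encoder, and so on) are hypothetical labels for this sketch, not identifiers from Ai2’s codebase:

```python
from dataclasses import dataclass

@dataclass
class ParamGroup:
    """A named bundle of model parameters with a trainable flag."""
    name: str
    trainable: bool = True

# Hypothetical decomposition of a "byted" model into parameter groups.
model = [
    ParamGroup("olmo3_transformer"),   # pretrained subword backbone
    ParamGroup("local_encoder"),       # new byte-level component
    ParamGroup("local_decoder"),       # new byte-level component
    ParamGroup("boundary_predictor"),  # new byte-level component
    ParamGroup("lm_head"),             # language modeling head
]

def stage_one(groups: list[ParamGroup]) -> None:
    # Stage 1 ("cheap and fast"): freeze the pretrained transformer and
    # train only the newly added byte-level components.
    for g in groups:
        g.trainable = (g.name != "olmo3_transformer")

def stage_two(groups: list[ParamGroup]) -> None:
    # Stage 2: unfreeze everything and continue training end to end.
    for g in groups:
        g.trainable = True
```

In a real framework the same schedule would typically be expressed by toggling gradient flags on parameter subsets (e.g. `requires_grad` in PyTorch), which is what makes the first stage so much cheaper than full pretraining.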

Good results among its competitors

Byte-level language models are not as widespread as small language models or LLMs, but they are a growing field of research. Meta released its BLT (Byte Latent Transformer) architecture research last year, aiming to offer a model that is robust, processes raw data, and does not rely on fixed vocabularies.

Other research models in this space include ByT5, Stanford’s MrT5, and Psi.

Ai2 assessed Bolmo using its evaluation suite covering math, STEM reasoning, question answering, general knowledge and code.

Bolmo 7B showed strong performance on character-focused tests such as CUTE and EXECUTE, and improved accuracy compared to the base Olmo 3 model.

Bolmo 7B also outperformed comparably sized models on coding, math, multiple-choice question answering, and character-level understanding.

Why enterprises may choose byte-level models

Enterprises see value in a hybrid model structure, using a combination of models and model sizes.

Ai2 says organizations should consider byte-level models not only for robustness and multilingual understanding, but also because they “connect naturally with the existing model ecosystem.”

“The key advantage of a dynamic hierarchical configuration is that compression becomes a tunable knob,” the company said.

For enterprises that already run heterogeneous model stacks, Bolmo suggests that byte-level models may no longer be purely academic. By retrofitting a strong subword model rather than training from scratch, Ai2 signals a lower-risk path for organizations that want robustness without abandoning existing infrastructure.
