Thursday, March 12, 2026

AI21’s Jamba Reasoning 3B redefines what “small” means in LLMs – 250K context on a laptop


The latest addition to the wave of small enterprise models comes from AI21, which is betting that moving models onto devices will ease the strain on data centers.

AI21’s Jamba Reasoning 3B is a “tiny” open-source model that can run extended reasoning, code generation and produce responses based on ground truth. Jamba Reasoning 3B supports a context window of more than 250,000 tokens and can run inference on edge devices.

The company said Jamba Reasoning 3B runs on devices such as laptops and mobile phones.

AI21 CEO Ori Goshen told VentureBeat that the company sees more enterprise use cases for small models, mainly because moving most of the workload onto devices frees up data centers.

“What we’re seeing now in the industry is an economics problem, where there’s very expensive data center build-out, and the revenue generated from data centers versus the depreciation rate of all their systems shows the math isn’t adding up,” Goshen said.

He added that going forward, “the industry will be hybrid in the sense that some of the compute will happen locally on devices, and other workloads will go to GPUs.”

Tested on a MacBook

Jamba Reasoning 3B combines Transformer and Mamba architectures, which lets it run its 250K-token context window on devices. AI21 said this can deliver 2-4x faster inference speeds, and Goshen said the Mamba architecture contributes significantly to the model’s speed.

Jamba Reasoning 3B’s hybrid architecture also reduces its memory requirements, which in turn lowers its compute needs.
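As a rough illustration of why the hybrid design helps at long context, consider the key/value cache that full-attention layers must keep for every token; the figures below use assumed hyperparameters for a generic 3B model, not AI21’s published configuration.

```python
# Back-of-the-envelope KV-cache math for a 250K-token context.
# All hyperparameters below are illustrative assumptions, not AI21's published config.

def kv_cache_bytes(attn_layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    # Each attention layer caches a key and a value vector per KV head, per token.
    return 2 * attn_layers * kv_heads * head_dim * seq_len * bytes_per_value

SEQ_LEN = 250_000

# A hypothetical all-attention 3B model: 32 attention layers, 8 KV heads, head_dim 128, bf16.
full_attention = kv_cache_bytes(32, 8, 128, SEQ_LEN)

# A hybrid where most layers are Mamba (constant-size state) and only ~1 in 8 is attention.
hybrid = kv_cache_bytes(4, 8, 128, SEQ_LEN)

print(f"All-attention KV cache at 250K tokens: ~{full_attention / 1e9:.1f} GB")  # ~32.8 GB
print(f"Hybrid KV cache at 250K tokens:        ~{hybrid / 1e9:.1f} GB")          # ~4.1 GB
```

Because Mamba layers carry only a small fixed-size state, most of the per-token cache disappears, which is what makes a 250K-token window plausible on laptop-class memory.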

AI21 tested the model on a standard MacBook Pro and stated that it can process 35 tokens per second.
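For a sense of what a local run involves, the sketch below loads a small open model with Hugging Face Transformers; the model ID shown is an assumed placeholder for AI21’s published checkpoint, and the exact repository name and loading options may differ.

```python
# Minimal sketch of running a ~3B open model locally with Hugging Face Transformers.
# The model ID is an assumed placeholder; check AI21's release for the real checkpoint name.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "ai21labs/AI21-Jamba-Reasoning-3B"  # assumption, not confirmed here

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # a 3B model in bf16 needs roughly 6 GB for weights
    device_map="auto",           # falls back to CPU/Apple Silicon if no GPU is present
)

prompt = "List three agenda items for tomorrow's project sync."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```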

Goshen said the model works best on tasks such as function calling, policy-based generation and tool routing. Simple requests, such as asking for information about an upcoming meeting and asking the model to put together an agenda for it, can be handled on devices, while more complicated reasoning tasks can be reserved for GPU clusters.
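The split Goshen describes can be pictured as a simple router that keeps light tasks on the local model and escalates heavier reasoning to a hosted endpoint; the sketch below is illustrative only, with hypothetical function names and a placeholder heuristic rather than AI21’s implementation.

```python
# Illustrative sketch of a hybrid on-device/cloud split; function names and the
# routing heuristic are hypothetical, not AI21's implementation.
from dataclasses import dataclass

@dataclass
class Request:
    prompt: str
    needs_deep_reasoning: bool  # could come from a lightweight classifier or task type

def run_on_device(prompt: str) -> str:
    # Call the local small model (e.g., the loading sketch above) for function calling,
    # tool routing and short grounded answers.
    return f"[on-device answer to: {prompt!r}]"

def run_on_cluster(prompt: str) -> str:
    # Call a hosted model on GPU infrastructure for long, multi-step reasoning.
    return f"[GPU-cluster answer to: {prompt!r}]"

def route(req: Request) -> str:
    return run_on_cluster(req.prompt) if req.needs_deep_reasoning else run_on_device(req.prompt)

print(route(Request("What time is my next meeting?", needs_deep_reasoning=False)))
print(route(Request("Analyze these 200 pages of contracts for risk.", needs_deep_reasoning=True)))
```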

Small models in the enterprise

Enterprises have shown interest in using a mix of small models, some purpose-built for their industry and others that are distilled versions of larger LLMs.

In September, Meta released MobileLLM-R1, a family of reasoning models ranging from 140M to 950M parameters. These models are intended for math, coding and scientific reasoning rather than chat applications, and can run on compute-constrained devices.

Google’s Gemma was one of the first small models to hit the market, designed for portable devices such as laptops and mobile phones. The Gemma family has since been expanded.

Companies like FICO have also begun building their own models. FICO launched its small FICO Focused Language and FICO Focused Sequence models, which will only answer questions about finance.

Goshen said the key difference with AI21’s model is that it is even smaller than most small models, yet it can perform reasoning tasks without sacrificing speed.

Comparative testing

In comparative tests, Jamba Reasoning 3B performed strongly against other small models, including Qwen 4B, Meta’s Llama 3.2 3B and Microsoft’s Phi-4-Mini.

It outperformed all of those models on IFBench and Humanity’s Last Exam, though it came in second to Qwen 4B on MMLU-Pro.

Goshen said another advantage of small models such as Jamba Reasoning 3B is that they are highly controllable and give enterprises better privacy options, because inference is not sent to a server elsewhere.

“I think there is a world where you can optimize for customer needs and experience, and models that are stored on devices are part of it,” he said.
