The rise of deep research features and other AI-powered analysis has given way to more models and services that aim to simplify the process of reading the documents enterprises actually use.
Canadian AI company Cohere is banking on its models, including a newly released visual model, to make the case that deep research features should also be optimized for enterprise use cases.
The company released Command A Vision, a visual model focused specifically on enterprise use cases, built on top of its Command A model. The 112-billion-parameter model can "unlock valuable insights from visual data and make highly accurate, data-driven decisions through optical character recognition (OCR) and image analysis," the company says.
"Whether it's interpreting product manuals with complex diagrams or analyzing photographs of real-world scenes for risk detection, Command A Vision excels at tackling the most demanding enterprise vision challenges," the company said in a blog post.
This means Command A Vision can read and analyze the most common types of images enterprises need: graphs, charts, diagrams, scanned documents and PDFs.
Because it is built on Command A, the Command A Vision architecture requires two or fewer GPUs, just like the text model. The vision model also retains Command A's text capabilities to read words in images, and understands at least 23 languages. Cohere said that, unlike other models, Command A Vision reduces the total cost of ownership for enterprises and is fully optimized for enterprise use cases.
How Cohere built Command A Vision
Cohere said it followed a LLaVA architecture to build its Command A models, including the visual model. This architecture turns visual features into soft vision tokens, which can be divided into different tiles.
These tiles are passed into the Command A text tower, "a dense, 111B parameter textual LLM," the company said. "In this way, a single image consumes up to 3,328 tokens."
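To get a feel for where a budget like 3,328 tokens per image comes from, here is a minimal sketch of LLaVA-style tiling arithmetic. The tile size, tokens-per-tile count and thumbnail tile are illustrative assumptions, not figures Cohere has published; only the 3,328-token cap comes from the company.

```python
import math

def image_token_cost(width, height, tile_size=512, tokens_per_tile=256,
                     max_total_tokens=3328):
    """Estimate soft vision tokens for one image under a LLaVA-style
    tiling scheme. tile_size and tokens_per_tile are illustrative
    assumptions; only max_total_tokens reflects Cohere's stated cap."""
    tiles_x = math.ceil(width / tile_size)   # tiles across
    tiles_y = math.ceil(height / tile_size)  # tiles down
    tiles = tiles_x * tiles_y + 1            # +1 for a global thumbnail tile
    return min(tiles * tokens_per_tile, max_total_tokens)

# A letter-size page scanned at roughly 150 DPI:
print(image_token_cost(1275, 1650))  # hits the 3,328-token cap
```

Under these assumed numbers, a full scanned page tiles into a 3x4 grid plus a thumbnail, which is exactly where a per-image cap starts to matter for context-window budgeting.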
Cohere said it trained the visual model in three stages: vision-language alignment, supervised fine-tuning (SFT) and post-training reinforcement learning with human feedback (RLHF).
"This approach enables the mapping of image encoder features to the language model embedding space," the company said. "In contrast, during the SFT stage, we simultaneously trained the vision encoder, the vision adapter and the language model on a diverse range of multimodal tasks."
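The staged recipe can be summarized as a small configuration sketch. The module names and the choice of what is frozen in the alignment and RLHF stages follow common LLaVA-style practice and are assumptions; the source only confirms that all three components train together during SFT.

```python
# Hypothetical sketch of a three-stage vision-language training recipe.
# Freeze/train choices for alignment and RLHF are assumptions based on
# common LLaVA-style setups, not Cohere's published configuration.
STAGES = [
    {
        "name": "vision-language alignment",
        "trainable": {"vision_adapter"},  # learn to map image features into LLM space
        "frozen": {"vision_encoder", "language_model"},
    },
    {
        "name": "supervised fine-tuning (SFT)",
        # Per Cohere, all three components are trained simultaneously here.
        "trainable": {"vision_encoder", "vision_adapter", "language_model"},
        "frozen": set(),
    },
    {
        "name": "RLHF post-training",
        "trainable": {"language_model"},  # align outputs with human preferences
        "frozen": {"vision_encoder", "vision_adapter"},
    },
]

def trainable_modules(stage_name):
    """Return which modules receive gradient updates in a given stage."""
    for stage in STAGES:
        if stage["name"] == stage_name:
            return stage["trainable"]
    raise KeyError(stage_name)

print(trainable_modules("supervised fine-tuning (SFT)"))
```

The design intuition: the cheap alignment stage teaches the adapter to translate image features before the expensive full-model SFT pass, and RLHF then tunes behavior without disturbing the learned visual mapping.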
Visual AI for the enterprise
Benchmark tests showed Command A Vision outperforming other models with similar visual capabilities.
Cohere pitted Command A Vision against OpenAI's GPT-4.1, Meta's Llama 4 Maverick, and Mistral's Pixtral Large and Mistral Medium 3 in nine benchmark tests. The company did not mention whether it tested the model against Mistral's OCR-focused API, Mistral OCR.
Command A Vision outscored the other models in tests such as ChartQA, OCRBench, AI2D and TextVQA. Overall, Command A Vision had an average score of 83.1%, compared to 78.6% for GPT-4.1, 80.5% for Llama 4 Maverick and 78.3% for Mistral Medium 3.
Most large language models (LLMs) are now multimodal, meaning they can generate or understand visual media such as photos or videos. But enterprises generally work with more graphical documents, such as charts and PDFs, so extracting information from these unstructured data sources often proves difficult.
With the rise of deep research, demand for models capable of reading, analyzing and even downloading unstructured data has increased.
Cohere also said it is offering Command A Vision on an open-weights basis, in the hope that enterprises looking to move away from closed or proprietary models will start using its products. So far, there is some developer interest.
