The rise of deep research features and other AI-powered analysis has given way to more models and services that aim to simplify the process of reading the documents enterprises actually use.
Canadian AI company Cohere is banking on its models, including a newly released visual model, to make the case that deep research features should also be optimized for enterprise use cases.
The company released Command A Vision, a visual model focused specifically on enterprise use cases, built on top of its Command A model. The 112-billion-parameter model can "unlock valuable insights from visual data and make highly accurate, data-driven decisions through optical character recognition (OCR) and image analysis," the company says.
"Whether it's interpreting product manuals with complex diagrams or analyzing photographs of real-world scenes for risk detection, Command A Vision excels at tackling the most demanding enterprise vision challenges," the company said in a blog post.
This means Command A Vision can read and analyze the most common types of images enterprises need: graphs, charts, diagrams, scanned documents and PDFs.
Because it is built on Command A, the Command A Vision architecture requires two or fewer GPUs, just like the text model. The vision model also retains Command A's text capabilities to read words in images, and understands at least 23 languages. Cohere said that, unlike other models, Command A Vision reduces the total cost of ownership for enterprises and is fully optimized for enterprise use cases.
How Cohere built Command A Vision
Cohere said it followed a LLaVA architecture to build its Command A models, including the visual model. This architecture turns visual features into soft vision tokens, which can be divided into different tiles.
These tiles are passed into the Command A text tower, "a dense, 111B parameter textual LLM," the company said. "In this way, a single image consumes up to 3,328 tokens."
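To get a feel for where a budget like 3,328 tokens per image comes from, here is a minimal sketch of LLaVA-style tiling arithmetic. The tile size, tokens-per-tile count and thumbnail tile are illustrative assumptions, not figures Cohere has published; only the 3,328-token cap comes from the company.

```python
import math

def image_token_cost(width, height, tile_size=512, tokens_per_tile=256,
                     max_total_tokens=3328):
    """Estimate soft vision tokens for one image under a LLaVA-style
    tiling scheme. tile_size and tokens_per_tile are illustrative
    assumptions; only max_total_tokens reflects Cohere's stated cap."""
    tiles_x = math.ceil(width / tile_size)   # tiles across
    tiles_y = math.ceil(height / tile_size)  # tiles down
    tiles = tiles_x * tiles_y + 1            # +1 for a global thumbnail tile
    return min(tiles * tokens_per_tile, max_total_tokens)

# A letter-size page scanned at roughly 150 DPI:
print(image_token_cost(1275, 1650))  # hits the 3,328-token cap
```

Under these assumed numbers, a full scanned page tiles into a 3x4 grid plus a thumbnail, which is exactly where a per-image cap starts to matter for context-window budgeting.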
Cohere said it trained the visual model in three stages: vision-language alignment, supervised fine-tuning (SFT) and post-training reinforcement learning with human feedback (RLHF).
"This approach enables the mapping of image encoder features to the language model embedding space," the company said. "In contrast, during the SFT stage, we simultaneously trained the vision encoder, the vision adapter and the language model on a diverse range of multimodal tasks."
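The staged recipe can be summarized as a small configuration sketch. The module names and the choice of what is frozen in the alignment and RLHF stages follow common LLaVA-style practice and are assumptions; the source only confirms that all three components train together during SFT.

```python
# Hypothetical sketch of a three-stage vision-language training recipe.
# Freeze/train choices for alignment and RLHF are assumptions based on
# common LLaVA-style setups, not Cohere's published configuration.
STAGES = [
    {
        "name": "vision-language alignment",
        "trainable": {"vision_adapter"},  # learn to map image features into LLM space
        "frozen": {"vision_encoder", "language_model"},
    },
    {
        "name": "supervised fine-tuning (SFT)",
        # Per Cohere, all three components are trained simultaneously here.
        "trainable": {"vision_encoder", "vision_adapter", "language_model"},
        "frozen": set(),
    },
    {
        "name": "RLHF post-training",
        "trainable": {"language_model"},  # align outputs with human preferences
        "frozen": {"vision_encoder", "vision_adapter"},
    },
]

def trainable_modules(stage_name):
    """Return which modules receive gradient updates in a given stage."""
    for stage in STAGES:
        if stage["name"] == stage_name:
            return stage["trainable"]
    raise KeyError(stage_name)

print(trainable_modules("supervised fine-tuning (SFT)"))
```

The design intuition: the cheap alignment stage teaches the adapter to translate image features before the expensive full-model SFT pass, and RLHF then tunes behavior without disturbing the learned visual mapping.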
Visual AI for the enterprise
Benchmark tests showed Command A Vision outperforming other models with similar visual capabilities.
Cohere pitted Command A Vision against OpenAI's GPT-4.1, Meta's Llama 4 Maverick, and Mistral's Pixtral Large and Mistral Medium 3 in nine benchmark tests. The company did not mention whether it tested the model against Mistral's OCR-focused API, Mistral OCR.
Command A Vision outscored the other models in tests such as ChartQA, OCRBench, AI2D and TextVQA. Overall, Command A Vision had an average score of 83.1%, compared to 78.6% for GPT-4.1, 80.5% for Llama 4 Maverick and 78.3% for Mistral Medium 3.
Most large language models (LLMs) are now multimodal, meaning they can generate or understand visual media such as photos or videos. But enterprises generally work with more graphical documents, such as charts and PDFs, so extracting information from these unstructured data sources often proves difficult.
With the rise of deep research, demand for models capable of reading, analyzing and even downloading unstructured data has increased.
Cohere also said it is offering Command A Vision on an open-weights basis, in the hope that enterprises looking to move away from closed or proprietary models will start using its products. So far, there is some developer interest.
