Thursday, May 8, 2025

Understanding the visual knowledge of language models


You’ve probably heard that a picture is worth a thousand words, but can a large language model (LLM) get the picture if it has never actually seen images?

As it turns out, language models trained only on text can have a solid understanding of the visual world. They can write image-rendering code that generates complex scenes with intriguing objects and compositions – and even when that knowledge isn’t put to good use at first, LLMs can refine their images. Researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) observed this when they prompted language models to self-correct their code for different images, with the systems improving their simple clip-art drawings with each query.

The visual knowledge of these language models comes from how concepts such as shapes and colors are described across the Internet, whether in language or code. Given a prompt like “draw a parrot in the jungle,” the LLM draws on what it has read in such descriptions before. To assess how much visual knowledge LLMs have, the CSAIL team developed a ‘vision checkup’ for them: using their ‘Visual Aptitude Dataset’, they tested the models’ abilities to draw, recognize, and self-correct these concepts. By collecting each final draft of these illustrations, the researchers trained a computer vision system that identifies the content of real photos.
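
For readers curious what the “draw” step of such a checkup might look like in practice, here is a minimal, hypothetical sketch rather than the researchers’ released code; `query_llm` is a stand-in for whichever text-only model API is used, and the prompt wording is an assumption:

```python
# Hypothetical sketch of the "draw" step of a vision checkup for an LLM.
# `query_llm` is assumed to take a prompt string and return the model's
# reply as a string; it is not part of any specific library.

DRAW_PROMPT = (
    "Write self-contained Python code using matplotlib that draws {concept}. "
    "Reply with code only, no explanations."
)

def request_drawing_code(query_llm, concept: str) -> str:
    """Ask a text-only LLM for image-rendering code describing a concept."""
    return query_llm(DRAW_PROMPT.format(concept=concept))

# Example: code = request_drawing_code(query_llm, "a parrot in the jungle")
```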

“We essentially train a vision system without directly using any visual data,” says Tamar Rott Shaham, co-author of a paper on the work and a postdoc in electrical engineering and computer science (EECS) at MIT CSAIL. “Our team queried language models to write image-rendering code to generate data for us, and then trained the vision system to evaluate natural images. We were inspired by the question of how visual concepts are represented through other mediums, like text. To express their visual knowledge, LLMs can use code as a common ground between text and vision.”

To build this dataset, the researchers first prompted the models to generate code for various shapes, objects, and scenes. They then compiled that code to render simple digital illustrations, such as a row of bicycles, showing that the LLMs understood spatial relationships well enough to draw the two-wheelers in a horizontal row. As another example, a model generated a car-shaped cake by combining two random concepts. A language model also produced a glowing light bulb, indicating its ability to create visual effects.
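
Turning the model’s reply into an illustration can be as simple as executing it and saving the resulting figure. The sketch below assumes the reply is self-contained matplotlib code and is purely illustrative; running untrusted model output with `exec()` is unsafe outside a sandbox:

```python
# Illustrative only: render LLM-generated matplotlib code to an image file.

import matplotlib
matplotlib.use("Agg")               # render without a display
import matplotlib.pyplot as plt

def render_to_png(generated_code: str, out_path: str) -> None:
    """Run the model's drawing code and save whatever figure it produced."""
    exec(generated_code, {"plt": plt})   # assumes code draws via plt
    plt.savefig(out_path)
    plt.close("all")
```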

“Our work shows that when you query an LLM (without multimodal pre-training) to create an image, it knows much more than it seems,” says co-author, EECS PhD student, and CSAIL member Pratyusha Sharma. “Let’s say you asked it to draw a chair. The model knows other things about this piece of furniture that it may not have rendered immediately, so users can query the model to improve the visual it produces with each iteration. Surprisingly, the model can iteratively enrich the drawing by improving the rendering code to a significant extent.”
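
That iterative improvement loop can be approximated in a few lines. The sketch below reuses the hypothetical `query_llm` helper from earlier, and the feedback prompt is an assumption, not the wording used in the study:

```python
# Hypothetical sketch of iteratively asking an LLM to improve its own
# rendering code; prompt wording and helper names are assumptions.

IMPROVE_PROMPT = (
    "Here is Python matplotlib code meant to draw {concept}:\n\n{code}\n\n"
    "Improve the code so the drawing looks more like {concept}. "
    "Reply with the full revised code only."
)

def refine_drawing(query_llm, concept: str, code: str, rounds: int = 3) -> list:
    """Repeatedly ask the model to improve its own rendering code."""
    versions = [code]
    for _ in range(rounds):
        code = query_llm(IMPROVE_PROMPT.format(concept=concept, code=code))
        versions.append(code)
    return versions   # each version can be rendered and compared to the last
```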

The researchers collected these illustrations and used them to train a computer vision system that can recognize objects in real photos, despite never having seen one before. With this synthetic, text-generated data as its only reference point, the system outperformed vision systems trained on other procedurally generated datasets that were made with real photos.
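
One simple way to train such a recognizer on the rendered illustrations alone might look like the following sketch, which assumes the images are saved in per-concept folders under a hypothetical "renders/" directory; the study’s actual training recipe may differ:

```python
# Minimal sketch: train an image classifier on LLM-rendered illustrations.
# Paths, model choice, and hyperparameters are assumptions for illustration.

import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

# Each subfolder of renders/ holds the drawings of one visual concept.
dataset = datasets.ImageFolder("renders/", transform=transform)
loader = DataLoader(dataset, batch_size=32, shuffle=True)

model = models.resnet18(num_classes=len(dataset.classes))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

model.train()
for epoch in range(5):
    for images, labels in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        optimizer.step()
```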

The CSAIL team believes it could also be beneficial to pair the tacit visual knowledge of LLMs with the artistic capabilities of other AI tools, such as diffusion models. Systems like Midjourney sometimes lack the know-how to consistently tweak the finer details of an image, making it difficult for them to handle requests such as reducing the number of cars pictured or placing an object behind another. If an LLM sketched out the requested change for the diffusion model beforehand, the resulting edit could be more satisfactory.

The irony, as Rott Shaham and Sharma acknowledge, is that LLMs sometimes fail to recognize the very concepts they can draw. This became clear when the models incorrectly identified human re-creations of images within the dataset. Such diverse representations of the visual world likely triggered the language models’ misconceptions.

Although the models struggled to perceive these abstract depictions, they demonstrated creativity by drawing the same concepts differently each time. When the researchers repeatedly asked LLMs to draw concepts such as strawberries and arcades, the models produced pictures from varying angles and with differing shapes and colors, hinting that they may have genuine internal representations of visual concepts rather than simply reciting examples they had seen before.

The CSAIL team believes this procedure could serve as a baseline for evaluating how well a generative AI model can train a computer vision system. The researchers also want to expand the range of tasks on which they challenge language models. As for the recent study, the MIT group notes that they do not have access to the training sets of the LLMs they used, making it difficult to further investigate the origins of their visual knowledge. In the future, they intend to explore training an even better vision model by letting the LLM work with it directly.

Sharma and Rott Shaham are joined on the paper by former CSAIL affiliate Stephanie Fu ’22, MNG ’23, and EECS PhD students Manel Baradad, Adrián Rodríguez-Muñoz ’22, and Shivam Duggal, who are all CSAIL affiliates, as well as MIT Associate Professor Phillip Isola and Professor Antonio Torralba. Their work was supported in part by a grant from the MIT-IBM Watson AI Lab, a LaCaixa Fellowship, the Zuckerman STEM Leadership Program, and a Viterbi Fellowship. They are presenting their paper this week at the IEEE/CVF Computer Vision and Pattern Recognition Conference.
