Try taking a photo of each of North America’s roughly 11,000 tree species and you’ll have just a fraction of the millions of photos in nature image datasets. These huge collections of snapshots – ranging from butterflies to humpback whales – are an excellent research tool for ecologists because they provide evidence of organisms’ unique behaviors, rare conditions, migration patterns, and responses to pollution and other forms of climate change.
While nature image datasets are comprehensive, they are not yet as useful as they could be. Searching through these databases and downloading the images most relevant to your hypothesis is time-consuming. You’d be better off with an automated research assistant – or perhaps artificial intelligence systems called multimodal vision language models (VLMs). Because they are trained on both text and images, these models find it easier to pick out finer details, such as the specific trees in the background of a photo.
But how well can VLMs assist wildlife researchers with image retrieval? To find out, a team from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), University College London, iNaturalist, and elsewhere designed a performance test. Each VLM’s task: locate and reorganize the most relevant results in the team’s “INQUIRE” dataset, consisting of 5 million wildlife photos and 250 search prompts from ecologists and other biodiversity experts.
Looking for that special frog
Through these evaluations, the researchers found that larger, more advanced VLMs, which are trained on far more data, can sometimes give researchers the results they want to see. The models performed quite well on straightforward queries about visual content, such as identifying debris on a reef, but performed much worse on queries requiring expert knowledge, such as identifying specific biological conditions or behaviors. For example, VLMs easily found examples of jellyfish on the beach, but struggled with more technical prompts such as “green frog axanthism,” a condition that limits a frog’s ability to make its skin yellow.
Their findings indicate that the models need much more domain-specific training data to handle difficult queries. MIT PhD student Edward Vendrow, a CSAIL affiliate who co-led work on the new dataset paper, believes that with exposure to more informative training data, VLMs could one day become great research assistants. “We want to build search systems that provide exactly the results that scientists expect when monitoring biodiversity and analyzing climate change,” Vendrow says. “Multimodal models don’t yet fully understand more complex scientific language, but we believe INQUIRE will be an important benchmark for tracking progress in understanding scientific terminology, and ultimately help researchers automatically find exactly the images they need.”
The team’s experiments showed that larger models tended to perform better on both simpler and more intricate searches, thanks to their expansive training data. First, the researchers used the INQUIRE dataset to see whether VLMs could narrow a pool of 5 million images down to the 100 most relevant results (a task known as “ranking”). For straightforward queries such as “a reef with artificial structures and debris,” relatively large models such as “SigLIP” found matching images, while smaller CLIP models struggled. According to Vendrow, larger VLMs are “just starting to be useful” for ranking more difficult queries.
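To make the ranking step concrete, here is a minimal sketch of how a CLIP- or SigLIP-style model can rank a large image pool against a text query: embed the query and every image into a shared space, then sort by cosine similarity. The model checkpoint, the file layout, and the assumption that embeddings are computed on the fly (rather than precomputed, as would be necessary for millions of images) are illustrative choices, not details from the paper.

```python
# A sketch of CLIP-style text-to-image ranking using the open_clip library.
# Model choice and file paths are illustrative assumptions.
import glob

import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

query = "a reef with artificial structures and debris"
image_paths = sorted(glob.glob("photos/*.jpg"))  # stand-in for the 5M-image pool

with torch.no_grad():
    # Embed the text query once and normalize it.
    text_feat = model.encode_text(tokenizer([query]))
    text_feat /= text_feat.norm(dim=-1, keepdim=True)

    # Embed each image (in practice these would be precomputed and cached).
    image_feats = []
    for path in image_paths:
        image = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
        feat = model.encode_image(image)
        image_feats.append(feat / feat.norm(dim=-1, keepdim=True))
    image_feats = torch.cat(image_feats)

# On normalized embeddings, cosine similarity is just a dot product.
scores = (image_feats @ text_feat.T).squeeze(1)
top = scores.topk(min(100, len(image_paths)))
for rank, idx in enumerate(top.indices.tolist(), start=1):
    print(f"{rank:3d}  {scores[idx].item():.3f}  {image_paths[idx]}")
```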
Vendrow and his colleagues also assessed how well multimodal models could re-rank those 100 results, reorganizing which images were most relevant to a search. In these tests, even huge LLMs trained on more curated data, such as GPT-4o, struggled: its precision score was just 59.6 percent, the highest score achieved by any model.
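The re-ranking step can be sketched as asking a multimodal LLM to judge, image by image, how well each of the ranking stage’s top-100 candidates matches the query, then sorting by that judgment. The sketch below assumes the OpenAI Python client and GPT-4o; the prompt wording and the 0–10 scoring scheme are illustrative, not the paper’s actual protocol.

```python
# A sketch of LLM-based re-ranking: score each candidate image's relevance to
# the query with a multimodal model, then sort by that score.
# Prompt and scoring scheme are illustrative assumptions.
import base64

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment


def relevance_score(query: str, image_path: str) -> float:
    """Ask the multimodal model for a 0-10 relevance judgment."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": (
                    f"On a scale of 0 to 10, how well does this photo match the "
                    f"query: '{query}'? Reply with a single number."
                )},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    try:
        return float(response.choices[0].message.content.strip())
    except (TypeError, ValueError):
        return 0.0  # treat unparseable replies as not relevant


def rerank(query: str, candidate_paths: list[str]) -> list[str]:
    """Re-order the ranking stage's top-100 candidates by the LLM's judgment."""
    scored = [(relevance_score(query, path), path) for path in candidate_paths]
    return [path for _, path in sorted(scored, reverse=True)]
```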
The researchers presented these results earlier this month at the Neural Information Processing Systems (NeurIPS) conference.
Querying INQUIRE
The INQUIRE dataset includes queries based on discussions with ecologists, biologists, oceanographers, and other experts about the types of images they would look for, including animals’ unique physical conditions and behaviors. A team of annotators then spent 180 hours searching the iNaturalist dataset with these queries, carefully combing through approximately 200,000 results to label 33,000 matches that fit the prompts.
For example, the annotators used queries such as “hermit crab using plastic waste as a shell” and “California condor tagged with the green number ‘26’” to identify subsets of the larger image dataset that depict these specific, rare events.
The researchers then used the same queries to test how well VLMs could retrieve iNaturalist images. The annotator labels revealed when models struggled to understand scientists’ keywords, as their results included images previously flagged as irrelevant to the search. For example, VLM results for “fire-scarred redwoods” sometimes included images of trees without any fire scars.
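One common way to score a retrieval run against such annotator labels is precision@k: the fraction of a model’s top-k returned images that annotators marked as relevant. The snippet below is only an illustrative metric with made-up image IDs, not necessarily the paper’s exact evaluation protocol.

```python
# Precision@k: of the top-k images a model returns, how many did annotators
# label as relevant to the query? Image IDs below are made up for illustration.
def precision_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int = 100) -> float:
    top_k = ranked_ids[:k]
    hits = sum(1 for image_id in top_k if image_id in relevant_ids)
    return hits / k if k else 0.0


# Example: a model whose top four results contain two labeled matches.
ranked = ["img_017", "img_502", "img_233", "img_091"]
relevant = {"img_017", "img_091", "img_445"}
print(precision_at_k(ranked, relevant, k=4))  # 0.5
```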
“This is a careful curation of data, focused on capturing real-world examples of scientific inquiry across a variety of research areas in ecology and environmental science,” says Sara Beery, the Homer A. Burnell Career Development Assistant Professor at MIT, CSAIL principal investigator, and co-senior author of the work. “It has proven essential to expanding our understanding of current VLM capabilities in these potentially impactful scientific settings. It also outlines gaps in current research that we can now work to address, particularly for complex compositional queries, technical terminology, and the fine-grained, subtle differences that delineate the categories of interest for our collaborators.”
“Our findings suggest that some vision models are already precise enough to help wildlife scientists retrieve some images, but many tasks are still too difficult for even the largest, best-performing models,” Vendrow says. “While INQUIRE focuses on ecology and biodiversity monitoring, the wide variety of its queries means that VLMs that perform well on INQUIRE are likely to excel at analyzing large image collections in other observation-intensive fields.”
Inquiring minds want to see
Continuing their project, the researchers are working with iNaturalist to develop a query system that better helps scientists and other curious minds find the images they actually want to see. Their working demo allows users to filter searches by species, enabling quicker discovery of relevant results such as, say, the diverse eye colors of cats. Vendrow and co-author Omiros Pantazis, who recently completed his PhD at University College London, also aim to improve the re-ranking system by refining current models to provide better results.
University of Pittsburgh associate professor Justin Kitzes highlights INQUIRE’s ability to uncover secondary data. “Biodiversity datasets are quickly becoming too large for any individual scientist to sift through,” says Kitzes, who was not involved in the research. “This paper highlights a difficult and unsolved problem: how to effectively search such data with questions that go beyond simply ‘who is here’ and instead ask about individual characteristics, behavior, and interspecies interactions. Being able to efficiently and accurately uncover these more complex phenomena in biodiversity image data will be critical to fundamental science and to real-world ecological and environmental impacts.”
Vendrow, Pantazis, and Beery wrote the paper with iNaturalist software engineer Alexander Shepard, University College London professors Gabriel Brostow and Kate Jones, University of Edinburgh associate professor and co-author Oisin Mac Aodha, and University of Massachusetts at Amherst assistant professor Grant Van Horn, who served as a senior co-author. Their work was supported, in part, by the Generative AI Laboratory at the University of Edinburgh, the U.S. National Science Foundation/Natural Sciences and Engineering Research Council of Canada Global Center for Artificial Intelligence and Biodiversity Change, a research grant from the Royal Society, and the Biome Health Project funded by the World Wildlife Fund United Kingdom.