To train more powerful language models, researchers use massive datasets that combine diverse information drawn from thousands of online sources.
However, as these datasets are combined and recombined across multiple collections, critical information about their origins and the constraints on how they can be used often becomes lost or confused.
This not only raises legal and ethical concerns, but can also hurt model performance. For example, if a dataset is miscategorized, someone training a machine-learning model for a specific task could unknowingly use data that is not intended for that task.
In addition, data from unknown sources may contain biases that cause a model to make unfair predictions when it is deployed.
To improve data transparency, a multidisciplinary team of researchers from MIT and elsewhere began a systematic audit of more than 1,800 text datasets on popular hosting sites. They found that more than 70 percent of these datasets omitted some licensing information, while about 50 percent contained information that was erroneous.
Based on these observations, they developed a user-friendly tool called the Data Provenance Explorer, which automatically generates easy-to-read summaries of a dataset’s creators, sources, licenses, and permitted uses.
“Tools like these can help regulators and practitioners make informed decisions about AI deployment and promote responsible AI development,” says Alex “Sandy” Pentland, MIT professor, leader of the Human Dynamics Group at MIT Media Lab, and co-author of a recent open access article about the project.
The Data Provenance Explorer can help AI practitioners build more effective models by allowing them to select training datasets that match their model’s intended purpose. In the long run, this could improve the accuracy of AI models in real-world situations, such as those used to evaluate loan applications or respond to customer inquiries.
“One of the best ways to understand the capabilities and limitations of an AI model is to understand the data it was trained on. When you have misattribution and ambiguity about where the data came from, you have a serious transparency problem,” says Robert Mahari, a graduate student in the MIT Human Dynamics Group, a JD candidate at Harvard Law School, and a co-author of the paper.
Mahari and Pentland are joined on the paper by co-author Shayne Longpre, a graduate student in the Media Lab; Sara Hooker, who leads the research lab Cohere for AI; and others at MIT, the University of California at Irvine, the University of Lille in France, the University of Colorado at Boulder, Olin College, Carnegie Mellon University, Contextual AI, MLCommons, and Tidelift. The research is published today in Nature Machine Intelligence.
Focus on fine-tuning
Researchers often use a technique called fine-tuning to improve the capabilities of a large language model that will be deployed for a specific task, such as answering questions. For fine-tuning, they carefully build curated datasets designed to boost the model’s performance on that one task.
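As a rough illustration only, a minimal fine-tuning run with the widely used Hugging Face Transformers library might look like the sketch below; the model name, dataset, and hyperparameters are placeholder assumptions, not the datasets examined in the study.

```python
# A minimal fine-tuning sketch (assumed Hugging Face workflow; the model and
# dataset below are illustrative placeholders, not those audited in the paper).
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

dataset = load_dataset("imdb")  # placeholder task-specific dataset
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    # Convert raw text into token IDs the model can consume.
    return tokenizer(batch["text"], truncation=True, padding="max_length")

tokenized = dataset.map(tokenize, batched=True)

# Start from a pretrained checkpoint and adapt it to the downstream task.
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetune-out", num_train_epochs=1),
    train_dataset=tokenized["train"].shuffle(seed=42).select(range(1000)),
)
trainer.train()  # updates the pretrained weights on the curated task data
```

The point of the sketch is simply that fine-tuning data is small, curated, and task-specific; these are exactly the kinds of datasets whose licensing the audit examined.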
The MIT researchers focused on these fine-tuning datasets, which are often developed by researchers, academic organizations, or companies and licensed for specific uses.
When crowdsourcing platforms aggregate such datasets into the larger collections that practitioners use for fine-tuning, some of the original license information is often left behind.
“These licenses should be meaningful and enforceable,” Mahari says.
For example, if the licensing terms for a dataset are wrong or missing, someone might spend a lot of money and time developing a model that they later have to delete because some of the training data contains private information.
“People can end up training models without even understanding the capabilities, concerns, and risks of those models that ultimately emerge from the data,” Longpre adds.
To begin the study, the researchers formally defined data provenance as a combination of the acquisition, creation, and licensing heritage of a dataset, as well as its characteristics. They then developed a structured audit procedure to track the data provenance of over 1,800 text datasets from popular online repositories.
After discovering that more than 70 percent of these datasets carried “unspecified” licenses that omitted much information, the researchers worked backward to fill in the gaps. Through their efforts, they reduced the share of datasets with “unspecified” licenses to about 30 percent.
Their work also revealed that the actual licenses were often more restrictive than those assigned by the repositories.
In addition, they found that nearly all dataset creators were concentrated in the global north, which could limit a model’s capabilities if it is trained for deployment in a different region. For example, a Turkish-language dataset created primarily by people in the United States and China might not capture any culturally significant aspects, Mahari explains.
“We fool ourselves into thinking that data sets are more diverse than they actually are,” he says.
Interestingly, the researchers also observed a sharp increase in restrictions placed on datasets created in 2023 and 2024. This may be due to researchers’ concerns that their datasets could be used for unintended commercial purposes.
User-friendly tool
To help others obtain this information without having to conduct a manual audit, the researchers created the Data Provenance Explorer. In addition to sorting and filtering datasets based on specific criteria, the tool allows users to download a data provenance card that provides a concise, structured overview of a dataset’s characteristics.
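As a hypothetical sketch (not the Explorer’s actual schema or API), a provenance card can be thought of as a small structured record that practitioners could filter against their intended use:

```python
# Hypothetical illustration of a structured provenance record; the field names
# and values are assumptions for this example, not the Data Provenance
# Explorer's real format.
from dataclasses import dataclass

@dataclass
class ProvenanceCard:
    name: str                # dataset identifier
    creators: list[str]      # who built the dataset
    sources: list[str]       # where the text was originally collected
    license: str             # e.g. "CC-BY-4.0", "non-commercial", "unspecified"
    allowed_uses: list[str]  # e.g. ["research"] or ["research", "commercial"]

cards = [
    ProvenanceCard("qa-corpus", ["Univ. A"], ["web forums"],
                   "CC-BY-4.0", ["research", "commercial"]),
    ProvenanceCard("dialog-set", ["Lab B"], ["chat transcripts"],
                   "non-commercial", ["research"]),
]

# Keep only datasets whose stated license permits the intended commercial use.
usable = [card for card in cards if "commercial" in card.allowed_uses]
for card in usable:
    print(f"{card.name}: {card.license} (sources: {', '.join(card.sources)})")
```

The design point is that once creators, sources, and license terms are captured in a consistent structure, selecting data that matches a model’s intended purpose becomes a simple filtering step rather than a manual audit.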
“We hope this will be a step not only toward understanding the situation, but also helping people make more informed decisions about the data they train on,” Mahari says.
In the future, the researchers want to expand their analysis to investigate the provenance of multimodal data, including video and speech. They also want to investigate how the terms of service on websites that serve as data sources are reflected in the datasets.
As this work continues, the researchers are also reaching out to regulators to discuss their findings and the unique copyright implications of fine-tuning data.
“From the very beginning, when people create and share datasets, we need transparency and information about the data’s provenance to make it easier for others to draw conclusions,” Longpre says.
“Many proposed policy interventions assume that we can correctly attribute and identify licenses associated with data, and this work first shows that we can’t, and then significantly improves the available provenance information,” says Stella Biderman, executive director of EleutherAI, who was not involved in the work. “In addition, Section 3 provides a significant legal discussion. This is very valuable for machine learning practitioners outside of companies large enough to have dedicated legal teams. Many people who want to build AI systems for the public good are currently quietly grappling with how to deal with data licensing because the internet is not designed in a way that makes it easy to determine data provenance.”