Saturday, March 7, 2026

New model predicts how molecules dissolve in various solvents


Using machine learning, MIT chemical engineers have created a computational model that can predict how well any given molecule will dissolve in an organic solvent, a key step in the synthesis of nearly every pharmaceutical. This type of prediction could make it much easier to develop new ways of producing drugs and other useful molecules.

The new model, which predicts how much of a solute will dissolve in a particular solvent, should help chemists choose the right solvent for any given reaction in their synthesis, the researchers say. Common organic solvents include ethanol and acetone, and there are hundreds of others that can also be used in chemical reactions.

“Predicting solubility really is a rate-limiting step in synthetic planning and manufacturing of chemicals, especially drugs, so there has been a longstanding interest in making better predictions of solubility,” says Lucas Attia, an MIT graduate student and one of the lead authors of the new study.

The researchers have made their model available, and many companies and laboratories have already started using it. The model could be particularly useful for identifying solvents that are less hazardous than some of the most commonly used industrial solvents, the researchers say.

“There are some solvents that are known to dissolve most things. They are really useful, but they are damaging to the environment and damaging to people, so many companies require that you minimize the amount of those solvents that you use,” says Jackson Burns, an MIT graduate student who is also a lead author of the paper. “Our model is extremely useful in identifying the next-best solvent, which is hopefully much less damaging to the environment.”

William Green, the Hoyt Hottel Professor of Chemical Engineering and director of the MIT Energy Initiative, is the senior author of the study, which appears today. Patrick Doyle, the Robert T. Haslam Professor of Chemical Engineering, is also an author of the paper.

Solving solubility

The new model grew out of a project that Attia and Burns worked on together in an MIT course on applying machine learning to chemical engineering problems. Traditionally, chemists have predicted solubility with a tool known as the Abraham Solvation Model, which estimates a molecule's overall solubility by adding up the contributions of the chemical structures within the molecule. While these predictions are useful, their accuracy is limited.
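To make the additive idea concrete, here is a minimal Python sketch of an Abraham-type linear free-energy relationship, in which a log-scale solvation property is estimated as a constant plus a sum of solute-descriptor times solvent-coefficient terms (log SP = c + eE + sS + aA + bB + vV). The class and function names are illustrative, and every numeric value is a made-up placeholder rather than a fitted parameter for any real solute or solvent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SoluteDescriptors:
    E: float  # excess molar refraction
    S: float  # dipolarity/polarizability
    A: float  # hydrogen-bond acidity
    B: float  # hydrogen-bond basicity
    V: float  # McGowan characteristic volume


@dataclass(frozen=True)
class SolventCoefficients:
    c: float  # intercept
    e: float
    s: float
    a: float
    b: float
    v: float


def abraham_log_property(solute: SoluteDescriptors,
                         solvent: SolventCoefficients) -> float:
    """Return log SP = c + eE + sS + aA + bB + vV."""
    return (solvent.c
            + solvent.e * solute.E
            + solvent.s * solute.S
            + solvent.a * solute.A
            + solvent.b * solute.B
            + solvent.v * solute.V)


# Placeholder numbers only (assumptions for illustration, not real data).
example_solute = SoluteDescriptors(E=1.0, S=0.5, A=0.0, B=0.5, V=1.0)
example_solvent = SolventCoefficients(c=0.2, e=0.5, s=-1.0, a=0.0, b=-4.0, v=4.0)
example_log_sp = abraham_log_property(example_solute, example_solvent)
```

Because every term is a fixed product of one solute descriptor and one solvent coefficient, a model of this form cannot capture interactions the descriptors do not encode, which is one way to read the limited accuracy mentioned above.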

In the past few years, researchers have begun using machine learning to try to make more accurate solubility predictions. Before Burns and Attia began working on their new model, the state-of-the-art model for predicting solubility was one developed in Green's lab in 2022.

That model, known as SolProp, works by predicting a set of related properties and combining them, using thermodynamics, to ultimately predict solubility. However, the model has difficulty predicting solubility for solutes that it has not seen before.

“For drug and chemical discovery pipelines where you are developing a new molecule, you want to be able to predict ahead of time what its solubility looks like,” says Attia.

One reason existing solubility models have not worked well is that there was no comprehensive dataset on which to train them. However, in 2023 a new dataset called BigSolDB was released, which compiled data from nearly 800 published papers, including solubility information for about 800 molecules dissolved in about 100 organic solvents that are commonly used in synthetic chemistry.

Attia and Burns decided to try training two different types of models on this data. Both models represent the chemical structures of molecules using numerical representations known as embeddings, which incorporate information such as the number of atoms in a molecule and which atoms are bonded to which other atoms. Models can then use these representations to predict a variety of chemical properties.
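As a rough illustration of what a numerical representation of a molecule can look like, the Python sketch below turns a small hand-written molecular graph into a fixed-length feature vector. The feature choices, the element vocabulary, and the function name are assumptions made for this example; they are far simpler than the descriptors or learned embeddings used by the models in the study.

```python
from collections import Counter

# Hypothetical hand-written heavy-atom graph for ethanol (CH3-CH2-OH);
# hydrogens are left implicit.
ETHANOL_ATOMS = ["C", "C", "O"]
ETHANOL_BONDS = [(0, 1), (1, 2)]  # pairs of atom indices that are bonded

# Illustrative element vocabulary (an assumption, not from the study).
ELEMENTS = ("C", "N", "O")


def toy_embedding(atoms, bonds):
    """Return [n_atoms, n_bonds, count of each element..., max bonds on one atom]."""
    element_counts = Counter(atoms)
    degree = Counter()
    for i, j in bonds:
        degree[i] += 1
        degree[j] += 1
    max_degree = max(degree.values(), default=0)
    per_element = [element_counts[e] for e in ELEMENTS]
    return [len(atoms), len(bonds)] + per_element + [max_degree]


ethanol_vector = toy_embedding(ETHANOL_ATOMS, ETHANOL_BONDS)
```

A vector like this is computed once from the structure and never changes, which makes it a crude example of a fixed representation; a learned representation would instead be adjusted during training.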

One of the models used in this study, known as fastprop and developed by Burns and others in Green's lab, incorporates “static embeddings.” This means that the model already knows the embedding for each molecule before it begins any kind of analysis.

The other model, Chemprop, learns an embedding for each molecule during training, at the same time that it learns to associate the features of the embedding with a trait such as solubility. This model, developed across multiple MIT labs, has already been used for tasks such as antibiotic discovery, lipid nanoparticle design, and predicting chemical reaction rates.

The researchers trained both types of models on more than 40,000 data points from BigSolDB, including information on the effects of temperature, which plays a significant role in solubility. Then they tested the models on about 1,000 solutes that had been withheld from the training data. They found that the models' predictions were two to three times more accurate than those of SolProp, the previous best model, and that the new models were especially accurate at predicting variations in solubility due to temperature.
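Withholding whole solutes, rather than individual measurements, is what makes this a test of performance on unseen molecules. A minimal Python sketch of such a split follows; the record fields, the placeholder values, and the function name are hypothetical and are not taken from BigSolDB or from the researchers' code.

```python
import random


def split_by_solute(records, test_fraction=0.2, seed=0):
    """Split records so each solute appears only in train or only in test."""
    solutes = sorted({r["solute"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(solutes)
    # Hold out at least one solute, even for very small datasets.
    n_test = max(1, round(len(solutes) * test_fraction))
    test_solutes = set(solutes[:n_test])
    train = [r for r in records if r["solute"] not in test_solutes]
    test = [r for r in records if r["solute"] in test_solutes]
    return train, test


# Placeholder records: five made-up solutes, each measured at two temperatures.
# The "log_s" values are arbitrary numbers, not experimental data.
RECORDS = [
    {"solute": f"solute_{name}", "solvent": "ethanol",
     "temperature_K": temp, "log_s": 0.0}
    for name in "ABCDE"
    for temp in (298.15, 318.15)
]

train_records, test_records = split_by_solute(RECORDS)
```

Splitting at random by row would instead let the same solute appear at one temperature in training and at another in testing, which would overstate how well a model handles new molecules.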

“Being able to accurately reproduce those small variations in solubility due to temperature, even when the overarching experimental noise is very large, was a really positive sign that the network had correctly learned an underlying solubility prediction function,” says Burns.

Accurate predictions

The researchers had expected that the model based on Chemprop, which is able to learn new representations as it goes along, would make more accurate predictions. To their surprise, they found that the two models performed essentially the same. This suggests that the main limitation on their performance is the quality of the data, and that the models are performing as well as is theoretically possible given the data they use, the researchers say.

“Chemprop should always outperform any static embedding when you have sufficient data,” says Burns. “We were surprised to see that the static and learned embeddings were statistically indistinguishable in performance across all the different subsets, which indicates that the data limitations present in this space dominated the model performance.”

The models could become more accurate, the researchers say, if better training and testing data were available, ideally data obtained by one person or by a group of people all trained to perform the experiments in the same way.

“One of the big limitations of using these kinds of compiled datasets is that different labs use different methods and experimental conditions when they perform solubility tests. That contributes to this variability between different datasets,” says Attia.

Because the model based on fastprop makes its predictions faster and has code that is easier for other users to adapt, the researchers decided to make that one, known as FastSolv, available to the public. Multiple pharmaceutical companies have already begun using it.

“There are applications throughout the drug discovery pipeline,” says Burns. “We are also excited to see, outside of formulation and drug discovery, where people may use this model.”

The research was funded, in part, by the U.S. Department of Energy.
