MIT's Machine Learning Model Predicts Molecular Solubility, Accelerating Drug Development


For decades, the question of which solvent will dissolve a given molecule has been a quiet bottleneck in drug manufacturing — a small decision with enormous consequences for whether a medicine ever reaches a patient. Two MIT graduate students have answered it more precisely than anyone before, building a freely available machine learning model that is two to three times more accurate than its predecessor and already in use across the pharmaceutical industry. Their work is a reminder that progress often comes not from chasing novelty, but from gathering enough honest data to finally see a familiar problem clearly.

Solvent selection has long been a costly guessing game in drug synthesis — get it wrong and a reaction fails, get it right and a new medicine moves closer to patients.
Previous prediction methods were imprecise, and machine learning approaches stalled for years simply because no comprehensive training dataset existed.
The 2023 release of BigSolDB, compiled from nearly 800 papers, gave researchers the raw material (more than 40,000 data points) to finally train a model worth deploying.
FastSolv outperformed the previous benchmark, but a surprising finding emerged: two very different model architectures performed essentially identically, suggesting that data quality, not algorithm design, is now the limiting factor.
Pharmaceutical companies are already using the tool, and its ability to identify less toxic substitute solvents positions it as a force for cleaner, safer industrial chemistry.

At MIT, graduate students Lucas Attia and Jackson Burns set out to solve a problem that has quietly constrained drug manufacturing for decades: predicting which solvent will best dissolve a given molecule. Before any drug can be synthesized, chemists must choose a solvent — the medium in which reactions occur. The wrong choice means failure or prohibitive cost. The right one clears a major hurdle toward bringing a medicine to market.

Working in William Green's lab, Attia and Burns built FastSolv, a machine learning model that predicts how much of any molecule will dissolve in hundreds of common organic solvents. It is two to three times more accurate than SolProp, the previous best method — also developed in Green's lab just three years earlier. Multiple pharmaceutical companies have already begun using it.

The breakthrough depended on data that hadn't existed until recently. Earlier solubility predictions relied on the Abraham Solvation Model, which tallied contributions from different chemical structures within a molecule. It worked imperfectly, and machine learning couldn't do better without a comprehensive training set. That changed in 2023 with the release of BigSolDB, compiled from nearly 800 published papers and covering roughly 800 molecules across more than 100 solvents. Attia and Burns trained on more than 40,000 data points, including temperature effects.

They tested two architectures: one using fixed molecular representations, and one that learned its own during training. Conventional wisdom favored the latter. Both performed essentially identically. The finding pointed to something important: the models had likely reached the limits of what the available data could teach them. The bottleneck wasn't the algorithm but the quality of the data. Different labs use different methods, introducing noise into any compiled dataset. Better predictions, Attia noted, would require measurements taken by a single team under identical conditions.

The team released FastSolv freely, and the decision has already paid off beyond drug discovery. Some of the most effective industrial solvents are also the most hazardous, damaging to workers and the environment. FastSolv helps chemists find the next-best option: nearly as effective, far less toxic. The work was published in Nature Communications in August 2025. What began as a course project has become an active tool across the pharmaceutical industry, proof that solving a concrete, long-lived problem can matter more than chasing the cutting edge.

At MIT, a pair of graduate students set out to solve a problem that has quietly constrained drug manufacturing for decades: figuring out which liquid will best dissolve a given molecule. The answer matters more than it might sound. Before any drug can be synthesized, chemists must choose a solvent—the medium in which chemical reactions happen. Get it wrong, and the reaction fails or becomes prohibitively expensive. Get it right, and you've cleared a major hurdle in bringing a new medicine to market.

Lucas Attia and Jackson Burns, working in William Green's lab at MIT, built a machine learning model that predicts solubility with striking accuracy. Their tool, called FastSolv, can tell a chemist how much of any given molecule will dissolve in any common organic solvent—ethanol, acetone, or hundreds of others used in industrial chemistry. The model is two to three times more accurate than the previous best method, a tool called SolProp developed in Green's lab just three years earlier. And unlike many academic tools that languish in papers, this one is already in use. Multiple pharmaceutical companies have begun deploying it.

The breakthrough rested on data that didn't exist until recently. For years, chemists predicted solubility using a method called the Abraham Solvation Model, which essentially adds up the contributions of different chemical structures within a molecule. It worked, but imperfectly. Machine learning promised better predictions, yet researchers lacked the raw material to train such models—a comprehensive dataset of how thousands of molecules actually behave in different solvents. In 2023, that changed. A dataset called BigSolDB was released, compiled from nearly 800 published papers and containing solubility information for roughly 800 molecules dissolved in more than 100 commonly used organic solvents. Attia and Burns trained their models on more than 40,000 data points from this collection, including information on how temperature affects solubility.
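The additive idea behind the older approach can be shown with a toy calculation. The sketch below is only an illustration of "adding up contributions": the fragment names and numbers are invented for the example and are not Abraham model parameters, and the function name is hypothetical.

```python
# Hypothetical per-fragment contributions to log10(solubility).
# These numbers are made up for illustration; they are NOT Abraham parameters.
FRAGMENT_CONTRIBUTION = {
    "hydroxyl": 0.50,
    "aromatic_ring": -0.75,
    "methyl": -0.25,
}


def additive_log_solubility(fragment_counts, intercept=0.0):
    """Sum (count * contribution) over every fragment in the molecule."""
    return intercept + sum(
        FRAGMENT_CONTRIBUTION[name] * count
        for name, count in fragment_counts.items()
    )


# A toy molecule with one ring, one hydroxyl group, and two methyl groups.
toy_molecule = {"aromatic_ring": 1, "hydroxyl": 1, "methyl": 2}
estimate = additive_log_solubility(toy_molecule)
print(estimate)
```

A scheme like this treats each fragment independently, which is one way to see why such estimates can only be approximate: it has no term for how fragments interact with each other or with a particular solvent.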

They tested two different architectures. One, called FastProp, used what researchers call static embeddings—numerical representations of molecular structure that the model knew before analysis began. The other, ChemProp, learned its embeddings during training, adapting its internal representations as it went. Conventional wisdom suggested ChemProp should win. It didn't. Both models performed essentially identically, and both crushed the previous benchmark. The surprise finding pointed to something important: the models had likely reached the limits of what the available data could teach them. The bottleneck wasn't the algorithm. It was the quality and consistency of the training data itself.

This limitation is real and specific. Different laboratories use different methods and equipment when measuring solubility. One group's experimental conditions differ from another's. These variations introduce noise into any compiled dataset. Attia noted that better predictions would require data obtained by a single team, all trained to perform experiments identically. That work hasn't been done yet.
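A small simulation can illustrate why noisy measurements would make two different models look alike. This is a sketch under stated assumptions, not the researchers' analysis: the noise level of 0.5 log units, the value range, and both "models" are invented for the example.

```python
import math
import random

random.seed(0)

NOISE_SD = 0.5   # assumed lab-to-lab measurement scatter, in log units
N = 20000        # number of simulated measurements

# Simulated "true" log-solubilities, and what a compiled dataset would
# record for them: the truth plus inter-lab measurement noise.
true_values = [random.uniform(-4.0, 1.0) for _ in range(N)]
measured = [t + random.gauss(0.0, NOISE_SD) for t in true_values]


def rmse(predictions, targets):
    """Root-mean-square error between predictions and targets."""
    squared = sum((p - t) ** 2 for p, t in zip(predictions, targets))
    return math.sqrt(squared / len(targets))


# Model A: a hypothetical oracle that predicts the true value exactly.
oracle = true_values
# Model B: a hypothetical good model with a small error of its own.
good_model = [t + random.gauss(0.0, 0.1) for t in true_values]

rmse_oracle = rmse(oracle, measured)
rmse_good = rmse(good_model, measured)
print(round(rmse_oracle, 3), round(rmse_good, 3))
```

Scored against the noisy measurements, even the oracle's error sits near the assumed noise level, and the imperfect model lands almost on top of it. When measurement scatter dominates, benchmark scores stop distinguishing between models.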

Burns and Attia chose to release the FastProp-based model, which they named FastSolv, because it was the faster of the two and its code is easier for others to modify and deploy. The decision to make it freely available has already paid dividends. Beyond drug discovery, the tool addresses a pressing concern in modern manufacturing: the need to replace hazardous solvents. Some solvents dissolve almost anything, making them invaluable in industry, but they damage the environment and harm workers. Companies increasingly face pressure to minimize their use. FastSolv helps chemists identify the next-best solvent, one that works nearly as well but carries far less toxicity. In that sense, the model does more than accelerate drug development. It opens a path toward safer, cleaner chemistry.
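The "next-best solvent" idea amounts to ranking candidates by predicted solubility after setting aside the hazardous ones. The sketch below shows that selection step only. The solvent names, predicted values, and hazard flags are placeholders invented for the example, and the function is hypothetical rather than part of FastSolv.

```python
# Hypothetical predicted solubilities (log10 mol/L) for a single solute.
# In practice these would come from a solubility model; the values here
# are made up for illustration.
predicted_log_s = {
    "solvent_A": -0.2,
    "solvent_B": -0.4,
    "solvent_C": -1.5,
    "solvent_D": -0.6,
}

# Hypothetical hazard flags; real ones would come from a solvent safety guide.
hazardous = {"solvent_A"}


def next_best_solvent(predictions, flagged):
    """Return the highest-solubility solvent that is not flagged as hazardous."""
    allowed = {name: s for name, s in predictions.items() if name not in flagged}
    if not allowed:
        return None  # every candidate was flagged
    return max(allowed, key=allowed.get)


choice = next_best_solvent(predicted_log_s, hazardous)
print(choice)
```

Here the top-ranked solvent is excluded for safety, so the function returns the runner-up, which gives up only a little predicted solubility.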

The work was published in Nature Communications in August 2025, with William Green and Patrick Doyle listed as senior authors. What began as a course project on applying machine learning to chemical engineering has become a tool in active use across the pharmaceutical industry—a reminder that sometimes the most useful innovations come not from chasing the cutting edge, but from solving a concrete problem that everyone in a field has been living with.

Notable Quotes

Predicting solubility really is a rate-limiting step in synthetic planning and manufacturing of chemicals, especially drugs.

— Lucas Attia, MIT graduate student

Our model is extremely useful in being able to identify the next-best solvent, which is hopefully much less damaging to the environment.

— Jackson Burns, MIT graduate student

The Hearth Conversation

Another angle on the story

Inventor

Why does solubility matter so much in drug manufacturing? It seems like a technical detail.

Model

It's more than a detail; it's foundational. Before you can make a drug molecule, you need a liquid environment where the reaction can happen. Pick the wrong solvent and the reaction either doesn't work or becomes prohibitively expensive. It's a bottleneck that runs through pharmaceutical manufacturing.

Inventor

So chemists have been guessing?

Model

Not guessing, but working with limited tools. The old method, the Abraham model, was like estimating a building's cost by adding up the price of each brick. It gives you a ballpark, but it's not precise. Machine learning promised better, but nobody had the data to train it on.

Inventor

Until BigSolDB.

Model

Exactly. Suddenly there was a dataset covering roughly 800 molecules in more than 100 solvents, enough to train on more than 40,000 data points. That changed everything. But here's what surprised the researchers: two very different model architectures performed essentially identically. Conventional wisdom said the one that learns its own representations should have won.

Inventor

What does that tell you?

Model

That the models have hit a ceiling. They're as good as the data allows them to be. The inconsistency between different labs' experimental methods is now the limiting factor, not the algorithm. You could have a perfect model and it still wouldn't be more accurate until someone runs standardized experiments.

Inventor

So the work isn't finished.

Model

It's finished in the sense that it's useful right now—companies are using it. But it's unfinished in the sense that there's a clear path to making it better. Someone needs to do the unglamorous work of running solubility tests the same way, over and over, in one lab.

Inventor

And the safer solvent angle—that's almost a side benefit?

Model

Not a side benefit. For a lot of companies, it may be the main benefit. They're under growing pressure to reduce their use of toxic solvents. This tool lets them find alternatives that work almost as well. That's not just chemistry. That's environmental and occupational health.

Want the full story? Read the original at Phys.org ↗
