by Mrudula Kulkarni

The Data Bottleneck: Current Limitations of AI in Small Molecule Synthesis

AI drug discovery models are limited by data, not algorithms: publication bias, narrow chemical space coverage, and poor annotation.

Imagine a world where a computational model proposes a viable small molecule synthesis route in seconds, collapses years of bench work into an afternoon, and surfaces novel chemical space that no human team could traverse alone.

That world is close. But it is not yet here — and the reason is not the algorithm. It is the data.

Across the pharmaceutical industry, AI-powered tools for retrosynthesis prediction, reaction yield forecasting, and molecular property optimization have moved from academic curiosity to boardroom priority. Yet despite headline-grabbing partnerships and billion-dollar investments, the community increasingly acknowledges a hard truth: the models are only as good as the reactions that trained them.

This article examines — with scientific rigor and candor — where the data gaps lie, what the best research tells us, and what pharma leaders must do to navigate the bottleneck.


The Scale Illusion: Why "Big Data" in Chemistry Is Deceptively Small

What the Numbers Actually Show

Reaxys and SciFinder, the two dominant chemical reaction databases, together house roughly 15–17 million reactions accumulated over more than a century of published literature. This sounds large. By deep learning standards, it is modest.

For context, GPT-3's training data was filtered down from approximately 45 terabytes of compressed Common Crawl text. The entirety of Reaxys in structured reaction format occupies a fraction of that volume and, critically, is far less uniform in quality.



| Database | Approx. Size | Primary Source | Known Bias |
| --- | --- | --- | --- |
| Reaxys | ~14 million reactions | Journals + patents | Publication bias toward successful reactions |
| CSD (Cambridge Structural Database) | ~1.2 million crystal structures | Journals | Bias toward crystallizable compounds |
| USPTO (patent reactions) | ~3.7 million reactions | US patents | Commercially driven, narrow chemical space |
| ChEMBL | ~2.2 million bioassay data points | Medicinal chem literature | SAR-focused, limited synthetic context |


Table 1. Major chemical databases used in AI model training and their known limitations.

A landmark 2019 analysis in Nature Communications (Schwaller et al.) demonstrated that transformer-based models trained on USPTO data could predict reaction products with ~90% top-1 accuracy. But those same models failed substantially when tested on reactions from underrepresented chemical classes — precisely the novel scaffolds that medicinal chemists most need to explore.


The Three Core Data Problems

1. The Publication Bias Problem: Only Successes Make It to Print

Science publishes what works. A reaction that failed 47 times before succeeding on attempt 48 is typically reported in the literature as a clean, high-yielding transformation.

Negative reaction data — failed conditions, collapsed yields, unexpected byproducts — is systematically excluded from published literature. Yet this information is precisely what AI synthesis planning models require to learn decision boundaries.

A 2021 study in Science (Coley et al., MIT) found that models trained exclusively on "successful" reactions overestimated the feasibility of reactions involving sterically hindered substrates by 30–40%.


2. The Representation Problem: Chemical Space Is Poorly Sampled

The chemical universe of drug-like molecules is estimated at 10^60 compounds (Bohacek et al., Med. Res. Rev., 1996). Current databases cover perhaps 10^8 known compounds — a vanishingly small fraction.
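As a rough sense of scale, the coverage implied by those two estimates can be worked out directly. This is a back-of-the-envelope sketch that takes the quoted orders of magnitude at face value; the variable names are illustrative.

```python
from fractions import Fraction

# Order-of-magnitude estimates quoted in the text (not precise counts).
DRUG_LIKE_SPACE = 10**60   # estimated number of drug-like molecules
KNOWN_COMPOUNDS = 10**8    # compounds covered by current databases

# Exact rational arithmetic keeps the tiny ratio free of rounding error.
coverage = Fraction(KNOWN_COMPOUNDS, DRUG_LIKE_SPACE)

# The denominator is a power of ten, so its digit count gives the exponent.
exponent = len(str(coverage.denominator)) - 1
print(f"Databases cover about 1 part in 10^{exponent} of drug-like space")
```

Even if either estimate is off by several orders of magnitude, the conclusion is unchanged: known compounds are a negligible sample of the space a model is asked to generalize over.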

More critically, what has been synthesized reflects historical priorities: the medicinal chemistry of the 20th century heavily favored a narrow set of reactions (amide coupling, reductive amination, Suzuki coupling, Buchwald-Hartwig amination) that together account for a disproportionate share of drug synthesis literature.

A 2022 analysis in the Journal of Medicinal Chemistry (Brown & Boström) found that just 10 reaction types account for approximately 67% of bond-forming steps in published drug syntheses. AI models trained on this corpus are, consequently, biased toward recommending those same transformations, even when more creative routes would yield better molecular properties.



| Reaction Type | % Frequency in Drug Synthesis Literature |
| --- | --- |
| Amide bond formation | 22.4% |
| N-alkylation | 8.9% |
| Buchwald-Hartwig amination | 7.6% |
| Suzuki-Miyaura coupling | 7.1% |
| Reductive amination | 6.3% |
| All other reaction types | 47.7% |


Table 2. Frequency of reaction types in medicinal chemistry literature (adapted from Brown & Boström, J. Med. Chem., 2022).
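To make the concentration in Table 2 concrete, the running total of the five named reaction types can be tallied as follows. This is a small illustrative sketch using only the percentages listed in the table; the function name is an assumption.

```python
# Percent frequencies copied from Table 2.
reaction_share = {
    "Amide bond formation": 22.4,
    "N-alkylation": 8.9,
    "Buchwald-Hartwig amination": 7.6,
    "Suzuki-Miyaura coupling": 7.1,
    "Reductive amination": 6.3,
}

def cumulative_share(shares):
    """Return (name, running total) pairs, largest share first."""
    running = 0.0
    result = []
    for name, pct in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
        running += pct
        result.append((name, round(running, 1)))
    return result

top_five = cumulative_share(reaction_share)
for name, total in top_five:
    print(f"{name:<28} {total:5.1f}%")
```

By the table's own figures, five reaction types cover just over half of the corpus, so a model trained on it sees each of the remaining reaction types far less often.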

This creates a feedback loop: AI recommends familiar reactions, chemists execute them, literature captures them, and AI trains on them.


3. The Annotation Quality Problem: Garbage In, Garbage Out

Even when reaction data exists, its machine-readability is often poor. A 2020 benchmark study in ACS Central Science (Schwaller et al.) found that over 20% of reactions extracted from patent databases contained errors in atom-mapping — meaning the AI could not correctly learn which atoms from reactants became which atoms in products.
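A first-pass screen for this kind of error can be purely syntactic. The sketch below assumes reactions are stored as atom-mapped reaction SMILES of the form `reactants>agents>products`, and only checks that map numbers are consistent between the two sides. It cannot tell whether a consistent mapping is chemically correct; that would need a cheminformatics toolkit. The function name and example strings are illustrative.

```python
import re
from collections import Counter

# Captures the map number at the end of a bracket atom, e.g. "[CH3:1]" -> "1".
MAP_PATTERN = re.compile(r":(\d+)\]")

def atom_map_issues(reaction_smiles):
    """Return a list of mapping problems found (empty list if none)."""
    parts = reaction_smiles.split(">")
    if len(parts) != 3:
        return ["expected the form 'reactants>agents>products'"]
    reactants, _agents, products = parts
    reactant_maps = Counter(MAP_PATTERN.findall(reactants))
    product_maps = Counter(MAP_PATTERN.findall(products))

    issues = []
    for label, count in sorted(product_maps.items(), key=lambda kv: int(kv[0])):
        if count > 1:
            issues.append(f"map {label} appears {count} times in products")
        if label not in reactant_maps:
            issues.append(f"map {label} in products has no reactant atom")
    for label, count in sorted(reactant_maps.items(), key=lambda kv: int(kv[0])):
        if count > 1:
            issues.append(f"map {label} appears {count} times in reactants")
    return issues

# Methanol + acetyl chloride -> methyl acetate, with a consistent mapping.
good = "[CH3:1][OH:2].[Cl:3][C:4](=[O:5])[CH3:6]>>[CH3:1][O:2][C:4](=[O:5])[CH3:6]"
# Product oxygen carries a map number that no reactant atom has.
bad = "[CH3:1][OH:2]>>[CH3:1][O:7]"

print(atom_map_issues(good))
print(atom_map_issues(bad))
```

Reactant atoms that do not appear in the product (the chloride here) are not flagged, since leaving groups are expected.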

Inconsistent solvent nomenclature, missing stoichiometry, absent temperature data, and non-standardized reagent descriptions compound the problem. A reaction run at "room temperature" in a Florida laboratory and at "room temperature" in a northern European facility may involve a 10–15°C difference — a difference that can be meaningful for sensitive transformations.
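One way to contain these inconsistencies is to normalize free-text fields at capture time and to record explicitly when a value was never measured. The sketch below is a minimal illustration: the synonym table is hypothetical and tiny, and the choice to store "room temperature" as unmeasured rather than substituting a guessed number is a design assumption, not an established standard.

```python
import re

# Hypothetical synonym table; a production version would be curated and far larger.
SOLVENT_SYNONYMS = {
    "thf": "tetrahydrofuran",
    "tetrahydrofuran": "tetrahydrofuran",
    "dcm": "dichloromethane",
    "ch2cl2": "dichloromethane",
    "methylene chloride": "dichloromethane",
    "dichloromethane": "dichloromethane",
}

# Phrases that name a temperature without reporting a measurement.
AMBIENT_TERMS = {"rt", "r.t.", "room temperature", "ambient"}

TEMP_PATTERN = re.compile(r"(-?\d+(?:\.\d+)?)\s*°?\s*C\b", re.IGNORECASE)

def normalize_solvent(raw):
    """Map a free-text solvent name to a canonical name, or None if unknown."""
    return SOLVENT_SYNONYMS.get(raw.strip().lower())

def parse_temperature(raw):
    """Return (celsius, measured). Ambient phrases yield (None, False)."""
    text = raw.strip().lower()
    if text in AMBIENT_TERMS:
        return (None, False)
    match = TEMP_PATTERN.search(raw)
    if match:
        return (float(match.group(1)), True)
    return (None, False)

print(normalize_solvent(" DCM "))
print(parse_temperature("room temperature"))
print(parse_temperature("-78 °C"))
```

Returning `None` for unrecognized solvents routes them to manual curation instead of letting a new spelling silently become a new category.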


Retrosynthesis AI: Where the Bottleneck Bites Hardest

AI-driven retrosynthesis tools, including IBM RXN for Chemistry, AstraZeneca's AiZynthFinder, and Recursion's platform, have demonstrated genuine utility for well-precedented chemical space. The challenge emerges at the frontier.

A 2023 benchmarking study in Nature Machine Intelligence (Tu & Coley) compared five leading retrosynthesis AI platforms against expert chemists on a diverse test set of 100 target molecules. Key findings:

  1. For molecules with high structural similarity to training data (Tanimoto coefficient > 0.6): AI matched expert chemists in route quality 71% of the time.
  2. For novel scaffolds (Tanimoto coefficient < 0.3): AI matched expert route quality only 34% of the time.
  3. Expert chemists maintained approximately 68% route quality on novel scaffolds, which suggests the data gap handicaps AI far more than it does human expertise when targets are novel.
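The Tanimoto coefficient used to define these buckets is the size of the intersection of two fingerprints divided by the size of their union. The sketch below computes it on fingerprints represented as sets of "on" bit indices and applies the 0.6 and 0.3 cutoffs quoted above. Taking the maximum similarity to any training molecule is an assumption made here for illustration, and the toy fingerprints are hypothetical.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union

def max_similarity_to_training(query_bits, training_fps):
    """Similarity of the query to its nearest neighbour in the training set."""
    return max((tanimoto(query_bits, fp) for fp in training_fps), default=0.0)

def novelty_bucket(similarity, high=0.6, low=0.3):
    """Bucket a similarity score using the thresholds quoted in the text."""
    if similarity > high:
        return "well-precedented"
    if similarity < low:
        return "novel scaffold"
    return "intermediate"

# Toy fingerprints: each set lists the indices of bits that are switched on.
training = [{1, 2, 3, 4, 5}, {2, 3, 8, 9}]
familiar_query = {1, 2, 3, 4, 6}
novel_query = {2, 10, 11, 12}

for query in (familiar_query, novel_query):
    sim = max_similarity_to_training(query, training)
    print(f"{sim:.2f} -> {novelty_bucket(sim)}")
```

In practice the fingerprints would come from a cheminformatics toolkit; the arithmetic on top of them is the same.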

The implication for drug discovery is direct: AI in synthesis is most useful when least needed (for known chemical space) and least useful when most needed (for novel, first-in-class targets).


Platforms like Schrödinger and Recursion already operate at the edge of this exact data problem in oncology.



The Assay Data Problem: Connecting Structure to Activity

Beyond synthesis prediction, AI-driven drug design depends on high-quality structure-activity relationship (SAR) data. Here, the data bottleneck takes a different but equally serious form.

High-throughput screening campaigns generate millions of data points — but they overwhelmingly test compounds in biochemical assays that may not reflect cellular or in vivo biology. A 2022 PLOS Computational Biology study estimated that fewer than 15% of published IC50 values are reproducible across independent laboratories, due to assay variability, cell line drift, and inconsistent compound handling.
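Inter-laboratory agreement is usually easier to judge on a logarithmic scale, where pIC50 = -log10(IC50 in mol/L). The sketch below converts IC50 values reported in nanomolar and checks whether independent measurements fall within a chosen spread. The 0.5 log-unit tolerance (roughly a three-fold difference) is an illustrative assumption, not a threshold taken from the study cited above.

```python
import math

def pic50(ic50_nanomolar):
    """Convert an IC50 in nM to pIC50, i.e. -log10 of the molar concentration."""
    if ic50_nanomolar <= 0:
        raise ValueError("IC50 must be positive")
    # -log10(x * 1e-9) rewritten as 9 - log10(x) to avoid a tiny intermediate value.
    return 9.0 - math.log10(ic50_nanomolar)

def agrees_across_labs(ic50_values_nm, max_spread_log_units=0.5):
    """True if all measurements lie within the allowed pIC50 spread."""
    values = [pic50(v) for v in ic50_values_nm]
    return max(values) - min(values) <= max_spread_log_units

# Hypothetical measurements of the same compound from different laboratories.
consistent = [100.0, 150.0, 200.0]   # nM
inconsistent = [10.0, 500.0]         # nM

print(agrees_across_labs(consistent))
print(agrees_across_labs(inconsistent))
```

Filtering or down-weighting compounds that fail such a check is one way to keep irreproducible labels from dominating a training set.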


Generative AI and the Hallucination Risk

Generative AI in pharma introduces a distinct data problem: the models can propose molecules that are synthetically inaccessible, despite appearing chemically reasonable on paper.

Language models trained on SMILES strings and graph neural networks trained on molecular graphs can generate novel structures, but without hard constraints derived from synthetic feasibility data they frequently propose molecules whose routes violate basic chemical logic or require reagents that are not commercially available.

This is not a failure of the algorithm — it is a failure of training data to encode synthetic constraints sufficiently.
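One simple hard constraint is to reject any proposed route whose starting materials are not in a catalog of purchasable building blocks. The sketch below is a minimal illustration of that filter. The `ProposedRoute` type, the catalog contents, and the assumption that all SMILES strings have already been canonicalized so that string equality is meaningful are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ProposedRoute:
    target: str                 # SMILES of the molecule to make
    starting_materials: tuple   # SMILES of required building blocks

def unavailable_reagents(route, catalog):
    """List the starting materials that are missing from the catalog."""
    return [s for s in route.starting_materials if s not in catalog]

def filter_feasible(routes, catalog):
    """Keep only routes whose starting materials are all purchasable."""
    return [r for r in routes if not unavailable_reagents(r, catalog)]

# Hypothetical catalog of purchasable building blocks (canonical SMILES assumed).
catalog = {"CO", "CC(=O)Cl", "c1ccccc1N"}

routes = [
    ProposedRoute(target="COC(C)=O", starting_materials=("CO", "CC(=O)Cl")),
    ProposedRoute(target="COC(C)=O", starting_materials=("CO", "CC(=O)[At]")),
]

feasible = filter_feasible(routes, catalog)
print(len(feasible), "of", len(routes), "routes use only catalog reagents")
print(unavailable_reagents(routes[1], catalog))
```

A catalog lookup says nothing about whether the chemistry itself will work, but it removes one class of infeasible proposals cheaply before a chemist reviews them.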


What the Best Organizations Are Doing: Emerging Solutions

Federated Learning and Pre-Competitive Data Sharing

The Pistoia Alliance and the MELLODDY consortium (10 pharmaceutical companies, including Novartis, Janssen, and Bayer) have pioneered federated machine learning approaches in chemistry that allow models to train across proprietary datasets without exposing raw data. A 2021 paper in Nature Machine Intelligence reported that federated models outperformed single-company models on 57% of prediction tasks.
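The core idea can be shown with federated averaging, a common baseline in which each participant trains locally and shares only model weights, which a coordinator combines in proportion to local dataset size. The sketch below is a generic illustration of that baseline; it is not a description of the MELLODDY protocol, and the weights and sample counts are made up.

```python
def federated_average(client_updates):
    """Combine local model weights, weighting each client by its sample count.

    client_updates: list of (weights, n_samples), where weights is a list of floats.
    Only weights and counts are shared; raw reaction data never leaves a client.
    """
    total = sum(n for _, n in client_updates)
    if total == 0:
        raise ValueError("no training samples reported")
    dim = len(client_updates[0][0])
    averaged = [0.0] * dim
    for weights, n in client_updates:
        if len(weights) != dim:
            raise ValueError("all clients must share the same model shape")
        for i, w in enumerate(weights):
            averaged[i] += w * n / total
    return averaged

# Two hypothetical companies with different amounts of proprietary data.
updates = [
    ([1.0, 2.0], 100),   # company A: weights after local training, 100 samples
    ([3.0, 4.0], 300),   # company B: weights after local training, 300 samples
]
global_weights = federated_average(updates)
print(global_weights)
```

Real deployments add secure aggregation and governance controls on top of this step, which is where much of the practical difficulty lies.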


High-Throughput Experimentation (HTE) as Data Generators

Pfizer, Merck, and AstraZeneca have all published on the use of high-throughput experimentation platforms to systematically generate reaction condition data — including failed conditions — at scale. This creates curated, internally consistent datasets that bypass the publication bias of the literature.
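What distinguishes an HTE dataset from literature data is that every planned condition is enumerated up front and every outcome is stored, including failures. The sketch below illustrates that pattern with a full-factorial plate design. The reagent lists, the 5% yield cutoff for labeling a failure, and the field names are hypothetical.

```python
from itertools import product

# Hypothetical screening dimensions for a cross-coupling; real plates vary.
CATALYSTS = ["Pd(OAc)2", "Pd2(dba)3"]
BASES = ["K2CO3", "Cs2CO3", "NaOtBu"]
SOLVENTS = ["dioxane", "toluene"]
TEMPERATURES_C = [60, 100]

def build_plate():
    """Enumerate every combination so no condition is silently skipped."""
    return [
        {"catalyst": c, "base": b, "solvent": s, "temperature_c": t, "yield_pct": None}
        for c, b, s, t in product(CATALYSTS, BASES, SOLVENTS, TEMPERATURES_C)
    ]

def record_result(well, yield_pct, failure_threshold=5.0):
    """Store the measured yield and label the outcome; failures are kept, not dropped."""
    well["yield_pct"] = yield_pct
    well["outcome"] = "failed" if yield_pct < failure_threshold else "success"
    return well

plate = build_plate()
record_result(plate[0], 0.0)    # a failed condition is still a data point
record_result(plate[1], 72.5)

failed_wells = [w for w in plate if w.get("outcome") == "failed"]
print(len(plate), "wells planned;", len(failed_wells), "failure recorded so far")
```

Because the grid is generated programmatically, the dataset records which conditions were tried and did not work, which is exactly the information the literature omits.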


Reaction Condition Prediction and Uncertainty Quantification

Groups at MIT (Coley lab) and ETH Zurich (Reymond group) are developing models that not only predict synthesis routes but also quantify uncertainty — flagging when a proposed step falls far outside the training distribution. This "know what you don't know" capability is increasingly viewed as essential for responsible deployment of AI synthesis planning tools.
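One widely used way to approximate this behaviour is ensemble disagreement: several models are trained independently, and a large spread in their predictions is treated as a sign that the input lies outside familiar territory. The sketch below is a generic illustration rather than the method of either group named above. The toy models, the single numeric feature, and the 0.15 threshold are hypothetical.

```python
from statistics import mean, pstdev

def ensemble_prediction(models, features):
    """Return the mean prediction and the spread across ensemble members."""
    predictions = [model(features) for model in models]
    return mean(predictions), pstdev(predictions)

def needs_expert_review(models, features, max_std=0.15):
    """Flag a prediction when ensemble members disagree beyond the threshold."""
    _, spread = ensemble_prediction(models, features)
    return spread > max_std

# Toy stand-ins for trained models: they agree near x = 0 (the "training region")
# and diverge as x moves away from it.
models = [
    lambda x: 0.5 + 0.1 * x,
    lambda x: 0.5 + 0.3 * x,
    lambda x: 0.5 - 0.2 * x,
]

in_distribution = 0.1
far_from_training = 2.0

print(needs_expert_review(models, in_distribution))
print(needs_expert_review(models, far_from_training))
```

Routing flagged steps to a chemist instead of presenting them with the same confidence as well-precedented ones is the practical payoff of the uncertainty estimate.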


Better data infrastructure is the fix.

Whether it actually moves the ROI needle is the harder, still-unanswered question.



A Data Quality Framework for Pharma Leaders


| Data Dimension | Current State | Target State | Key Action |
| --- | --- | --- | --- |
| Volume | Moderate (millions of reactions) | High | HTE programs, data partnerships |
| Negative data coverage | Very low (<5% of reactions) | Moderate (>20%) | ELN integration, pre-competitive sharing |
| Annotation consistency | Low to moderate | High | Standardized ontologies, automated curation |
| Chemical space diversity | Narrow (reaction-type biased) | Broad | Active learning, scaffold-diverse libraries |
| Reproducibility | Low (~15–30%) | High (>80%) | Assay standardization, inter-lab validation |


Table 3. Data quality framework dimensions for AI-readiness in small molecule R&D.
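A dimension such as negative data coverage can be tracked with a very small audit over exported ELN records. The sketch below assumes a hypothetical export in which each record carries an `outcome` field, and compares the observed share of failed reactions with the >20% target from Table 3.

```python
TARGET_NEGATIVE_FRACTION = 0.20  # "Moderate (>20%)" target state from Table 3

def negative_data_coverage(records):
    """Fraction of records whose outcome is labeled as a failure."""
    if not records:
        raise ValueError("no records to audit")
    failed = sum(1 for r in records if r["outcome"] == "failed")
    return failed / len(records)

# Hypothetical ELN export: nine successes and one recorded failure.
eln_export = [{"id": f"rxn-{i:03d}", "outcome": "success"} for i in range(1, 10)]
eln_export.append({"id": "rxn-010", "outcome": "failed"})

coverage = negative_data_coverage(eln_export)
meets_target = coverage > TARGET_NEGATIVE_FRACTION
print(f"Negative data coverage: {coverage:.0%} (target met: {meets_target})")
```

Running the same audit periodically gives a simple trend line for whether ELN integration is actually capturing failed experiments.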


The Road Ahead: Closing the Gap

The data bottleneck in AI small molecule synthesis is a solvable problem — but it will not be solved by any single company, algorithm, or database initiative alone. It requires a coordinated effort across academia, industry, and regulatory bodies to establish standards for data capture, incentivize negative data sharing, and deploy high-throughput experimentation at a meaningful scale.

The organizations that invest now in data infrastructure (standardizing ELNs, generating HTE datasets, and participating in pre-competitive consortia) will hold a durable competitive advantage as AI-driven pharmaceutical R&D matures. Those who wait for the models to improve without addressing the underlying data will find themselves training sophisticated algorithms on the same impoverished substrate.

The model is ready to learn. The question is whether we are ready to teach it properly.


Frequently Asked Questions

Q1. Can AI currently replace medicinal chemists in synthesis planning?

No. Current AI synthesis planning tools augment chemists for well-precedented reactions but consistently underperform human experts on novel chemical space. The data bottleneck is the primary reason. AI is best framed as a powerful co-pilot, not an autonomous navigator.


Q2. Is the lack of negative reaction data the single biggest limitation?

It is among the most impactful, but not the only one. Narrow chemical space coverage, annotation inconsistencies, and the disconnect between in vitro assay data and in vivo biology are equally significant challenges for AI in drug discovery.


Q3. How are federated learning approaches addressing data scarcity?

Federated learning allows multiple organizations to collaboratively train machine learning models for chemistry without sharing raw proprietary data. The MELLODDY consortium demonstrated measurable improvements in model performance using this approach, though significant governance and technical barriers remain.


Q4. What role does high-throughput experimentation play in solving the data problem?

HTE platforms can generate thousands of data points per week under controlled, consistent conditions — including systematically varied conditions that fail. This directly addresses both the volume and negative data gaps, making it one of the highest-ROI investments a pharma R&D organization can make in support of AI-driven drug design.


Q5. How should pharma leaders prioritize data investments for AI readiness?

Prioritize in this order: (1) ELN data standardization and machine-readable capture of internal experimental data, (2) HTE programs for systematic reaction condition data generation, (3) pre-competitive data-sharing consortia for areas of mutual non-competitive interest, (4) investment in uncertainty-quantifying AI architectures that communicate model confidence.

Author Profile

Mrudula Kulkarni

Managing Editor - Pharma Now
