What Are Common Data Sourcing Patterns in Pharmaceutical Research? (Part 3)
Summary:
- Data in public databases is often heterogeneous and requires curation before it can be used for AI training.
- Big pharma and biotech companies must implement well-designed data management strategies to maximize the use of internal experimental data and publicly available data.
Pharmaceutical data sourcing has evolved from a focus on internal experimentation toward a multimodal approach. Besides internal data, companies leverage licensed proprietary datasets, public repositories, real-world evidence, data from collaborative consortia, and other sources.
However, even the richest datasets remain exploratory if they cannot be integrated, validated, and reused. In AI-driven drug discovery, data generation is not the biggest problem; the true bottleneck is data orchestration.
Challenges with Internal Experimental Data
Internal data is often referred to as a ‘minefield’ because technical mishaps and fragmented storage can significantly derail drug pipelines. Information is frequently trapped in disconnected systems, such as separate Electronic Lab Notebooks (ELNs) or departmental spreadsheets, preventing researchers from seeing a continuous view of a drug’s lifecycle. This is a typical data silo and fragmentation problem.
Another common problem is a lack of standardization. Variations in naming conventions (e.g., ‘aspirin’ vs. ‘acetylsalicylic acid’) and in units of measurement across labs cause errors when data is aggregated for analysis. Furthermore, the structure of generated datasets is often dictated by the requirements of a specific experiment, making them unsuitable for general use.
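As a concrete illustration, the minimal Python sketch below harmonizes compound names and assay units before aggregation. The synonym table, unit factors, and field names are hypothetical placeholders; a real pipeline would resolve names against a registry (e.g., InChIKeys or internal compound IDs).

```python
# Minimal sketch: harmonizing compound names and assay units before aggregation.
# The synonym table and unit factors are illustrative placeholders only.

SYNONYMS = {
    "aspirin": "acetylsalicylic acid",
    "asa": "acetylsalicylic acid",
    "acetylsalicylic acid": "acetylsalicylic acid",
}

# Conversion factors to a canonical concentration unit (nM).
UNIT_TO_NM = {"nm": 1.0, "um": 1e3, "mm": 1e6, "m": 1e9}

def normalize_record(name: str, value: float, unit: str) -> dict:
    """Map a raw assay record onto a canonical name and unit."""
    canonical_name = SYNONYMS.get(name.strip().lower())
    if canonical_name is None:
        raise ValueError(f"Unknown compound name: {name!r}")
    factor = UNIT_TO_NM.get(unit.strip().lower().replace("µ", "u"))
    if factor is None:
        raise ValueError(f"Unknown unit: {unit!r}")
    return {"name": canonical_name, "ic50_nM": value * factor}

# Records from two labs that describe the same measurement differently:
print(normalize_record("Aspirin", 1.2, "uM"))  # {'name': 'acetylsalicylic acid', 'ic50_nM': 1200.0}
print(normalize_record("ASA", 1200.0, "nM"))   # same canonical form
```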
Big pharma also struggles with so-called ‘assay drift.’ Companies often mistakenly view decades of historical assays as an ML goldmine. In reality, changes in technicians, instruments, and software over time mean that historical values (such as IC50) may not be directly comparable to current results [1].
Databases typically store only summarized results, without the raw measurements or experimental context (e.g., exact protocols, plate maps) needed to validate or reproduce the findings. Moreover, high-dimensional data (genomics/proteomics) requires labor-intensive manual curation, which is highly susceptible to human error.
Federated Learning for Mutual Benefits
Data scattered across organizations, with access restricted for competitive reasons, does not lend itself to building robust training sets. To overcome data-sharing hurdles related to intellectual property (IP), companies use Federated Learning (FL). Projects like MELLODDY enable multiple organizations to train AI models on combined datasets without exposing their raw, proprietary data to competitors [2].
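To make the principle concrete, here is a minimal, self-contained sketch of federated averaging (FedAvg) on a shared linear model. It illustrates the general idea that only model weights leave each site; it is not MELLODDY's actual protocol, and the datasets are invented.

```python
import numpy as np

# Minimal sketch of federated averaging (FedAvg): each site trains locally,
# and only weight vectors are shared and averaged, never raw records.

rng = np.random.default_rng(0)

def local_update(weights, X, y, lr=0.1, epochs=10):
    """One site's training pass on its private data (plain linear regression)."""
    w = weights.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)  # mean-squared-error gradient
        w -= lr * grad
    return w

# Two organizations with private datasets drawn from the same relationship.
true_w = np.array([2.0, -1.0])
sites = []
for _ in range(2):
    X = rng.normal(size=(100, 2))
    y = X @ true_w + rng.normal(scale=0.1, size=100)
    sites.append((X, y))

global_w = np.zeros(2)
for _ in range(20):
    # Each site trains locally; the server only ever sees the weight vectors.
    local_ws = [local_update(global_w, X, y) for X, y in sites]
    global_w = np.mean(local_ws, axis=0)  # federated averaging step

print(global_w)  # approaches [2.0, -1.0] without pooling any raw data
```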
When data is scarce (e.g., in rare diseases), researchers turn to generative AI (GANs or VAEs) to simulate biological scenarios and expand existing datasets. However, this method does not guarantee optimal results and is prone to producing erratic predictions [3].
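For illustration, a compact variational autoencoder over placeholder descriptor vectors might look like the sketch below (PyTorch assumed; all dimensions and data are invented). The caveat above applies in full: synthetic records must be validated before anyone trusts them.

```python
import torch
import torch.nn as nn

# Minimal sketch of a VAE used to augment a small table of molecular
# descriptors. The 'real' data here is random noise standing in for a
# scarce descriptor table.

class DescriptorVAE(nn.Module):
    def __init__(self, n_features=16, latent_dim=4):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, 32), nn.ReLU())
        self.to_mu = nn.Linear(32, latent_dim)
        self.to_logvar = nn.Linear(32, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, n_features)
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        return self.decoder(z), mu, logvar

model = DescriptorVAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
real_data = torch.randn(64, 16)  # placeholder for a scarce descriptor table

for _ in range(200):
    recon, mu, logvar = model(real_data)
    recon_loss = nn.functional.mse_loss(recon, real_data)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    loss = recon_loss + 0.1 * kl
    opt.zero_grad(); loss.backward(); opt.step()

# Draw synthetic records from the learned latent space to expand the dataset.
with torch.no_grad():
    synthetic = model.decoder(torch.randn(32, 4))
print(synthetic.shape)  # torch.Size([32, 16])
```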
Utilizing Public Datasets Effectively
The open science movement, which advocates unrestricted access to scientific research and data, has significantly accelerated AI adoption in drug discovery. Public data provides the fuel for AI; however, it must be handled with discipline to be useful.
Openly shared high-throughput screening (HTS) data is frequently noisy and heavily biased toward ‘active’ compounds. Using it effectively requires researchers to apply resampling techniques or assemble ‘negative’ datasets so that models learn to distinguish hits from failures.
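A minimal sketch of the simplest such technique, undersampling the over-represented class, is shown below. The feature matrix and the 90/10 class split are invented for illustration; libraries such as imbalanced-learn provide richer strategies.

```python
import numpy as np

# Minimal sketch: rebalancing an HTS-style dataset in which 'active' labels
# dominate the published records, by undersampling the majority class.

rng = np.random.default_rng(42)

X = rng.normal(size=(1000, 8))            # placeholder compound features
y = (rng.random(1000) < 0.9).astype(int)  # ~90% 'active' (1), ~10% 'inactive' (0)

minority = np.flatnonzero(y == 0)
majority = np.flatnonzero(y == 1)
keep = rng.choice(majority, size=len(minority), replace=False)

idx = np.concatenate([minority, keep])
rng.shuffle(idx)
X_balanced, y_balanced = X[idx], y[idx]

print(np.bincount(y), "->", np.bincount(y_balanced))  # e.g. [101 899] -> [101 101]
```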
Transfer Learning (TL) is a primary strategy for addressing data scarcity. Models are first pre-trained on large public datasets to learn general chemical knowledge, then fine-tuned on smaller, high-quality proprietary datasets for specific targets.
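The following sketch shows the mechanics in PyTorch: pretrain a small network on abundant placeholder ‘public’ data, then freeze the shared backbone and fine-tune only the head on a small ‘proprietary’ set. Shapes and data are illustrative stand-ins, not a recommended architecture.

```python
import torch
import torch.nn as nn

# Minimal sketch of transfer learning: pretrain on a large public corpus,
# then freeze the shared 'backbone' and fine-tune only the head on a small
# proprietary set. All tensors here are random placeholders.

backbone = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 32), nn.ReLU())
head = nn.Linear(32, 1)
model = nn.Sequential(backbone, head)

def fit(model, params, X, y, epochs=50, lr=1e-3):
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(epochs):
        loss = nn.functional.mse_loss(model(X), y)
        opt.zero_grad(); loss.backward(); opt.step()

# Stage 1: pretraining on abundant public data (e.g., a ChEMBL-scale corpus).
X_pub, y_pub = torch.randn(5000, 128), torch.randn(5000, 1)
fit(model, model.parameters(), X_pub, y_pub)

# Stage 2: freeze the backbone; fine-tune only the head on scarce private data.
for p in backbone.parameters():
    p.requires_grad = False
X_prop, y_prop = torch.randn(200, 128), torch.randn(200, 1)
fit(model, head.parameters(), X_prop, y_prop)
```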
To integrate public data effectively, it must be made Findable, Accessible, Interoperable, and Reusable (FAIR principles). This requires using standardized ontologies (like OBI) to ensure the data is machine-readable.
Platforms such as the Therapeutics Data Commons (TDC) and Polaris offer AI-ready datasets and standard tasks, enabling teams to compare model performance against a common baseline.
Polaris has also introduced certification stamps identifying reliable datasets. Notably, Polaris is backed by major pharma players; the current steering committee includes representatives from Relay Therapeutics, Merck, Pfizer, Blueprint Medicines, Nimbus Therapeutics, AstraZeneca, Johnson & Johnson, Bayer, Novartis, and Valence Labs [4].
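For example, TDC exposes its benchmarks through a small Python package (assumes `pip install PyTDC`); the dataset name below comes from TDC's published ADME group, and the current catalogue is listed in the TDC documentation.

```python
# Minimal sketch of pulling an AI-ready benchmark from the Therapeutics
# Data Commons (PyTDC assumed to be installed).
from tdc.single_pred import ADME

data = ADME(name="Caco2_Wang")   # a cell-permeability regression task
split = data.get_split()         # standard train/valid/test split
print(split["train"].head())     # SMILES strings with measured labels
```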
| Data Source | Typical Downstream Limitations | Strategic Impact for AI Models |
|---|---|---|
| High-Throughput Screening (HTS) | High false-positive rates (singleton testing); noise from optical interference or compound instability. | Requires extensive confirmatory assays and data curation to identify true hits for model training. |
| Chemical Notations (SMILES/1D) | Syntactic fragility; lacks uniqueness (canonicalization varies); misses 3D spatial context such as bond angles. | Generative models may produce chemically impossible ‘garbage’ outputs; shift toward SELFIES for robustness (see the sketch after this table). |
| Omics Data (Genomics/Proteomics) | ‘Large p, small n’ problem (high dimensionality vs. small samples); missing values (‘dropouts’); batch effects. | High risk of overfitting; requires deep autoencoders or manifold learning to impute missing signals. |
| In Vitro Biological Assays | Highly conditional (depends on dose, genotype, and setup); cell line drift over time; inconsistent terminology. | Proxy measures often fail to translate to clinical in vivo efficacy or safety outcomes. |
| In Vivo / Animal Models | Small-scale datasets (hundreds vs. millions); descriptive character leads to inconsistent terminology (e.g., 60 terms for ‘kidney’). | Difficult for AI to infer robust relationships; results are often not generalizable to human physiology. |
| Electronic Health Records (EHR) | 80% unstructured (clinician notes); missing long-term information on patient health; data silos. | Limits the ability to verify LLM benefits in real-life settings; requires human-in-the-loop validation. |
| Historical Internal Data | Assay drift (shifting baselines due to tech changes); missing raw measurements or plate maps. | Training on summarized values creates an unstable foundation; models may fail when lab protocols change. |
| Scientific Literature / Patents | Unstructured and prone to AI hallucinations; Markush structures represent billions of potential molecules. | Requires specialized medical AI and Optical Chemical Structure Recognition to extract machine-readable data. |
Tab. 1. Typical downstream limitations of common data sources and their strategic implications for AI-driven drug discovery.
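To illustrate the canonicalization issue from Table 1: two syntactically different SMILES strings can describe the same molecule, and cheminformatics toolkits collapse them to one canonical form. The sketch below assumes the `rdkit` and `selfies` packages are installed.

```python
# Minimal sketch of SMILES canonicalization and SELFIES encoding.
from rdkit import Chem
import selfies

variants = [
    "CC(=O)Oc1ccccc1C(=O)O",   # aspirin, one valid SMILES
    "OC(=O)c1ccccc1OC(C)=O",   # aspirin again, atoms written in another order
]

canonical = {Chem.MolToSmiles(Chem.MolFromSmiles(s)) for s in variants}
print(canonical)  # a single canonical string: both inputs are the same molecule

# SELFIES is a robust alternative: every SELFIES string decodes to a valid molecule.
print(selfies.encoder(variants[0]))
```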
Why Many Datasets Remain Exploratory
A common pitfall in AI drug discovery is the accumulation of datasets that, despite their size, never progress beyond the exploratory phase. The primary reason for this stagnation is the lack of contextual information accompanying the data. Without context, data is of limited value: the missing information severely compromises the ability to derive actionable insights and make informed decisions.
Data without context is like a map without a legend. It might contain a lot of information, but its meaning remains elusive. This highlights the critical importance of ensuring that data is accompanied by comprehensive metadata, including details about experimental conditions, patient demographics, and analytical methods.
Many datasets are generated without sufficient attention to standardization or metadata capture, rendering them unsuitable for integration into AI models or for validation across different teams or institutions. This underscores the need for a paradigm shift towards a more holistic approach to data management, one that prioritizes not only the collection of data but also its contextualization and annotation.
Checklist: Signals of Unusable Datasets
Evaluating the usability of datasets requires a discerning eye. One critical signal is the lack of standardized formats. If data is scattered across different systems, each employing unique terminology and coding schemes, integration becomes a Sisyphean task.
Another red flag is incomplete or inconsistent metadata. Without clear documentation detailing experimental conditions, patient demographics, and data processing methods, the data becomes difficult to interpret and validate, limiting its value for training AI models.
Moreover, beware of datasets with significant missing values or outliers. These can introduce bias and skew the results of AI-driven models, leading to inaccurate predictions. Prioritize datasets that are well-documented, standardized, and complete.
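These signals can be partially automated. The sketch below turns the checklist into simple Python checks over a pandas table; the required metadata fields, thresholds, and column names are illustrative and should be tuned to the assay and modality at hand.

```python
import pandas as pd

# Minimal sketch of the checklist above as automated usability checks.

REQUIRED_METADATA = {"assay_protocol", "units", "date", "instrument"}

def usability_report(df: pd.DataFrame, metadata: dict, value_col: str) -> list[str]:
    issues = []
    missing_meta = REQUIRED_METADATA - metadata.keys()
    if missing_meta:
        issues.append(f"missing metadata fields: {sorted(missing_meta)}")
    missing_ratio = df[value_col].isna().mean()
    if missing_ratio > 0.2:
        issues.append(f"{missing_ratio:.0%} of {value_col!r} values are missing")
    # Flag gross outliers with a simple interquartile-range rule.
    q1, q3 = df[value_col].quantile([0.25, 0.75])
    iqr = q3 - q1
    outliers = df[(df[value_col] < q1 - 3 * iqr) | (df[value_col] > q3 + 3 * iqr)]
    if len(outliers) > 0:
        issues.append(f"{len(outliers)} extreme outliers in {value_col!r}")
    return issues

df = pd.DataFrame({"ic50_nM": [10.0, 12.0, 11.0, 13.0, None, 11.5, 9e6]})
print(usability_report(df, {"units": "nM", "date": "2024-05-01"}, "ic50_nM"))
# -> flags the missing metadata fields and the 9e6 outlier
```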
Best Practices for Data Governance in Drug Discovery
To ensure that AI initiatives in drug discovery are built on a solid foundation, organizations must embrace robust data management. A critical first step is to define a clear, AI-aligned data quality strategy with specific, measurable standards and quantify its business impact. This involves:
- mapping critical assets,
- establishing explicit ownership, validation rules, and stewardship structures within a unified data governance framework,
- capturing structured data through digital lab notebooks and automated Extract, Transform, Load (ETL) pipelines, supported by AI-driven data cleansing, to drive consistency and integrity (see the sketch after this list),
- leveraging contextual data (glossaries, dictionaries, lineage) and multi-level data catalogues for understanding and accessibility,
- integrating structured and unstructured data (e.g., clinical notes, scientific literature, imaging data) across platforms using modern data architectures.
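As a concrete sketch of the ETL-with-validation idea referenced above, the Python fragment below harmonizes units, attaches simple lineage fields, and quarantines rows that fail basic rules. All column names and rules are illustrative; a production pipeline would typically use a workflow engine and a governed schema rather than ad-hoc functions.

```python
import pandas as pd

# Minimal sketch of the transform/validate stages of an ETL pipeline.

def transform(raw: pd.DataFrame, source: str) -> pd.DataFrame:
    df = raw.rename(columns=str.lower)
    df["ic50_nM"] = df["ic50_um"] * 1e3                        # harmonize units to nM
    df["source"] = source                                      # lineage: origin of each row
    df["ingested_at"] = pd.Timestamp.now(tz="UTC").isoformat() # lineage: ingestion time
    return df.drop(columns=["ic50_um"])

def validate(df: pd.DataFrame) -> pd.DataFrame:
    ok = df["ic50_nM"].notna() & (df["ic50_nM"] > 0) & df["compound_id"].notna()
    if (~ok).any():
        print(f"quarantined {int((~ok).sum())} rows failing validation")
    return df[ok]

raw = pd.DataFrame({
    "compound_id": ["CPD-1", "CPD-2", None],
    "IC50_uM": [1.2, -0.5, 3.4],   # one impossible value, one anonymous row
})
curated = validate(transform(raw, source="lab_A"))
print(curated)
# In a full pipeline, `curated` would be loaded into a governed store
# (e.g., Parquet files or a warehouse table) rather than printed.
```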
By adhering to these principles, we can create a data ecosystem where data is easy to find, access, and use for research and development.
Reflecting on the Importance of Robust Data
As we’ve explored, the path to successful AI adoption in drug discovery hinges on more than data availability; it requires robust, high-quality datasets that are standardized, accessible, and well-annotated. Organizations should therefore treat data management as a strategic imperative, investing in the tools, technologies, and talent needed to unlock the full potential of AI in drug discovery and development.
As leaders in innovation, pharmaceutical companies must ask themselves some pressing questions.
- How can we overcome the challenges of data silos and ensure that data is shared more freely and openly?
- What steps can we take to standardize data formats and vocabularies to facilitate seamless integration across systems and platforms?
- How can we foster a culture of data quality and governance within the organization to ensure data is accurate, reliable, and trustworthy?
If you are unsure how to address these questions, we are ready to talk and help identify the solutions that best fit your company, regardless of its data maturity stage. Our experts will guide you through the whole data journey with a dedicated strategy.
Discover Data Solutions
Author: Martyna Piotrowska
Technical editing: Ida Kupś, Ardigen expert
Bibliography
1. Singh R. Making science run at the speed of thought: the reality of AI in drug discovery – Part 1. Drug Target Review. 2025 Nov 18 [cited 2026 Feb 5]. Available from: https://www.drugtargetreview.com/article/190632/making-science-run-at-the-speed-of-thought-the-reality-of-ai-in-drug-discovery-part-1/
2. Heyndrickx W, Mervin L, Morawietz T, et al. MELLODDY: cross-pharma federated learning at unprecedented scale unlocks benefits in QSAR without compromising proprietary information. J Chem Inf Model. 2024;64(7):2331-2344. https://doi.org/10.1021/acs.jcim.3c00799
3. Gangwal A, Ansari A, Ahmad I, Azad AK, Wan Sulaiman WMA. Current strategies to address data scarcity in artificial intelligence-based drug discovery: a comprehensive review. Comput Biol Med. 2024;179:108734. https://doi.org/10.1016/j.compbiomed.2024.108734
4. Alucozai M, Fondrie W, Sperry M. From data to drugs: the role of artificial intelligence in drug discovery. Wyss Institute Code to Cure. 2025 Jan 9 [cited 2026 Feb 5]. Available from: https://wyss.harvard.edu/news/from-data-to-drugs-the-role-of-artificial-intelligence-in-drug-discovery/
5. Moniz L, Gaspar M. Critical role of data quality in enabling AI in R&D. Deloitte UK Perspective. 2025 Nov 18 [cited 2026 Feb 5]. Available from: https://www.deloitte.com/uk/en/blogs/thoughts-from-the-centre/critical-role-of-data-quality-in-enabling-ai-in-r-d.html
6. Data mine to data minefield: the hidden costs of poor data quality in biopharma R&D. Elucidata Blog. 2025 Jan 15 [cited 2026 Feb 5]. Available from: https://www.elucidata.io/blog/data-mine-to-data-minefield-the-hidden-costs-of-poor-data-quality-in-biopharma-r-d
7. O’Connell T. The value of unstructured data for drug companies. Pharmaphorum Blog. 2025 Sep 12 [cited 2026 Feb 5]. Available from: https://pharmaphorum.com/rd/value-unstructured-data-drug-companies