What Type Of Data Do You Need For AI Drug Discovery (part 2)

Summary:

  • Data types used for AI model training in drug discovery range from molecular to multimodal.
  • AI-ready data adheres to FAIR principles, is accurate, consistent, standardized, and well-labeled. Proper data governance also ensures that datasets comply with data security regulations.

You want to accelerate therapeutics development workflows in your organization. Now what? Which datasets feed AI models, and where can you find them? AI models learn from specific, structured representations of chemistry, biology, pharmacology, and clinical outcomes. Understanding what each data type contributes, and where it fails, is essential to building AI solutions that translate into new drug discoveries.

Characteristics of AI-Ready Data

Before digging deeper into data types, let’s look at an aspect that applies to all data, regardless of type. Datasets must be thoroughly prepared before they can be ingested into models. To fully harness the potential of AI in drug discovery, data must have specific key characteristics.

  1. High-quality data that adheres to FAIR principles (Findable, Accessible, Interoperable, Reusable) is critical to reliable AI.
  2. Accurate, consistent, and well-labeled data is essential for deploying trusted AI models that perform as intended.
  3. Contextualized, standardized, and accessible data facilitates cross-team and institutional validation of findings.
  4. Robust data governance that meets all regulatory requirements ensures inherent compliance and traceability, increasing trust in AI-driven outputs.
  5. Trusted, standardized datasets enable cross-organizational model training and shared innovation.
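
Some of the characteristics above can be turned into simple automated checks. The following is a minimal sketch in Python, assuming a hypothetical record layout: the field names (`compound_id`, `assay`, `value`, `unit`, `source`, `license`) and the allowed units are illustrative assumptions, not a standard. It flags records with missing metadata, unexpected units, or non-numeric values.

```python
# Hypothetical required metadata fields; adapt to your own data model.
REQUIRED_FIELDS = ("compound_id", "assay", "value", "unit", "source", "license")
# Assumed set of units accepted for this illustrative dataset.
ALLOWED_UNITS = {"nM", "uM"}


def readiness_issues(record: dict) -> list[str]:
    """Return a list of human-readable problems found in one record."""
    issues = []
    # Completeness: every required field must be present and non-empty.
    for field in REQUIRED_FIELDS:
        if record.get(field) in (None, ""):
            issues.append(f"missing field: {field}")
    # Standardization: units must come from the agreed vocabulary.
    unit = record.get("unit")
    if unit not in (None, "") and unit not in ALLOWED_UNITS:
        issues.append(f"unexpected unit: {unit}")
    # Consistency: measured values must be numeric.
    value = record.get("value")
    if value is not None and not isinstance(value, (int, float)):
        issues.append("value is not numeric")
    return issues


# Illustrative records, not real assay data.
records = [
    {"compound_id": "CMPD-1", "assay": "IC50", "value": 12.5, "unit": "nM",
     "source": "in-house", "license": "internal"},
    {"compound_id": "CMPD-2", "assay": "IC50", "value": "n/a", "unit": "mg",
     "source": "", "license": "internal"},
]

report = {r["compound_id"]: readiness_issues(r) for r in records}
for compound_id, issues in report.items():
    print(compound_id, "OK" if not issues else issues)
```

Checks like these cover only the mechanical part of AI-readiness. Whether the labels are scientifically correct still needs domain review.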

Achieving AI-ready datasets requires close collaboration between business and technology teams to define what ‘good data’ means, and then to use that definition to identify the most critical data and how it must be integrated to achieve the desired outcome.

Fig 1. The AI Drug Discovery Data Stack. Governance underpins all modalities; higher layers add translational signal but also complexity and risk.

Key Data Types Used in Drug Discovery Research

Drug discovery research relies on a complex interplay of chemical, biological, and clinical data to move a candidate from initial design to patient use. These data types provide the information needed to train artificial intelligence (AI) models and to guide experimental decisions.

There are several core data modalities used in this field:


1. Chemical and Molecular Data [1]
This foundational dataset focuses on the digital representation of physical matter, enabling AI models to understand and optimize molecular structures.

  • Treating chemistry as a language, researchers use string-based formats such as SMILES (Simplified Molecular-Input Line-Entry System) or SELFIES to encode molecular topology as linear sequences.
  • Molecules are often treated as mathematical two-dimensional graphs where atoms are nodes and bonds are edges, allowing graph neural networks (GNNs) to extract topological features.
  • Because a drug must fit precisely into a protein’s binding pocket, three-dimensional data on Euclidean coordinates, bond angles, and conformational space (the various shapes a molecule can take) are critical for predicting binding affinity.
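
To make the graph view concrete, here is a minimal sketch using only the Python standard library. It hand-encodes ethanol (SMILES `CCO`) as nodes and edges and applies one simplified neighbor-aggregation step, which is the basic idea behind message passing in GNNs. The element vocabulary and the plain summation are assumptions made for illustration; real pipelines typically rely on a cheminformatics toolkit and learned weights.

```python
# Ethanol, SMILES "CCO": heavy atoms only, hydrogens left implicit.
atoms = ["C", "C", "O"]            # nodes
bonds = [(0, 1), (1, 2)]           # edges as pairs of atom indices

# Build an undirected adjacency list from the bond list.
neighbors = {i: [] for i in range(len(atoms))}
for a, b in bonds:
    neighbors[a].append(b)
    neighbors[b].append(a)

# One-hot node features over a tiny illustrative element vocabulary.
ELEMENTS = ["C", "O"]
features = [[1.0 if el == sym else 0.0 for el in ELEMENTS] for sym in atoms]


def aggregate(features, neighbors):
    """One simplified message-passing step: each atom sums its own
    feature vector with those of its bonded neighbors (no learned weights)."""
    updated = []
    for i, own in enumerate(features):
        total = list(own)
        for j in neighbors[i]:
            total = [t + f for t, f in zip(total, features[j])]
        updated.append(total)
    return updated


updated = aggregate(features, neighbors)
for symbol, vector in zip(atoms, updated):
    print(symbol, vector)
```

After one step, each atom's vector already encodes its immediate chemical neighborhood: the middle carbon "sees" both a carbon and an oxygen, while the terminal carbon sees only carbon.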
2. Biological and ‘Omics’ Data [2]
These data types map complicated biological systems and how they respond to diseases or potential treatments.

  • Genomics data provides a blueprint of genetic variations and mutations at the root of the disease, which is essential for target identification and patient stratification.
  • Transcriptomics data measure gene expression (RNA levels) to inform how cellular activity changes in response to a drug.
  • Since proteins are the primary targets for most drugs, proteomics data on their 3D structures, abundance, and post-translational modifications are indispensable.
  • Metabolomics data offers a complete understanding of small-molecule metabolites and their dynamic, context-specific profiles.

3. Pharmacological and ADMET Data [3]
A major cause of drug failure is poor safety or absorption rather than lack of efficacy. ADMET properties cover Absorption, Distribution, Metabolism, Excretion, and Toxicity. The FDA Adverse Event Reporting System (FAERS) data can be a valuable source for training AI models.

Researchers then use the trained predictive models to evaluate endpoints such as blood-brain barrier penetration, metabolic stability (e.g., cytochrome P450 inhibition), and cardiotoxicity (e.g., hERG channel blockade) before a compound is synthesized.

4. Clinical and Real-World Data (RWD) in Pharma
As development progresses toward human trials and commercialization, data focus shifts toward clinical outcomes and patient health.

  • Electronic Health Records (EHR) contain both structured data (diagnosis codes, lab values) and unstructured data (clinician notes, discharge summaries) that provide longitudinal trajectories of a patient’s health.
  • Claims and billing data (insurance records) offer insights into healthcare utilization, treatment costs, and long-term effectiveness in diverse populations.
  • Real-time data from digital health technologies (DHT), such as wearable devices (e.g., smartwatches) and mobile apps, enable continuous physiological monitoring during daily activities.

5. Unstructured Knowledge Sources
Approximately 80% of all healthcare data exists in unstructured formats that are not easily searchable.

  • Peer-reviewed publications and journals are the primary sources of new chemical discoveries and biological insights.
  • Patents describe protected chemical spaces, often using Markush structures, generic descriptions that can represent billions of potential compounds.

Synthesizing data from multiple sources and formats requires a well-thought-out process and skilled data preparation. This reduces bias and ensures that the data can be used effectively. Otherwise, the time spent training models on poorly selected and prepared data is wasted and drains resources.

Integration Issues in AI-Driven Drug Discovery

One of the most significant hurdles in using AI for drug discovery is the pervasive lack of data standardization. Data from diverse sources typically exists in disparate formats without a shared schema, complicating integration and analysis. This incompatibility can lead to inconsistencies and errors in AI models, undermining the reliability of results.

Furthermore, the complexities of integrating diverse data sources hinder the development of innovative trial models, including virtual and decentralized designs. To overcome these challenges, it is imperative to establish robust data standards and protocols to enable seamless integration and interoperability across systems and platforms.
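
As an illustration of what standardization can mean in practice, the sketch below maps records from two hypothetical sources onto one shared schema and converts potency values to a single unit. The source layouts, field names, and target schema are assumptions made for this example, not an established standard.

```python
# Conversion factors to nanomolar for the units assumed in this sketch.
TO_NANOMOLAR = {"nM": 1.0, "uM": 1_000.0, "mM": 1_000_000.0}


def from_source_a(row: dict) -> dict:
    """Hypothetical source A already reports IC50 in nM."""
    return {
        "compound_id": row["id"],
        "ic50_nm": float(row["ic50_nM"]),
        "source": "A",
    }


def from_source_b(row: dict) -> dict:
    """Hypothetical source B reports potency with an explicit unit."""
    factor = TO_NANOMOLAR[row["potency_unit"]]  # raises KeyError on unknown units
    return {
        "compound_id": row["compound"],
        "ic50_nm": float(row["potency"]) * factor,
        "source": "B",
    }


# Illustrative rows, not real measurements.
source_a = [{"id": "CMPD-1", "ic50_nM": "12"}]
source_b = [{"compound": "CMPD-2", "potency": 0.5, "potency_unit": "uM"}]

harmonized = [from_source_a(r) for r in source_a] + [from_source_b(r) for r in source_b]
for row in harmonized:
    print(row)
```

Failing loudly on an unknown unit is a deliberate choice here: a silent default would reintroduce exactly the kind of inconsistency that standardization is meant to remove.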

Regulatory bodies such as the FDA and EMA have outlined the direction for data management in AI training. In January 2026, they jointly published a good practice principles document that succinctly defines the recommendations:

Data source provenance, processing steps, and analytical decisions are documented in a detailed, traceable, and verifiable manner, in line with GxP requirements. Appropriate governance, including privacy and protection for sensitive data, is maintained throughout the technology’s life cycle.

For now, this is not a regulation, but it certainly points the way forward and may become a required standard in the near future.
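
One simple way to start documenting provenance and processing steps is to log each transformation together with a checksum of the data it produced. The sketch below uses only the Python standard library; the log structure and step descriptions are assumptions for illustration and do not, on their own, constitute GxP compliance.

```python
import hashlib
import json
from datetime import datetime, timezone


def fingerprint(data: bytes) -> str:
    """SHA-256 checksum that lets anyone verify the exact bytes later."""
    return hashlib.sha256(data).hexdigest()


def add_step(log: list, description: str, data: bytes) -> None:
    """Append one processing step and the checksum of its output."""
    log.append({
        "step": len(log) + 1,
        "description": description,
        "sha256": fingerprint(data),
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    })


# Illustrative data, not a real export.
raw = b"compound_id,ic50\nCMPD-1,12 nM\n"
cleaned = raw.replace(b" nM", b"")

provenance = []
add_step(provenance, "raw export received from hypothetical source system", raw)
add_step(provenance, "stripped unit suffix from ic50 column", cleaned)

print(json.dumps(provenance, indent=2))
```

Because each entry carries a checksum, a reviewer can recompute the hash of an archived file and confirm that it matches the step recorded in the log.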

Multimodal Data

These specialized data types represent a move toward high-resolution, granular information from multiple sources, enabling researchers to look beyond simple target-ligand interactions toward complex cellular systems. By integrating diverse sources, AI models can detect elusive patterns in disease progression and drug response that were previously hidden in siloed archives.

Here is a brief overview of how multimodal data types can be utilized in modern drug discovery research:

1. High-Resolution Cellular Screening

Phenotypic screening data captures the functional effects of a drug on an entire biological system rather than on a single protein target. It is increasingly important for identifying scaffold-hopping opportunities and finding drugs that work through complex, multi-target mechanisms.

High-content screening (HCS) imaging and Cell Painting assay data are advanced examples of phenotypic screening. HCS generates large volumes of image-based data that AI models use to evaluate cellular changes. The Cell Painting assay specifically uses fluorescent dyes to label different cellular compartments, creating a fingerprint of the cell’s state.
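
Once each treatment is summarized as a numeric fingerprint, profiles can be compared directly. The sketch below computes cosine similarity between three toy profiles; the feature values are made up for illustration, and a real Cell Painting profile would contain hundreds or thousands of normalized image features.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two profiles (1.0 = same direction).
    Assumes neither profile is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy morphological profiles (illustrative numbers, not measured data).
profiles = {
    "compound_A": [1.0, 0.0, 2.0],
    "compound_B": [2.0, 0.0, 4.0],   # same pattern as A, stronger effect
    "compound_C": [0.0, 3.0, 0.0],   # affects a different feature entirely
}

sim_ab = cosine_similarity(profiles["compound_A"], profiles["compound_B"])
sim_ac = cosine_similarity(profiles["compound_A"], profiles["compound_C"])
print(f"A vs B: {sim_ab:.3f}")
print(f"A vs C: {sim_ac:.3f}")
```

Compounds whose profiles point in the same direction may share a mechanism of action even when their chemical structures differ, which is one reason phenotypic fingerprints are useful for finding scaffold-hopping candidates.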

2. Advanced ‘Omics’ and Genetic Interaction
  • Unlike bulk measurements, single-cell RNA-seq data provides a granular view of cellular heterogeneity, allowing researchers to map specific gene regulatory networks across individual cells. AI helps automate the time-consuming process of cell-type annotation in these clusters.
  • Spatial omics, including spatial transcriptomics, allows scientists to explore gene expression patterns directly within tissue environments, a context that single-cell RNA-seq data lacks. This context is critical for understanding how the spatial arrangement of cells influences disease pathology.
  • AI is used for in silico design and for interpreting CRISPR screening data, helping researchers understand the effects of specific genetic perturbations on drug sensitivity and target validation.
  • Optical pooled screening data combines imaging with next-generation sequencing (NGS) to link cellular phenotypes directly to genetic variations in a high-throughput format.
3. Clinical Infrastructure and Population Data
  • Clinical Trial Data Lake: Organizations like Novartis (through its data42 initiative) are integrating decades of clinical trial records, spanning thousands of trials and hundreds of indications, into centralized cloud repositories. This single source enables AI to answer complex research questions in seconds, rather than months.
  • UK Biobank Data: A gold-standard model for multi-omics integration, the UK Biobank links proteomic, genomic, and clinical data. Researchers use this linkage to identify causal drivers of disease, such as specific proteins linked to cardiac conditions.
4. Histopathology and Image Extraction

Traditionally, analyzing microscope slides from animal or human tissues relied on subjective expert interpretation. AI-driven image recognition techniques now make histopathology slide data extraction more objective by gathering quantitative features from scanned pathology reports and PDF lab results to identify treatment-related effects.

In addition to multimodal data sources, researchers are developing Multimodal Language Models (MLMs) to process text, images, and genetic sequences simultaneously. Integrating omics data with clinical features and imaging allows AI to identify more robust therapeutic targets, improve patient stratification for clinical trials, and predict clinical outcomes by correlating genetic variants with clinical biomarkers [4].
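
Before any multimodal model can be trained, the modalities have to be linked per patient or sample. The sketch below joins three hypothetical tables by patient ID, keeps only complete cases, and reports which modality is missing for the rest. The table contents and field names are illustrative assumptions.

```python
# Toy per-patient tables for three modalities (illustrative values only).
omics = {"P1": {"gene_x_expr": 2.1}, "P2": {"gene_x_expr": 0.4}, "P3": {"gene_x_expr": 1.3}}
clinical = {"P1": {"age": 54}, "P2": {"age": 61}}
imaging = {"P1": {"lesion_area_mm2": 12.0}, "P3": {"lesion_area_mm2": 7.5}}

modalities = {"omics": omics, "clinical": clinical, "imaging": imaging}


def integrate(modalities: dict):
    """Merge modalities by patient ID.

    Returns (complete, missing): merged records for patients present in
    every modality, and the absent modalities for everyone else.
    """
    all_ids = set().union(*(table.keys() for table in modalities.values()))
    complete, missing = {}, {}
    for pid in sorted(all_ids):
        absent = [name for name, table in modalities.items() if pid not in table]
        if absent:
            missing[pid] = absent
        else:
            merged = {}
            for table in modalities.values():
                merged.update(table[pid])
            complete[pid] = merged
    return complete, missing


complete, missing = integrate(modalities)
print("complete cases:", complete)
print("incomplete cases:", missing)
```

Reporting the incomplete cases explicitly, rather than dropping them silently, makes it visible how much of the cohort is lost when modalities are combined.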

Having Multimodal Data Is Only the Starting Point

AI models depend on diverse, high-resolution datasets. This data diversity increases predictive potential, but also structural complexity. Without disciplined integration and governance, AI may amplify noise rather than deliver impactful insights. The true power of multimodal AI emerges only when data types are selected and integrated with downstream decision-making in mind.

Author: Martyna Piotrowska

Technical editing:  Ardigen expert: Ida Kupś

Bibliography

  1. Mishra P. How machine learning reads chemical structures. Neovarsity Blog. 2025 May 8 [cited 2026 Feb 4]. Available from: https://neovarsity.org/blogs/how-machine-learning-reads-chemical-structures
  2. Gangwal A, Ansari A, Ahmad I, Azad AK, Mohd Azizi W, Sulaiman W. Current strategies to address data scarcity in artificial intelligence-based drug discovery: a comprehensive review. Comput Biol Med. 2024;179:108734. https://doi.org/10.1016/j.compbiomed.2024.108734
  3. Swanson K, Walther P, Leitz J, et al. ADMET-AI: A machine learning ADMET platform for evaluation of large-scale chemical libraries. Bioinformatics. 2024;40(7):btae416. https://doi.org/10.1093/bioinformatics/btae416
  4. Zoccoli A, Velez CN, Geukes Foppen RJ, Gioia V. From siloed data to breakthroughs: multimodal AI in drug discovery. Drug Target Review. 2025 Jun 11 [cited 2026 Feb 5]. Available from: https://www.drugtargetreview.com/article/160597/from-siloed-data-to-breakthroughs-multimodal-ai-in-drug-discovery/
