Nowadays, publicly available datasets democratize access to the raw material for discovery and enable a scale of analysis previously unattainable for individual research groups. Data scientists, bioinformaticians, and R&D managers can turn this chaos of raw data into a reliable, AI-ready resource that drives real-world insights. Can you make the most of this potential too?
The Strategic Value of Public Data
Public repositories are a strategic asset that provides three critical advantages for any organization engaged in biomedical R&D:
- Enriching Internal Data Products. Internal, proprietary datasets are often limited in scope or size. Public datasets can be used to augment and enrich these collections, providing a broader biological context. For example, a company with a small-scale genomic study can integrate it with a large-scale public repository to increase statistical power and discover new insights.
- External Validation and Benchmarking. AI models trained on private data can be validated against public datasets to ensure their performance is generalizable and robust. Platforms like the Therapeutics Data Commons (TDC) provide specific dataset splits (e.g., temporal or scaffold) that are designed to test a model’s ability to generalize to new, out-of-distribution data – a crucial step before clinical application.
- Providing Additional Modalities. A research project focused on a single data type, such as genomics, can be enhanced by incorporating data from other public repositories. Integrating medical imaging from The Cancer Imaging Archive (TCIA) or clinical data from the Medical Information Mart for Intensive Care (MIMIC) provides a more holistic, multi-modal view of a disease. It helps compute more accurate predictions and, in turn, delivers a deeper understanding of underlying biology.
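The out-of-distribution splits mentioned above (e.g., TDC's scaffold split) can be illustrated with a toy group-based split: all compounds sharing a scaffold must land on the same side, so the test set probes generalization to unseen chemotypes rather than memorization. A minimal stdlib sketch, where the compound names and scaffold labels are hypothetical and the group ordering stands in for a real shuffling strategy:

```python
from collections import defaultdict

def group_split(items, groups, test_fraction=0.2):
    """Split items so all members of a group stay on the same side,
    mimicking a scaffold split: test groups are unseen during training."""
    by_group = defaultdict(list)
    for item, group in zip(items, groups):
        by_group[group].append(item)
    # Deterministic order for illustration; real splits shuffle or sort by size
    ordered = sorted(by_group)
    n_test_items = int(len(items) * test_fraction)
    test, train = [], []
    for g in ordered:
        target = test if len(test) < n_test_items else train
        target.extend(by_group[g])
    return train, test

# Hypothetical compounds labeled by made-up scaffold identifiers
compounds = ["c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8", "c9", "c10"]
scaffolds = ["A", "A", "B", "B", "B", "C", "C", "D", "D", "E"]
train, test = group_split(compounds, scaffolds)
# No scaffold appears on both sides of the split
```

A random split would scatter each scaffold across both sides and inflate the apparent performance; keeping whole groups out of training is what makes the benchmark honest.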
A Curated Catalog: Foundational Datasets for AI
The public data landscape is vast, but not all resources are created equal. Knowing where to look for high-quality, relevant data is the first step toward building an effective AI pipeline. Let’s dive into our curated selection of the public databases most useful in clinical and biotechnological research.
The most widely recognized repositories in genomics, proteomics, and chemistry are:
- The Cancer Genome Atlas (TCGA): A landmark program that has molecularly characterized over 20,000 primary cancer samples across 33 cancer types, generating petabytes of genomic, epigenomic, and proteomic data.
- Cancer Cell Line Encyclopedia (CCLE): A powerful resource providing genomic and drug sensitivity data for a large number of human cancer cell lines, a critical resource for preclinical research.
- BioStudies: A central repository from the European Bioinformatics Institute (EMBL-EBI) that provides a home for a wide array of datasets, including those that are not easily categorized into other specialized archives.
- Library of Integrated Network-Based Cellular Signatures (LINCS): This project provides a collection of gene expression and other cellular signature data following small molecule or genetic perturbations, a cornerstone for understanding drug mechanisms of action.
- ProteomeXchange: The global hub for proteomics data, with over 21,000 datasets, providing a molecular view of biological systems.
When you work at the level of the single cell, the best public databases for single-cell omics are:
- CellXGene: A powerful platform for exploring single-cell transcriptomic datasets, allowing researchers to visualize and analyze data from millions of cells to understand tissue function at a granular level.
- Human Tumor Atlas Network (HTAN): A collaborative effort to generate comprehensive atlases of human tumors at single-cell resolution, providing high-dimensional spatial and molecular data.
- OpenCell: A public resource that provides a comprehensive proteomic map of human cells, enabling a deeper understanding of protein localization and function.
As you can see, there is a lot to choose from. The trick is to find the resource whose data really fits your project and goals.
| Dataset Name | Data Type | Key Focus |
|---|---|---|
| TCGA | Omics (genomic, proteomic, etc.) | Comprehensive characterization of 33 cancer types. |
| CCLE | Omics & phenotypic | Genomic and drug sensitivity data for cancer cell lines. |
| BioStudies | Varies | A general repository for diverse biological data. |
| LINCS | Omics & phenotypic | Cellular signatures in response to perturbations. |
| ProteomeXchange | Omics (proteomics) | Global repository for mass spectrometry-based proteomics data. |
| CellXGene | Omics (single-cell) | Single-cell transcriptomic data for cell and tissue atlases. |
| HTAN | Omics & spatial | High-dimensional atlases of human tumors. |
| OpenCell | Omics (proteomics) | Subcellular protein localization data in human cells. |
Table 1. Curated public biological datasets supporting AI-driven research and drug discovery.
From Raw Data to AI-Ready Input
A common mistake is assuming that simply downloading raw .fastq or .h5ad files is sufficient. The most significant challenge in utilizing public data is the “data mess”: experts estimate that 97% of biological and health data is fragmented and inaccessible. Fragmentation, varying formats, and noisy labels prevent data from being useful. This is where harmonization and curation become a must.
The process of building a data input ready for training artificial intelligence models involves a few key steps:
- Standardization
Raw data from different sources must be converted to a common, standardized format to enable large-scale, integrated analysis.
- Harmonization
Data from different platforms or studies must be normalized to remove batch effects and technical noise, ensuring that biological signals, not technical artifacts, are what a model learns from.
- Curation
Experts must clean, annotate, and label the data to make it usable for supervised learning tasks, a process that is often time-consuming but invaluable.
Some platforms, such as the Therapeutics Data Commons (TDC), have already done this work, providing 66 expertly curated datasets across 22 tasks. By using such platforms, organizations can bypass the significant barrier of data preprocessing and focus on building and evaluating their models.
Otherwise, the helping hand of an experienced data scientist who knows how to prepare data will be indispensable.
AI/ML Use Cases: Pretraining, Augmentation, and Benchmarking
Once you have high-quality, AI-ready data, you can apply it to a range of powerful AI/ML use cases that go beyond simple analysis:
- Pretraining: foundation models can be pretrained on massive public datasets to learn general biological patterns before being fine-tuned for a specific task using a smaller, proprietary dataset. This approach improves model performance and reduces the need for large-scale private data collection.
- Augmentation: public data can be used to augment private datasets, increasing the diversity and size of the training set. This is particularly valuable for rare diseases or underrepresented patient populations.
- Benchmarking: curated datasets with pre-defined validation splits are essential for rigorously testing a model’s performance. They can reveal if a model that performs well in a controlled setting can actually generalize to new, unseen data, which is a critical requirement for real-world application.
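The pretraining idea above can be illustrated with a deliberately tiny numerical example: estimate a parameter on a large public cohort first, then nudge it toward a small proprietary cohort instead of trusting the small sample alone. The blending rule below is a simple sample-size-weighted average chosen for illustration, not any specific published method, and both cohorts are hypothetical:

```python
from statistics import mean

def pretrain_then_finetune(public, private):
    """Estimate a mean by 'pretraining' on a large public sample and
    'fine-tuning' on a small private one via a sample-size-weighted
    blend, so the small private cohort does not stand alone."""
    pre = mean(public)      # pretrained estimate (low variance, large n)
    tuned = mean(private)   # private-only estimate (high variance, small n)
    w = len(private) / (len(private) + len(public))
    return (1 - w) * pre + w * tuned

public_cohort = [4.8, 5.1, 5.0, 4.9, 5.2, 5.0, 4.9, 5.1]  # hypothetical
private_cohort = [6.0, 5.8]                                # small, noisy
estimate = pretrain_then_finetune(public_cohort, private_cohort)
# The estimate lands between the public mean and the private mean,
# pulled only gently away from the well-supported public value
```

Real pretraining of a deep model works analogously at the level of millions of weights: public data supplies a stable starting point, and the scarce private data only has to supply the task-specific adjustment.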
The Ethical Imperative: Trust, Transparency, and Compliance
Using public health data comes with significant ethical and regulatory responsibilities. Organizations must proactively address problems like algorithmic bias, privacy, confidentiality, licensing, and compliance.
Many public datasets are not demographically diverse. Models trained on such data can perpetuate historical inequities, leading to unequal treatment and inaccurate predictions for minority populations. Inclusive data collection and continuous model auditing are essential to mitigate this risk.
While many public datasets are de-identified, the risk of patient re-identification increases when multiple data sources are linked. Robust cybersecurity, strong anonymization protocols, and adherence to regulations like HIPAA and GDPR are non-negotiable.
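Re-identification risk of the kind described above grows when quasi-identifiers (e.g., age band, ZIP prefix, sex) combine into very small groups. One common sanity check is k-anonymity: every combination of quasi-identifiers should be shared by at least k records. A minimal stdlib sketch; the records and threshold are illustrative only and no substitute for a formal privacy review:

```python
from collections import Counter

def k_anonymity_violations(records, quasi_identifiers, k=5):
    """Return the quasi-identifier combinations shared by fewer than k
    records; any hit marks rows at elevated re-identification risk."""
    combos = Counter(
        tuple(rec[q] for q in quasi_identifiers) for rec in records
    )
    return {combo: n for combo, n in combos.items() if n < k}

# Hypothetical de-identified records
records = [
    {"age_band": "40-49", "zip3": "021", "sex": "F"},
    {"age_band": "40-49", "zip3": "021", "sex": "F"},
    {"age_band": "70-79", "zip3": "994", "sex": "M"},  # unique combo: risky
]
risky = k_anonymity_violations(records, ["age_band", "zip3", "sex"], k=2)
# The unique record is flagged; linking it to another dataset could
# single out an individual even though no direct identifiers remain
```

Checks like this matter most at the moment datasets are linked: each record may be safe in its source repository and still become unique in the joined table.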
Also, public datasets often have specific licensing requirements for their use. It is critical to understand and comply with these licenses to ensure the ethical and legal use of the data for commercial and research purposes.
Real-World Impact: How Ardigen Taps into Public Data
Companies like ours demonstrate how real-world public data can be used to fuel drug discovery. At Ardigen, we strategically combine public datasets and proprietary AI models to transform complex data into actionable insights for our partners.
Our Data Universe offering gives you direct access to a curated and continuously expanded repository of life-science datasets, including:
- more than 100,000 biobank subjects with clinical records and multi-omic profiles,
- over 200,000 publicly sourced research projects with rich metadata,
- Ardigen’s in-house cohorts (for example, a transcriptome CRC cohort of 86 subjects, single-cell atlases covering 40+ cell types, microbiome data from 600+ subjects).
In partnership with you, we bridge your proprietary data with our public, literature-derived, and in-house assets. Behind the scenes, we clean, harmonize, and structure raw data into common ontologies and formats.
Then, for these hybrid datasets, we apply advanced multimodal integration (omics, clinical, imaging, and morphological). The outcome? A library of tailored data products providing a robust foundation for AI/ML solutions: you get accelerated hypothesis testing, higher confidence in target or biomarker discovery, and better alignment between preclinical models and clinical reality.
In short, we provide the data resources + infrastructure + integration workflow, and you bring the domain expertise and project objectives.
Ardigen phenAID also supports the JUMP-CP Consortium, which aims to validate and scale up image-based drug discovery strategies by creating the world’s largest public cell imaging dataset, covering both genetic and chemical perturbations.
Its members include 10 leading pharmaceutical companies (Amgen, AstraZeneca, Bayer, Biogen, Eisai, Janssen Pharmaceutica NV, Merck KGaA, Pfizer, Servier, and Takeda) and two non-profit research organizations: the Broad Institute of MIT and Harvard and Ksilink. We provide deep learning expertise and facilitate exploration of the JUMP-CP Cell Painting dataset via a dedicated web application.
Spend less time struggling with inconsistent data harmonization and free up your time to make discoveries. With our ready-to-use data and user-friendly access to integrated life science datasets, you can get insights faster than ever before.
To learn more about Ardigen’s approach, visit Phenotypic Profiling.
Author: Martyna Piotrowska
Technical editing: Jan Majta, PhD, Ardigen expert
Bibliography:
- Weinstein JN, Collisson EA, Mills GB, et al. The Cancer Genome Atlas Pan-Cancer analysis project. Nat Genet. 2013;45(10):1113-1120. https://www.nature.com/articles/ng.2764
- Barretina J, Caponigro G, Stransky N, et al. The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature. 2012;483(7388):603-607. https://doi.org/10.1038/nature11003
- Sarkans U, et al. The BioStudies database – one stop shop for all data supporting a life sciences study. Nucleic Acids Res. 2018;46(D1):D1266–70. https://doi.org/10.1093/nar/gkx965
- Subramanian A, Narayan R, Corsello SM, et al. A Next Generation Connectivity Map: L1000 Platform and the First 1,000,000 Profiles. Cell. 2017;171(6):1437-1452.e17. https://doi.org/10.1016/j.cell.2017.10.049
- Vizcaíno JA, Deutsch EW, Wang R, et al. ProteomeXchange: globally coordinating proteomics data submission and dissemination. Nat Biotechnol. 2014;32(3):223-226. https://doi.org/10.1038/nbt.2839
- The Human Tumor Atlas Network (HTAN) Consortium. The Human Tumor Atlas Network (HTAN): A multi-center collaboration to build a comprehensive atlas of human cancers. Cell. 2020;182(2):290-297. https://doi.org/10.1016/j.cell.2020.03.053
- Ardigen. Driving better drug discovery: smarter data, smarter decisions [Internet]. 2025 [cited 2025 Oct 16]. Available from: https://ardigen.com/driving-better-drug-discovery-smarter-data-smarter-decisions/
- Ardigen. From data chaos to drug discovery [Internet]. 2025 [cited 2025 Oct 27]. Available from: https://ardigen.com/from-data-chaos-to-drug-discovery/
- Ardigen. Ardigen phenAID’s multimodal approach improves MoA and bioactivity prediction when applied to a HCS dataset from a Big Pharma company [Internet]. 2025 [cited 2025 Oct 27]. Available from: https://ardigen.com/ardigen-phenaids-multimodal-approach-improves-moa-and-bioactivity-prediction-when-applied-to-a-hcs-dataset-from-a-big-pharma-company/