Nowadays, publicly available datasets democratize access to the raw material for discovery and enable a scale of analysis previously unattainable for individual research groups. Data scientists, bioinformaticians, and R&D managers can turn this chaos of raw data into a reliable, AI-ready resource that drives real-world insights. Can you make the most of this potential too?
The Strategic Value of Public Data
Public repositories are a strategic asset that provides three critical advantages for any organization engaged in biomedical R&D:
- Enriching Internal Data Products. Internal, proprietary datasets are often limited in scope or size. Public datasets can be used to augment and enrich these collections, providing a broader biological context. For example, a company with a small-scale genomic study can integrate it with a large-scale public repository to increase statistical power and discover new insights.
- External Validation and Benchmarking. AI models trained on private data can be validated against public datasets to ensure their performance is generalizable and robust. Platforms like the Therapeutics Data Commons (TDC) provide specific dataset splits (e.g., temporal or scaffold) that are designed to test a model’s ability to generalize to new, out-of-distribution data – a crucial step before clinical application.
- Providing Additional Modalities. A research project focused on a single data type, such as genomics, can be enhanced by incorporating data from other public repositories. Integrating medical imaging from The Cancer Imaging Archive (TCIA) or clinical data from the Medical Information Mart for Intensive Care (MIMIC) provides a more holistic, multi-modal view of a disease. It helps compute more accurate predictions and, in turn, delivers a deeper understanding of underlying biology.
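The out-of-distribution splits mentioned above (e.g., TDC's scaffold split) can be illustrated with a toy group-based split: all compounds sharing a scaffold must land on the same side, so the test set probes generalization to unseen chemotypes rather than memorization. A minimal stdlib sketch, where the compound names and scaffold labels are hypothetical and the group ordering stands in for a real shuffling strategy:

```python
from collections import defaultdict

def group_split(items, groups, test_fraction=0.2):
    """Split items so all members of a group stay on the same side,
    mimicking a scaffold split: test groups are unseen during training."""
    by_group = defaultdict(list)
    for item, group in zip(items, groups):
        by_group[group].append(item)
    # Deterministic order for illustration; real splits shuffle or sort by size
    ordered = sorted(by_group)
    n_test_items = int(len(items) * test_fraction)
    test, train = [], []
    for g in ordered:
        target = test if len(test) < n_test_items else train
        target.extend(by_group[g])
    return train, test

# Hypothetical compounds labeled by made-up scaffold identifiers
compounds = ["c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8", "c9", "c10"]
scaffolds = ["A", "A", "B", "B", "B", "C", "C", "D", "D", "E"]
train, test = group_split(compounds, scaffolds)
# No scaffold appears on both sides of the split
```

A random split would scatter each scaffold across both sides and inflate the apparent performance; keeping whole groups out of training is what makes the benchmark honest.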
A Curated Catalog: Foundational Datasets for AI
The public data landscape is vast, but not all resources are created equal. Knowing where to look for high-quality, relevant data is the first step toward building an effective AI pipeline. Let’s dive into our curated selection of the public databases most useful in clinical and biotechnological research.
The most widely recognized repositories in genomics, proteomics, and chemistry are:
- The Cancer Genome Atlas (TCGA): A landmark program that has molecularly characterized over 20,000 primary cancer samples across 33 cancer types, generating petabytes of genomic, epigenomic, and proteomic data.
- Cancer Cell Line Encyclopedia (CCLE): A powerful resource providing genomic and drug sensitivity data for a large number of human cancer cell lines, a critical resource for preclinical research.
- BioStudies: A central repository from the European Bioinformatics Institute (EMBL-EBI) that provides a home for a wide array of datasets, including those that are not easily categorized into other specialized archives.
- Library of Integrated Network-Based Cellular Signatures (LINCS): This project provides a collection of gene expression and other cellular signature data following small molecule or genetic perturbations, a cornerstone for understanding drug mechanisms of action.
- ProteomeXchange: The global hub for proteomics data, with over 21,000 datasets, providing a molecular view of biological systems.
When you work at the level of the single cell, the best public databases for single-cell omics are:
- CellXGene: A powerful platform for exploring single-cell transcriptomic datasets, allowing researchers to visualize and analyze data from millions of cells to understand tissue function at a granular level.
- Human Tumor Atlas Network (HTAN): A collaborative effort to generate comprehensive atlases of human tumors at single-cell resolution, providing high-dimensional spatial and molecular data.
- OpenCell: A public resource that provides a comprehensive proteomic map of human cells, enabling a deeper understanding of protein localization and function.
As you can see, there is a lot to choose from. The trick is to find the resource whose data really fits your project and goals.
| Dataset Name | Data Type | Key Focus |
|---|---|---|
| TCGA | Omics (genomic, proteomic, etc.) | Comprehensive characterization of 33 cancer types. |
| CCLE | Omics & phenotypic | Genomic and drug sensitivity data for cancer cell lines. |
| BioStudies | Varies | A general repository for diverse biological data. |
| LINCS | Omics & phenotypic | Cellular signatures in response to perturbations. |
| ProteomeXchange | Omics (proteomics) | Global repository for mass spectrometry-based proteomics data. |
| CellXGene | Omics (single-cell) | Single-cell transcriptomic data for cell and tissue atlases. |
| HTAN | Omics & spatial | High-dimensional atlases of human tumors. |
| OpenCell | Omics (proteomics) | Subcellular protein localization data in human cells. |
Table 1. Curated public biological datasets supporting AI-driven research and drug discovery.
From Raw Data to AI-Ready Input
A common mistake is assuming that simply downloading raw .fastq or .h5ad files is sufficient. The most significant challenge in utilizing public data is the “data mess”: experts estimate that 97% of biological and health data is fragmented and inaccessible. Fragmentation, varying formats, and noisy labels prevent data from being useful. This is where harmonization and curation become a must.
The process of building a data input ready for training artificial intelligence models involves a few key steps:
- Standardization
Raw data from different sources must be converted to a common, standardized format to enable large-scale, integrated analysis.
- Harmonization
Data from different platforms or studies must be normalized to remove batch effects and technical noise, ensuring that biological signals, not technical artifacts, are what a model learns from.
- Curation
Experts must clean, annotate, and label the data to make it usable for supervised learning tasks, a process that is often time-consuming but invaluable.
Some platforms, such as the Therapeutics Data Commons (TDC), have already done this work, providing 66 expertly curated datasets across 22 tasks. By using such platforms, organizations can bypass the significant barrier of data preprocessing and focus on building and evaluating their models.
Otherwise, the helping hand of an experienced data scientist who knows how to prepare data will be indispensable.
AI/ML Use Cases: Pretraining, Augmentation, and Benchmarking
Once you have high-quality, AI-ready data, you can apply it to a range of powerful AI/ML use cases that go beyond simple analysis:
- Pretraining: foundation models can be pretrained on massive public datasets to learn general biological patterns before being fine-tuned for a specific task using a smaller, proprietary dataset. This approach improves model performance and reduces the need for large-scale private data collection.
- Augmentation: public data can be used to augment private datasets, increasing the diversity and size of the training set. This is particularly valuable for rare diseases or underrepresented patient populations.
- Benchmarking: curated datasets with pre-defined validation splits are essential for rigorously testing a model’s performance. They can reveal if a model that performs well in a controlled setting can actually generalize to new, unseen data, which is a critical requirement for real-world application.
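The pretraining idea above can be illustrated with a deliberately tiny numerical example: estimate a parameter on a large public cohort first, then nudge it toward a small proprietary cohort instead of trusting the small sample alone. The blending rule below is a simple sample-size-weighted average chosen for illustration, not any specific published method, and both cohorts are hypothetical:

```python
from statistics import mean

def pretrain_then_finetune(public, private):
    """Estimate a mean by 'pretraining' on a large public sample and
    'fine-tuning' on a small private one via a sample-size-weighted
    blend, so the small private cohort does not stand alone."""
    pre = mean(public)      # pretrained estimate (low variance, large n)
    tuned = mean(private)   # private-only estimate (high variance, small n)
    w = len(private) / (len(private) + len(public))
    return (1 - w) * pre + w * tuned

public_cohort = [4.8, 5.1, 5.0, 4.9, 5.2, 5.0, 4.9, 5.1]  # hypothetical
private_cohort = [6.0, 5.8]                                # small, noisy
estimate = pretrain_then_finetune(public_cohort, private_cohort)
# The estimate lands between the public mean and the private mean,
# pulled only gently away from the well-supported public value
```

Real pretraining of a deep model works analogously at the level of millions of weights: public data supplies a stable starting point, and the scarce private data only has to supply the task-specific adjustment.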
The Ethical Imperative: Trust, Transparency, and Compliance
Using public health data comes with significant ethical and regulatory responsibilities. Organizations must proactively address problems like algorithmic bias, privacy, confidentiality, licensing, and compliance.
Many public datasets are not demographically diverse. Models trained on such data can perpetuate historical inequities, leading to unequal treatment and inaccurate predictions for minority populations. Inclusive data collection and continuous model auditing are essential to mitigate this risk.
While many public datasets are de-identified, the risk of patient re-identification increases when multiple data sources are linked. Robust cybersecurity, strong anonymization protocols, and adherence to regulations like HIPAA and GDPR are non-negotiable.
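Re-identification risk of the kind described above grows when quasi-identifiers (e.g., age band, ZIP prefix, sex) combine into very small groups. One common sanity check is k-anonymity: every combination of quasi-identifiers should be shared by at least k records. A minimal stdlib sketch; the records and threshold are illustrative only and no substitute for a formal privacy review:

```python
from collections import Counter

def k_anonymity_violations(records, quasi_identifiers, k=5):
    """Return the quasi-identifier combinations shared by fewer than k
    records; any hit marks rows at elevated re-identification risk."""
    combos = Counter(
        tuple(rec[q] for q in quasi_identifiers) for rec in records
    )
    return {combo: n for combo, n in combos.items() if n < k}

# Hypothetical de-identified records
records = [
    {"age_band": "40-49", "zip3": "021", "sex": "F"},
    {"age_band": "40-49", "zip3": "021", "sex": "F"},
    {"age_band": "70-79", "zip3": "994", "sex": "M"},  # unique combo: risky
]
risky = k_anonymity_violations(records, ["age_band", "zip3", "sex"], k=2)
# The unique record is flagged; linking it to another dataset could
# single out an individual even though no direct identifiers remain
```

Checks like this matter most at the moment datasets are linked: each record may be safe in its source repository and still become unique in the joined table.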
Also, public datasets often have specific licensing requirements for their use. It is critical to understand and comply with these licenses to ensure the ethical and legal use of the data for commercial and research purposes.
Real-World Impact: How Ardigen Taps into Public Data
Companies like ours demonstrate how real-world public data can be used to fuel drug discovery. At Ardigen, we strategically combine public datasets and proprietary AI models to transform complex data into actionable insights for our partners.
Our Data Universe offering gives you direct access to a curated and continuously expanded repository of life-science datasets, including:
- more than 100,000 biobank subjects with clinical records and multi-omic profiles,
- over 200,000 publicly sourced research projects with rich metadata,
- Ardigen’s in-house cohorts (for example, a transcriptome CRC cohort of 86 subjects, single-cell atlases covering 40+ cell types, microbiome data from 600+ subjects).
In partnership with you, we bridge your proprietary data with our public, literature-derived, and in-house assets. Behind the scenes, we clean, harmonize, and structure raw data into common ontologies and formats.
Then, for these hybrid datasets, we apply advanced multimodal integration (omics, clinical, imaging, and morphological). The outcome? A library of tailored data products providing a robust foundation for AI/ML solutions: you get accelerated hypothesis testing, higher confidence in target or biomarker discovery, and better alignment between preclinical models and clinical reality.
In short, we provide the data resources + infrastructure + integration workflow, and you bring the domain expertise and project objectives.
Ardigen phenAID also supports the JUMP-CP Consortium, which aims to validate and scale up image-based drug discovery strategies by creating the world’s largest public cell imaging dataset, covering both genetic and chemical perturbations.
Its members include 10 leading pharmaceutical companies (Amgen, AstraZeneca, Bayer, Biogen, Eisai, Janssen Pharmaceutica NV, Merck KGaA, Pfizer, Servier, and Takeda) and two non-profit research organizations: the Broad Institute of MIT and Harvard and Ksilink. We provide deep learning expertise and facilitate exploration of the JUMP-CP Cell Painting dataset via a dedicated web application.
Spend less time struggling with inconsistent data harmonization and free up your time to make discoveries. With our ready-to-use data and user-friendly access to integrated life science datasets, you can get insights faster than ever before.
To learn more about Ardigen’s approach, visit Phenotypic Profiling.
Author: Martyna Piotrowska
Technical editing: Jan Majta, PhD, Ardigen expert
Bibliography:
- Weinstein JN, Collisson EA, Mills GB, et al. The Cancer Genome Atlas Pan-Cancer analysis project. Nat Genet. 2013;45(10):1113-1120. https://www.nature.com/articles/ng.2764
- Barretina J, Caponigro G, Stransky N, et al. The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature. 2012;483(7388):603-607. https://doi.org/10.1038/nature11003
- Sarkans U, et al. The BioStudies database – one stop shop for all data supporting a life sciences study. Nucleic Acids Res. 2018;46(D1):D1266–70. https://doi.org/10.1093/nar/gkx965
- Subramanian A, Narayan R, Corsello SM, et al. A Next Generation Connectivity Map: L1000 Platform and the First 1,000,000 Profiles. Cell. 2017;171(6):1437-1452.e17. https://doi.org/10.1016/j.cell.2017.10.049
- Vizcaíno JA, Deutsch EW, Wang R, et al. ProteomeXchange: globally coordinating proteomics data submission and dissemination. Nat Biotechnol. 2014;32(3):223-226. https://doi.org/10.1038/nbt.2839
- The Human Tumor Atlas Network (HTAN) Consortium. The Human Tumor Atlas Network (HTAN): A multi-center collaboration to build a comprehensive atlas of human cancers. Cell. 2020;182(2):290-297. https://doi.org/10.1016/j.cell.2020.03.053
- Ardigen. Driving better drug discovery: smarter data, smarter decisions [Internet]. 2025 [cited 2025 Oct 16]. Available from: https://ardigen.com/driving-better-drug-discovery-smarter-data-smarter-decisions/
- Ardigen. From data chaos to drug discovery [Internet]. 2025 [cited 2025 Oct 27]. Available from: https://ardigen.com/from-data-chaos-to-drug-discovery/
- Ardigen. Ardigen phenAID’s multimodal approach improves MoA and bioactivity prediction when applied to a HCS dataset from a Big Pharma company [Internet]. 2025 [cited 2025 Oct 27]. Available from: https://ardigen.com/ardigen-phenaids-multimodal-approach-improves-moa-and-bioactivity-prediction-when-applied-to-a-hcs-dataset-from-a-big-pharma-company/