Data Journey: Towards Better Target Identification in Drug Discovery
Summary:
- Modern Drug Discovery generates more data than ever, but its utilization is what sets the pace for the industry. Public and proprietary omics datasets often remain siloed, inconsistently annotated, and burdened by batch effects, forcing scientists to spend up to 60% of their time on manual data wrangling rather than on discovery.
- Ardigen’s structured data journey resolves this bottleneck. Automated ingestion pipelines (LLM-augmented ETL and GPU-accelerated workflows), ontology-based standardization, and rigorous FAIRification transform heterogeneous multimodal datasets into AI-ready data products. These foundations enable scalable modeling that integrates single-cell, spatial omics, and chemical structure data.
- When combined with explainable AI frameworks, knowledge graphs, graph neural networks, and MoA-backed validation, this approach moves beyond statistical ranking to biologically grounded target identification.
- In practice, it improves high-quality prediction rates (ROC AUC > 0.8), speeds up target selection (20+ cases so far), streamlines agent design (10x fewer experiments, 4x lower cost), and helps optimize target populations.
- The result: a reproducible, explainable, and scalable target identification process.
The current pharmaceutical industry faces a data paradox. Public data repositories, as well as in-house data generation, are expanding at an exponential rate, yet the journey from raw data to actionable therapeutic insight has never been more arduous. The primary bottleneck is not a lack of data but a lack of insights from it. Datasets remain siloed by inconsistent metadata, varying normalization methods, and assay-specific biases.
For most research organizations, this fragmentation is an existential threat to R&D timelines. Scientists currently spend up to 60% of their time on manual data wrangling, e.g., harmonizing annotations and correcting batch effects, rather than on discovery [1]. At Ardigen, we design the transition from these fragmented repositories to high-fidelity target hypotheses.
1. Data Sourcing: Laying Foundations
A robust data journey begins with the curation and scrutiny of high-dimensional biological data. The goal is to move beyond a reductionist view of single genes toward a systems-level understanding of disease mechanisms. This requires integrating diverse omics layers, each providing a unique perspective on the cellular state.
Ardigen primarily works with proprietary customer data and helps clients leverage public resources to enrich internal research. The client’s own data is subjected to the same rigorous processes as public data, including automated quality control, standardization, and batch correction, enabling its integration with external datasets.
Sourcing from major public repositories is a key component of our strategy to build a structured data infrastructure that, through automation and rigorous curation, ensures repeatable, reliable scientific discoveries.
We have standardized approaches for sources such as CELLxGENE, GEO, JUMP-CP, RxRx, ChEMBL, and the Human Cell Atlas (HCA), as well as for additional databases (3CA, HTAN, and TISCH) used to enrich an organization’s internal data resources.
2. Data Storage, Governance and Compliance: You Can’t Avoid It Anyway
To utilize data effectively and meet rigorous regulatory requirements, pharmaceutical companies need a well-designed storage and management system. Ardigen provides a custom cloud storage solution for large volumes of multimodal data. The system provisions:
- Optimized storage in the cloud or on-premises, depending on customer needs.
- Data anonymization, full logging, and process auditability.
- Secure, scalable architecture that enables seamless transfer and processing of terabytes of information.
For one of our clients, it reduced image storage and processing costs by 50%.
3. Data Ingestion and Processing: Solving the Manual Wrangling Crisis
The ingestion stage is the definitive make-or-break factor for scaling drug discovery. Traditional manual curation is a labor-intensive liability that often requires days of effort to process a single dataset. By deploying LLM-augmented ETL (Extract, Transform, Load) tools paired with GPU-accelerated workflows – e.g., leveraging NVIDIA RAPIDS for single-cell data – this bottleneck is eliminated [1].
By using Ardigen phenAID – a platform designed for multimodal, high-throughput data – researchers can achieve analysis speeds up to 100x faster than standard benchmarks through GPU and custom optimizations [2].
Automating metadata parsing and harmonization at ingestion reduces the processing window from days to under an hour while maintaining 90% accuracy [1,3]. This shift allows R&D teams to refocus their resources on downstream modeling rather than upstream cleanup.
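To make the idea concrete, below is a minimal, illustrative Python sketch of metadata harmonization at ingestion. The column names, vocabulary, and review rule are hypothetical stand-ins for the richer, LLM-assisted mappings a production ETL pipeline would use:

```python
import pandas as pd

# Hypothetical controlled vocabulary: free-text tissue labels -> harmonized terms.
# In an LLM-augmented pipeline, new mappings would be proposed automatically and
# reviewed by a curator before being added here.
TISSUE_VOCAB = {
    "colon - sigmoid": "sigmoid colon",
    "colon, sigmoid": "sigmoid colon",
    "large intestine": "colon",
    "pbmc": "peripheral blood mononuclear cell",
}

def harmonize_metadata(df: pd.DataFrame) -> pd.DataFrame:
    """Standardize a raw sample sheet: normalize column names and casing,
    map tissue labels to the controlled vocabulary, and flag rows that
    still need manual curation."""
    out = df.copy()
    out.columns = [c.strip().lower().replace(" ", "_") for c in out.columns]
    tissue = out["tissue"].str.strip().str.lower()
    out["tissue_harmonized"] = tissue.map(TISSUE_VOCAB).fillna(tissue)
    out["needs_review"] = (
        ~tissue.isin(TISSUE_VOCAB.keys()) & ~tissue.isin(set(TISSUE_VOCAB.values()))
    )
    return out

# Example: a raw, GEO-style sample sheet with inconsistent labels.
raw = pd.DataFrame({
    "Sample ID": ["S1", "S2", "S3"],
    "Tissue": ["Colon - sigmoid", "PBMC", "unknown biopsy"],
})
print(harmonize_metadata(raw))
```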
For one of our clients, the streamlined ingestion and processing pipeline reduced former costs by 80% [1,3].
4. Data Curation and FAIRification: Why AI-Ready Atlases Are the New R&D Currency
Building a repository is insufficient for modern discovery. The objective must be to create AI-ready data products. This requires a rigorous FAIRification process and the use of strict ontologies, such as Oncotree, to ensure cross-project consistency.
A senior strategist understands that standardization is the prerequisite for meaningful cross-cohort meta-analysis. By sourcing and harmonizing data from diverse public repositories, including CELLxGENE, the Human Cell Atlas (HCA), GEO, 3CA, HTAN, TISCH, and HUSCH, researchers can build massive, high-fidelity single-cell atlases [1,3].
A prime example of this scale is our GI tract atlas, which integrates data from over 518,000 cells across 94 patients to compare healthy tissue with Crohn’s disease and colon cancer [1]. These atlases allow for reliable cell typing and spatial proximity analysis, providing a standardized foundation for projects in Oncology, Immunology, and beyond.
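As a simplified illustration of how such an atlas can be assembled, the Python sketch below concatenates per-study single-cell datasets, maps free-text diagnoses onto ontology-style codes, and applies a basic batch correction. The mapping table, labels, and preprocessing choices are illustrative assumptions rather than our exact curation workflow:

```python
import anndata as ad
import scanpy as sc

# Illustrative mapping from free-text diagnoses to OncoTree-style codes; non-tumor
# conditions (healthy tissue, Crohn's disease) are kept under custom labels.
DISEASE_TO_CODE = {
    "colon adenocarcinoma": "COAD",
    "crohn's disease": "CROHN",
    "healthy": "NORMAL",
}

def build_atlas(datasets: dict) -> ad.AnnData:
    """Concatenate per-study AnnData objects into a single atlas with
    harmonized disease labels and a simple cross-study batch correction."""
    atlas = ad.concat(datasets, label="dataset", join="inner")
    atlas.obs["disease_code"] = (
        atlas.obs["disease"].str.lower().map(DISEASE_TO_CODE).fillna("UNMAPPED")
    )
    sc.pp.normalize_total(atlas, target_sum=1e4)
    sc.pp.log1p(atlas)
    sc.pp.combat(atlas, key="dataset")  # batch correction across studies
    return atlas

# Usage: build_atlas({"study_A": adata_a, "study_B": adata_b, ...})
```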
5. Data Accessibility and Exploration: Putting Power Back in the Hands of Biologists
The requirement for sophisticated coding skills simply to browse a data library is a significant barrier to discovery. To democratize research, it is essential to build code-free exploration tools.
Our Dedicated Explorer applications provide a code-free interface for interactive expression browsing, gene scoring, multi-gene correlation, and visualization of cell/tissue composition via UMAPs [1]. This allows biologists to verify a dataset’s modeling potential before committing significant computational resources, effectively putting the power of exploratory data analysis (EDA) back in the hands of subject matter experts.
A modern extension of this paradigm is the integration of agentic, LLM-based tools that provide conversational access to complex datasets and lightweight analytical workflows. Scientists can interact with data through natural-language queries, iteratively refine hypotheses, request cohort stratifications, or trigger downstream computations without writing a line of code or navigating dashboards.
These AI agents act as orchestration layers, translating biological questions into executable pipelines, retrieving relevant subsets, performing statistical summaries, and contextualizing results against prior knowledge.
Of course, this approach does not replace bioinformatics workflows. But it meaningfully accelerates early-stage sensemaking, lowers the cognitive barrier to data interrogation, and shortens the feedback loop between hypothesis generation and quantitative validation.
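The sketch below illustrates the orchestration idea in its simplest possible form: a registry of named analysis tools that an agent can dispatch to. In a real system an LLM would parse the scientist's question and select the tool and its arguments; the tool names, columns, and gene symbols here are assumptions for illustration only:

```python
from typing import Callable
import pandas as pd

# Registry of analysis "tools" an agent could call after parsing a question.
TOOLS: dict = {}

def tool(name: str):
    """Decorator that registers a function under a tool name."""
    def register(fn: Callable) -> Callable:
        TOOLS[name] = fn
        return fn
    return register

@tool("subset_cohort")
def subset_cohort(adata, value: str, column: str = "disease_code"):
    """Return the cells matching a cohort definition, e.g. disease_code == 'COAD'."""
    return adata[adata.obs[column] == value].copy()

@tool("mean_expression")
def mean_expression(adata, gene: str, group_by: str = "cell_type"):
    """Average expression of one gene per cell-type group."""
    values = adata[:, gene].X
    values = values.toarray().ravel() if hasattr(values, "toarray") else values.ravel()
    return pd.Series(values, index=adata.obs_names).groupby(adata.obs[group_by]).mean()

# A conversational front end could translate "What is EPCAM expression per cell type
# in colon cancer samples?" into two tool calls:
# tumor = TOOLS["subset_cohort"](atlas, "COAD")
# TOOLS["mean_expression"](tumor, "EPCAM")
```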
6. Exploratory Analysis: Better Safe Than Sorry
This stage acts as a bridge between data preparation (Data Readiness) and advanced AI modeling. It allows researchers to test hypotheses and detect anomalies in the data before moving on to the costly algorithm training phase. Biologists can also make a preliminary assessment of the return on investment of further computational analyses at this point.
Example methods of exploratory analysis are:
- Differential expression analysis, the most basic yet still powerful method, identifies genes with significantly different activity levels between diseased and healthy states. For example, in colorectal cancer, this analysis helps identify genes upregulated in specific tissue compartments, such as the epithelium or immune cells.
- Gene Set Enrichment Analysis (GSEA) moves the focus from individual genes to entire biological pathways, allowing researchers to identify which molecular processes (e.g., angiogenesis, cell migration, or response to stimuli) are key to a specific disease.
- RNA velocity analysis is an advanced technique used in single-cell data to predict cells’ future states based on the ratio of unspliced to spliced mRNA. This enables the analysis of cell development trajectories and tissue change dynamics, which is essential for understanding disease progression.
- Unsupervised clustering reveals structure across multiple datasets and, together with the anomaly detection algorithms that often accompany it, flags dataset-specific artifacts and unexpected quality issues. A minimal sketch combining clustering and differential expression follows this list.
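Below is a minimal scanpy-based sketch of such an exploratory pass; the file path, label columns, and group names are assumptions, and the atlas is assumed to be already normalized and log-transformed:

```python
import scanpy as sc

# Load a (hypothetical) harmonized, normalized atlas produced by the earlier stages.
adata = sc.read_h5ad("gi_tract_atlas.h5ad")

# Unsupervised clustering and a 2D embedding for visual quality control.
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
sc.pp.pca(adata, n_comps=50)
sc.pp.neighbors(adata, n_neighbors=15)
sc.tl.leiden(adata, key_added="cluster")
sc.tl.umap(adata)

# Differential expression: genes up- or down-regulated in disease vs. healthy tissue.
sc.tl.rank_genes_groups(
    adata, groupby="disease", groups=["colon_cancer"],
    reference="healthy", method="wilcoxon",
)
top_markers = sc.get.rank_genes_groups_df(adata, group="colon_cancer")
print(top_markers[["names", "logfoldchanges", "pvals_adj"]].head(20))
```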
7. AI Training and Modeling: The Multimodal Ranking Advantage
Precision target identification depends on aggregating signals across multiple modalities. Robust, biologically valid candidates are identified only when data from single-cell and spatial omics are integrated with other molecular datasets.
Ardigen Target ID algorithms rank candidates based on expression potency, specificity, and prevalence, effectively filtering noise to surface targets that remain consistent across heterogeneous biological contexts. This multimodal approach minimizes the ‘false start’ problem in wet-lab validation, where millions of dollars are often lost on candidates that lack a robust data foundation.
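A toy version of such multimodal rank aggregation is sketched below. The evidence columns, values, and equal-weight ranking rule are illustrative assumptions, not our actual scoring scheme:

```python
import pandas as pd

# Illustrative per-target evidence table aggregated across modalities.
evidence = pd.DataFrame({
    "target":      ["GENE_A", "GENE_B", "GENE_C"],
    "potency":     [2.8, 1.1, 3.5],   # e.g. log fold-change in diseased tissue
    "specificity": [0.9, 0.4, 0.2],   # fraction of expression confined to the target compartment
    "prevalence":  [0.7, 0.8, 0.3],   # fraction of patients in which the signal recurs
}).set_index("target")

# Rank each evidence axis independently, then combine; a target must score well
# everywhere to rise to the top, which filters out single-modality noise.
ranks = evidence.rank(ascending=False)
evidence["combined_rank"] = ranks.mean(axis=1)
print(evidence.sort_values("combined_rank"))
```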
In drug discovery, a statistical rank is meaningless without a biological rationale. To close the trust gap, AI model predictions must be explainable and backed by Mechanism of Action (MoA). We bridge this gap by integrating Biomedical Knowledge Graphs and leveraging foundations such as PrimeKG, using Graph Neural Networks (GNNs) for link prediction [1].
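As a rough illustration of link prediction on a knowledge graph, the sketch below trains a small graph neural network to score candidate disease-gene edges. It is a minimal example using PyTorch Geometric, with learned node embeddings and a dot-product decoder; the architecture and hyperparameters are assumptions, not our production model:

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv
from torch_geometric.utils import negative_sampling

class LinkPredictor(torch.nn.Module):
    """GCN encoder over the knowledge graph plus a dot-product edge decoder."""
    def __init__(self, num_nodes: int, hidden: int = 64):
        super().__init__()
        self.emb = torch.nn.Embedding(num_nodes, hidden)
        self.conv1 = GCNConv(hidden, hidden)
        self.conv2 = GCNConv(hidden, hidden)

    def encode(self, edge_index):
        x = self.conv1(self.emb.weight, edge_index).relu()
        return self.conv2(x, edge_index)

    def decode(self, z, pairs):
        # Score a batch of (source, target) node pairs.
        return (z[pairs[0]] * z[pairs[1]]).sum(dim=-1)

def train_step(model, optimizer, edge_index, num_nodes):
    """One training step: known edges are positives, sampled non-edges negatives."""
    model.train()
    optimizer.zero_grad()
    z = model.encode(edge_index)
    neg = negative_sampling(edge_index, num_nodes=num_nodes,
                            num_neg_samples=edge_index.size(1))
    logits = torch.cat([model.decode(z, edge_index), model.decode(z, neg)])
    labels = torch.cat([torch.ones(edge_index.size(1)), torch.zeros(neg.size(1))])
    loss = F.binary_cross_entropy_with_logits(logits, labels)
    loss.backward()
    optimizer.step()
    return float(loss)
```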
To ensure reliability, we employ a GraphRAG (Retrieval-Augmented Generation) system to synthesize evidence-based MoA reports [1]. Tools such as SHAP, LIME, and Attention mechanisms built into GNNs allow us to identify which molecular interactions or signaling pathways contributed to the target’s high ranking. This provides the transparent reasoning and referencing necessary to move from a numerical prioritization score to a confident scientific hypothesis.
By using algorithms such as Personalized PageRank for novelty scoring, our framework identifies high-impact targets that traditional methods might miss (targets that have never been widely described in the scientific literature).
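The idea behind graph-based novelty scoring can be shown with a toy Personalized PageRank run over a handful of hand-made nodes; the graph structure and names below are invented for the example:

```python
import networkx as nx

# Toy knowledge graph: genes, pathways, and a disease connected by known relations.
G = nx.Graph()
G.add_edges_from([
    ("DiseaseX", "PathwayA"), ("PathwayA", "GENE_1"), ("PathwayA", "GENE_2"),
    ("DiseaseX", "GENE_3"),   ("GENE_2", "GENE_4"),   ("GENE_4", "PathwayB"),
])

# Personalized PageRank seeded at the disease node: scores reflect multi-hop
# proximity to the disease, surfacing genes (e.g. GENE_4) with no direct edge to it.
seed = {node: (1.0 if node == "DiseaseX" else 0.0) for node in G}
scores = nx.pagerank(G, alpha=0.85, personalization=seed)

gene_scores = sorted(
    ((n, s) for n, s in scores.items() if n.startswith("GENE")),
    key=lambda item: item[1], reverse=True,
)
print(gene_scores)
```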
8. Automated Insights and Discovery: Success (Nearly) on Autopilot
This is the final analytical stage, where prepared, high-fidelity data products are converted into actionable therapeutic hypotheses and validated targets. It moves beyond simple data description to provide the predictive power and biological rationale that support confident decisions.
The discovery process is designed as an automated Lab-in-the-loop. As new data is ingested into the system (e.g., from public repositories or proprietary client data), the knowledge graph and predictive models are continuously refined, automatically updating findings and prioritizing the target hypothesis list.
Fig. 1. Automated Lab-in-the-Loop System
Versioning is another important element: it allows teams to track a model’s evolution and return to earlier iterations, an essential feature in rigorous R&D processes. It is part of a secure, compliant, and scalable architecture that enables the processing of terabytes of data and the management of multiple models simultaneously.
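A deliberately simplified sketch of model versioning is shown below: each trained model is stored under a content-derived version tag together with its metadata, so earlier iterations can be reloaded at any time. Production systems would typically rely on dedicated experiment-tracking and registry tooling rather than this minimal file-based approach:

```python
import hashlib
import json
import pickle
import time
from pathlib import Path

REGISTRY = Path("model_registry")  # illustrative location for stored model versions

def register_model(model, metadata: dict) -> str:
    """Persist a trained model under a content-derived version tag with metadata."""
    REGISTRY.mkdir(exist_ok=True)
    blob = pickle.dumps(model)
    version = hashlib.sha256(blob).hexdigest()[:12]
    (REGISTRY / f"{version}.pkl").write_bytes(blob)
    record = {"version": version,
              "created": time.strftime("%Y-%m-%dT%H:%M:%S"),
              **metadata}
    (REGISTRY / f"{version}.json").write_text(json.dumps(record, indent=2))
    return version

def load_model(version: str):
    """Return to an earlier iteration by its version tag."""
    return pickle.loads((REGISTRY / f"{version}.pkl").read_bytes())
```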
The practical application of our solutions has delivered significant results for pharmaceutical R&D clients. Custom AI models for modality representation and prediction brought a 30% improvement in the number of high-quality predictions (ROC AUC > 0.8). In one case, integrating an AI model into the daily workflow of over 100 scientists at our client’s organization yielded 20 novel target nominations [2,3].
| Metric | Impact |
|---|---|
| ROC AUC | >0.8 |
| Target nominations | 20+ |
| Experiments required | 10x fewer |
| Cost reduction | 4x lower |
| Ingestion speed | 100x faster |
Table 1. Key performance indicators for the performance and efficiency of AI-driven target identification.
The Future of the Data Journey
The transition from fragmented data to validated therapeutic candidates is a proven reality. This structured data journey has already supported the nomination of over 20 novel targets and driven a 30% improvement in the volume of high-quality predictions, achieving a ROC AUC > 0.8 [2,3]. Thanks to skilled experts in biology and engineering, we enable the R&D ecosystem to move with unprecedented speed and confidence.
If your research team could reclaim the 60% of time currently lost to data wrangling, what breakthrough would you reach by next year? Give your organization the chance to find out.
Boost Your Target Identification
Author: Martyna Piotrowska
Technical editing: Jan Majta, PhD, Ardigen expert
Bibliography
1. Kupś I, Wesołowski S, Widawski J, et al. An End-to-End Data-to-Insights Journey for scRNA and Spatial Omics with Knowledge Graphs [Poster]. Festival of Genomics and Biodata, London; 2026 [cited 2026 Feb 20]. Available from: https://ardigen.com/poster-from-public-repositories-to-target-hypotheses/
2. Ardigen. Scalable prediction pipelines for AI-driven morphological profiling [Case Study]. 2024.
3. Ardigen. From complex data to AI-driven drug discovery insights [Brochure]. 2026.
4. Ardigen. Using single cell and spatial data for knowledge-graph augmented target mining system [Case Study]. 2024.