The 49% Problem: Why Closing the Lab-AI Loop Starts Beneath the Iceberg

Key Takeaways

  • As of 2022, biopharma scientists were still losing ~49% of their time to manual data retrieval and transformation; lab automation has often deferred that tax rather than removed it. [2022 industry survey]
  • The bottleneck blocking production lab-AI is no longer model quality; it is the data, QC, and provenance layer beneath the agent.
  • A working closed-loop lab needs three engineered layers in order: data infrastructure, knowledge, and agentic orchestration – each is necessary, none is sufficient alone.
  • Human-in-the-loop is not a fallback; the handoff point must be designed into the architecture from the start.
  • When the loop closes, scientific time shifts from preparation to interpretation – and previously discarded negative results become a permissioned data asset.

In 2022, an industry survey found that biopharma scientists still lose 49% of their working time to manual data retrieval and transformation. [2022 industry survey] McKinsey later reported that only 3 of 50 global life-science leaders believed they had solved this manual workload bottleneck. [McKinsey 2025]

That number has not disappeared with lab automation, or with the first wave of AI assistants. In many programs it has simply moved, from the scientist moving files to an agent hallucinating because the data layer beneath it cannot give it reliable inputs.

This is the 49% data tax. And it is the reason most lab-in-the-loop pilots stall short of production.

Where the 49% actually hides

When automated labs scale without an underlying data orchestration layer, organizations don’t scale discovery; they scale inefficiency. The manual work that consumes scientific time and paralyses AI agents usually falls into three categories:

  • Data retrieval friction. Files sit across instrument exports, ELN attachments, assay databases, and local spreadsheets. Annoying for a scientist; disabling for an orchestrating agent.
  • Format normalization and biology-aware QC. Instruments emit files, not AI-ready data products. Without biology-aware QC (for example, control-well anomaly detection in high-content screening (HCS) imaging or drift detection in mass-spec assays), a technically correct image of an anomalous DMSO control still passes and contaminates every downstream model.
  • Context capture. Provenance, protocol variations, batch effects. A scientist can mentally compensate for missing metadata. Without grounded retrieval and provenance, LLMs tend to confabulate rather than abstain.

The first two are infrastructure problems. The third is a design problem: scientific meaning must travel with the data, not in the scientist’s head.
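As a rough illustration of what a biology-aware check at landing could look like, the sketch below flags plates whose DMSO control signal deviates from the rest of the batch, using a robust z-score (median and median absolute deviation). The function names, the plate IDs, the intensity values, and the 3.5 threshold are all assumptions made for this example; a production check would depend on the assay and its controls.

```python
from statistics import median


def robust_z_scores(values):
    """Robust z-scores based on the median and the median absolute deviation (MAD)."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # No spread among the bulk of the values: nothing can be scored reliably.
        return [0.0 for _ in values]
    # 1.4826 rescales the MAD so it is comparable to a standard deviation
    # when the data are approximately normal.
    return [(v - med) / (1.4826 * mad) for v in values]


def flag_anomalous_controls(plate_control_signal, threshold=3.5):
    """Return the plate IDs whose control-well signal is an outlier in the batch."""
    plate_ids = list(plate_control_signal)
    scores = robust_z_scores([plate_control_signal[p] for p in plate_ids])
    return [p for p, z in zip(plate_ids, scores) if abs(z) > threshold]


# Hypothetical per-plate median DMSO control intensities (arbitrary units).
controls = {"P01": 1.00, "P02": 1.02, "P03": 0.98, "P04": 1.01, "P05": 1.65}
flagged = flag_anomalous_controls(controls)
print(flagged)
```

The point of the sketch is the placement, not the statistic: the file for plate P05 would pass any format check, and only a check that knows what a control well should look like stops it before it reaches a model.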

Why the AI pilot does not become a production system

When pilots break in production, the default explanation is model quality. In the programs we work in, that diagnosis is usually wrong. The pilot worked because a data scientist quietly paid the 49% tax, curating datasets, chasing metadata, and hand-verifying QC. The model wasn’t carrying the project; the curation was. Once curation stops, the system breaks.

The answer is not more human curators. It is an engineered data layer that handles ingestion, normalization, QC, and provenance for standard, repeatable tasks, without human intervention.

What a closed-loop lab actually requires

A closed-loop lab is an engineered feedback system with three layers, in order. The framework spans the data modalities most discovery teams actually run on – imaging (HCS), omics (NGS, single-cell), and HTS readouts:

  • Data infrastructure layer. Standardized schemas, real-time ingestion from instruments, and automated, biology-aware QC at landing, not just file-format checks.
  • Knowledge layer. Knowledge graphs that preserve biological context and experimental provenance, and ground LLMs in factual biology. AstraZeneca, for example, has publicly described integrating tens of millions of biomedical relationships from dozens of data sources to identify novel drivers of disease resistance.
  • Agentic layer. Orchestrators that monitor data, trigger workflows, retrieve knowledge, and call predictive models, sitting on top of the structured layers below, not compensating for their absence. Validation and reproducibility checks live here too, so model-driven prioritization is verified before resources are committed.

Each layer is necessary; none is sufficient alone. Data infrastructure without knowledge produces clean but context-poor data. A knowledge graph without automated ingestion goes stale. Agentic orchestration without reliable data is a fluent interface masking uncertainty.
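One way to picture how the layers depend on each other is a record that carries its own provenance, plus a routing rule that hands off to a scientist whenever that provenance or the QC result is missing. The following is a minimal sketch under stated assumptions: the `AssayRecord` fields, the `REQUIRED_PROVENANCE` list, and the `route` function are hypothetical names chosen for illustration, not a description of any specific platform.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical minimum provenance an agent needs before acting on a record.
REQUIRED_PROVENANCE = ("instrument_id", "protocol_version", "batch_id")


@dataclass
class AssayRecord:
    compound_id: str
    readout: float
    qc_passed: bool
    instrument_id: Optional[str] = None
    protocol_version: Optional[str] = None
    batch_id: Optional[str] = None


def missing_provenance(record: AssayRecord) -> List[str]:
    """List the required provenance fields that are absent from the record."""
    return [name for name in REQUIRED_PROVENANCE if getattr(record, name) is None]


def route(record: AssayRecord) -> Tuple[str, List[str]]:
    """Decide whether an agent may proceed or must hand off to a scientist."""
    reasons = [f"missing {name}" for name in missing_provenance(record)]
    if not record.qc_passed:
        reasons.append("failed QC")
    if reasons:
        # Abstain and escalate instead of guessing at the missing context.
        return "escalate_to_scientist", reasons
    return "proceed", []


complete = AssayRecord("CMP-1", 0.82, True, "HCS-01", "v2.1", "B-17")
incomplete = AssayRecord("CMP-2", 0.40, False, instrument_id="HCS-01")

print(route(complete))
print(route(incomplete))
```

In this sketch the handoff is an explicit return value rather than an exception path, which is one way to make the human-in-the-loop point a designed part of the flow.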

Three-layer architecture diagram for closing the lab-to-model loop, showing Data infrastructure, Knowledge layer, and Agentic layer with arrows linking layers to experiments and data flow.

Fig. 1. Three-layer architecture

What changes when the loop closes

When the architecture is right, experimental output arrives structured. QC flags are caught at landing, not weeks downstream. Models know the provenance of every input. Scientific time shifts from preparation to interpretation.

In one Ardigen small-molecule lead-optimization program, this closed-loop architecture compressed compound design cycles from months to days, driven by active learning across multiple assay cycles. As a side gain, the same architecture converts “failed” experiments into a permissioned, monetizable internal asset, either directly or, with engineered permissioning and IP controls, through federated learning.
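For readers unfamiliar with active learning, the selection step of such a loop can be sketched in a few lines. The example below uses one common heuristic, picking the compounds on which an ensemble of models disagrees most; it is a generic illustration and not a description of the method used in the program above. The compound IDs, the predicted values, and the `select_next_batch` function are invented for the example.

```python
from statistics import pvariance


def select_next_batch(ensemble_predictions, batch_size):
    """Pick the compounds whose ensemble predictions disagree the most.

    ensemble_predictions maps a compound ID to the list of values predicted
    for it by each model in the ensemble. High variance is used here as a
    simple proxy for model uncertainty.
    """
    ranked = sorted(
        ensemble_predictions,
        key=lambda compound: pvariance(ensemble_predictions[compound]),
        reverse=True,
    )
    return ranked[:batch_size]


# Hypothetical predicted potencies from a three-model ensemble.
predictions = {
    "CMP-1": [6.1, 6.0, 6.2],
    "CMP-2": [5.0, 7.5, 6.2],
    "CMP-3": [7.0, 7.1, 6.9],
    "CMP-4": [4.0, 6.0, 5.0],
}

next_batch = select_next_batch(predictions, batch_size=2)
print(next_batch)
```

In a closed loop, the selected batch would be sent to the assay, the new readouts would land through the data layer with QC and provenance attached, and the models would be retrained before the next selection round.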

Ardigen has been engineering the data, knowledge, and agentic layers underneath drug discovery for years. The webinar walks through the current state of that work.

Join us live — May 18, 2026

Lab-in-the-Loop: reclaiming the 50% of scientific time lost to data.

Jan Majta, PhD and Sergiusz Wesołowski, PhD will walk through what separates working closed-loop systems from stalled pilots. In 45 minutes, you will see:

  • The three-layer reference architecture, with the integration points to existing LIMS, ELN, and instrument stacks.
  • One production case-study walk-through of an active-learning compound-design program.
  • The human-in-the-loop handoff design pattern that keeps scientific judgment at the high-stakes points.

Save your seat →

Can’t make it live? Register anyway — the same link will host the recording the following week.

Frequently Asked Questions

The “data tax” is the share of biopharma scientists’ time spent on manual data retrieval and transformation instead of scientific work. According to a 2022 industry survey, it amounts to 49% of scientists’ time. Key sources of this tax include instrument file fragmentation, format normalization, and missing experimental metadata.

AI pilots in drug discovery usually stall in production because the pilot was carried by a data scientist manually curating inputs, chasing metadata, hand-verifying QC, and reconciling instrument exports. Once that manual curation stops, the underlying data infrastructure cannot keep the model fed with reliable inputs. The bottleneck is rarely model quality; it is the absence of an engineered data, QC, and provenance layer beneath the model.

A lab-in-the-loop system is a closed-loop architecture in which instrument output, automated QC, knowledge context, and agentic orchestration feed predictive models that, in turn, propose the next experiments. It is built on three engineered layers: data infrastructure, knowledge (typically a knowledge graph), and agentic orchestration, with structured handoffs to scientists at the high-stakes interpretation points.

Knowledge graphs are important for AI in drug discovery because they preserve biological relationships and experimental provenance, and act as a second line of evidence that grounds large language models in factual biology. By integrating millions of biomedical relationships across internal and public sources, a knowledge graph reduces the chance that an LLM will hallucinate when answering target–disease or mechanism-of-action questions.

Moving an AI pilot from proof of concept (POC) to production in drug discovery requires engineering the data layer beneath the model (automated ingestion, biology-aware QC, schema normalization, and provenance capture) so the model receives reliable inputs without manual curation. It also requires a knowledge layer for biological context, and an agentic layer with validation and reproducibility checks before resources are committed.

Technical editing: Ardigen expert Jan Majta, PhD
