Automate and scale data annotation pipeline

About the Case Study

In the rapidly evolving field of bioinformatics, managing and integrating vast amounts of public omics data is a significant challenge. To address this, a techbio company asked Ardigen to develop a tailored AI-powered metadata annotation pipeline that automates the extraction of structured insights from unstructured datasets. The solution pulls data from sources such as NCBI GEO and PubMed, then identifies and organizes the key metadata fields needed for downstream analysis, machine learning applications, and insight generation. By combining Large Language Models (LLMs), Retrieval Augmented Generation (RAG), and other AI techniques, the approach significantly reduces manual effort, improves accuracy, and enables scalable data processing.

Goal

The primary objective was to build and integrate an optimized AI assistant for metadata annotation, improving efficiency, accuracy, and standardization across large-scale omics datasets.

Approach

  • Systematic validation with fine-tuning and testing
  • LLM-based metadata extraction with optimized prompts
  • Retrieval Augmented Generation (RAG) for enhanced data retrieval
  • Normalization, ontology mapping, and AI-driven standardization
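
The steps listed above can be illustrated with a minimal Python sketch. It is not Ardigen's implementation: the word-overlap retriever stands in for a real RAG retrieval step, the LLM call itself is left out, and every name here (`retrieve`, `build_prompt`, `normalize`, the vocabulary, the synonym table, and the `EX:` identifiers) is a hypothetical placeholder rather than a real ontology or API.

```python
import difflib
import re

# Hypothetical controlled vocabulary. The EX: identifiers are placeholders,
# not terms from any real ontology.
TISSUE_VOCAB = {
    "liver": "EX:0001",
    "lung": "EX:0002",
    "peripheral blood": "EX:0003",
}

# Hypothetical synonym table applied before vocabulary lookup.
SYNONYMS = {"hepatic tissue": "liver", "pbmc": "peripheral blood"}


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, passages, k=2):
    """Return the k passages sharing the most words with the query.

    A toy stand-in for the retrieval half of RAG; a production system
    would more likely use embedding similarity.
    """
    q = tokenize(query)
    ranked = sorted(passages, key=lambda p: len(q & tokenize(p)), reverse=True)
    return ranked[:k]


def build_prompt(field, context):
    """Assemble an extraction prompt for one metadata field."""
    joined = "\n".join(f"- {c}" for c in context)
    return (
        f"Extract the value of '{field}' from the context. "
        "Answer 'unknown' if it is not stated.\n\n"
        f"Context:\n{joined}\n\n{field}:"
    )


def normalize(raw, vocab=TISSUE_VOCAB, synonyms=SYNONYMS, cutoff=0.8):
    """Map a free-text value to (canonical term, identifier), or None."""
    value = " ".join(raw.lower().split())  # trim and collapse whitespace
    value = synonyms.get(value, value)     # resolve known synonyms
    if value in vocab:
        return value, vocab[value]
    # Fall back to fuzzy matching to absorb small typos.
    close = difflib.get_close_matches(value, list(vocab), n=1, cutoff=cutoff)
    if close:
        return close[0], vocab[close[0]]
    return None  # leave unmapped values for expert review
```

In this sketch, the text returned by the model for a prompt from `build_prompt` would be passed to `normalize`, and a `None` result flags the value for manual curation instead of forcing a match.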

Results & Value:

  • Reduced annotation time from ~3 hours to 5 minutes, a roughly 36-fold speedup
  • Expert-assessed accuracy exceeding 80%
  • Fully integrated and deployed within the client’s cloud environment

This AI-driven solution revolutionizes metadata processing, enabling faster, more reliable, and scalable integration of public omics data, ultimately accelerating research and discovery.

Expert Contribution

Reviewed by: Dr. Piotr Faba, PhD
Role: Director of Software Engineering, AI‑Driven Drug Discovery
Expertise: Data integration, MLOps, cloud-native AI solutions, advanced analytics, life sciences data management
