Reducing Data Curation Time with an LLM-Augmented ETL System: A GEO Case Study
About the poster
Data annotation and curation in the biomedical field is a challenging and time-consuming process. For instance, annotating single-cell studies from the GEO database can take up to 60% of an expert’s time. This burdensome task diverts scientists’ valuable time away from testing novel hypotheses.
Manual annotation of GEO studies can take up to 4 days. Our pipeline reduces this to under 30 minutes – turning a multi-day bottleneck into a near real-time task.
To address this, Ardigen developed a scalable LLM-augmented ETL (Extract, Transform, Load) system that transforms a complex mix of multimodal knowledge into reusable data products. Our system combines state-of-the-art AI models such as LLMs with traditional methods such as fuzzy matching, rule-based parsing, and controlled vocabularies to minimize the risk of AI hallucinations, reaching 80% accuracy as confirmed by experts.
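As an illustration of how fuzzy matching and a controlled vocabulary can keep an LLM-proposed annotation grounded, here is a minimal Python sketch. It is not Ardigen's implementation: the vocabulary, the `ground_label` helper, and the 0.85 similarity cutoff are all hypothetical and chosen only for the example. The idea is that a free-text label is accepted only if it maps, exactly or approximately, onto a known term; anything else is returned as `None` so it can be sent to expert review.

```python
import difflib
import re

# Hypothetical controlled vocabulary: canonical term -> known synonyms.
TISSUE_VOCAB = {
    "peripheral blood mononuclear cell": ["pbmc", "pbmcs"],
    "bone marrow": ["bm"],
    "lung": ["lung tissue", "pulmonary tissue"],
}


def normalize(text):
    """Rule-based cleanup: lowercase and collapse punctuation/whitespace."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def ground_label(candidate, vocab=TISSUE_VOCAB, cutoff=0.85):
    """Map a free-text label onto a canonical term, or return None.

    `candidate` stands in for a label proposed by an LLM; the cutoff
    is an assumed value, not one taken from the poster.
    """
    # Build a lookup from every normalized surface form to its canonical term.
    lookup = {}
    for canonical, synonyms in vocab.items():
        for form in [canonical, *synonyms]:
            lookup[normalize(form)] = canonical

    key = normalize(candidate)
    if key in lookup:  # exact match after normalization
        return lookup[key]

    # Fuzzy fallback: accept only the single closest form above the cutoff.
    close = difflib.get_close_matches(key, list(lookup), n=1, cutoff=cutoff)
    return lookup[close[0]] if close else None


if __name__ == "__main__":
    for label in ["PBMCs", "Bone-marrow", "bone marow", "kidney cortex"]:
        print(label, "->", ground_label(label))
```

Labels that fail to ground are rejected rather than guessed, which is one simple way a pipeline of this kind could stop an unsupported LLM output from reaching the final data product.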
The Impact
Ardigen’s approach drastically reduces the annotation time to just 15 minutes, empowering scientists to focus on higher-level research. The annotated data is easily accessible through an intuitive user experience. By delivering significant operational value through reduced curation overhead, our system accelerates the creation of AI-ready data repositories for drug discovery and clinical research.
This poster was originally presented during the BiotechX Europe 2025 Conference.