Automating GEO Metadata Annotation with LLMs
About the poster
NCBI’s Gene Expression Omnibus (GEO) provides valuable gene expression and functional genomics data, but its records lack consistent, standardized annotation. Annotations such as experimental conditions and sample types are essential for making datasets searchable, comparable, and usable across studies. They allow data from multiple sources to be integrated, support accurate analysis, and make results reproducible. Manual annotation, however, is time-consuming and error-prone, which slows research.
To address these challenges, we developed an automated tool based on large language models (LLMs) that streamlines the annotation process. The tool detects and extracts relevant metadata, improving consistency and reducing human error. Our minimum viable product (MVP) automatically annotates four key fields in GEO studies: Condition, Tissue, Drug, and Intervention. It demonstrates the potential of AI-driven techniques to improve annotation accuracy and accelerate biological research.
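As a rough illustration of what annotating the four fields might involve, the Python sketch below builds an extraction prompt and validates a model reply against a fixed schema. It is not the poster's implementation. The prompt wording, the JSON reply format, and the names `build_prompt`, `parse_annotation`, `annotate`, and `fake_llm` are all assumptions made for this example, and the model is a stub rather than a real LLM call.

```python
import json
from typing import Callable, Dict, Optional

# The four fields named in the poster abstract.
FIELDS = ("Condition", "Tissue", "Drug", "Intervention")


def build_prompt(study_text: str) -> str:
    """Build an extraction prompt (hypothetical wording, not the authors')."""
    field_list = ", ".join(FIELDS)
    return (
        "Extract the following fields from the GEO study description below: "
        f"{field_list}.\n"
        "Reply with a single JSON object using exactly those keys. "
        "Use null for any field the text does not state.\n\n"
        f"Study description:\n{study_text}"
    )


def parse_annotation(raw_reply: str) -> Dict[str, Optional[str]]:
    """Keep only the expected keys; anything missing or malformed becomes None."""
    try:
        data = json.loads(raw_reply)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}
    result: Dict[str, Optional[str]] = {}
    for field in FIELDS:
        value = data.get(field)
        if isinstance(value, str) and value.strip():
            result[field] = value.strip()
        else:
            result[field] = None
    return result


def annotate(study_text: str, llm: Callable[[str], str]) -> Dict[str, Optional[str]]:
    """Run one study description through any prompt-to-text callable."""
    return parse_annotation(llm(build_prompt(study_text)))


def fake_llm(prompt: str) -> str:
    """Stub standing in for a real model call (assumption for this sketch)."""
    return json.dumps({
        "Condition": "breast cancer",
        "Tissue": "mammary gland",
        "Drug": "tamoxifen",
        "Intervention": None,
        "Extra": "ignored",
    })


if __name__ == "__main__":
    print(annotate("MCF-7 xenografts treated with tamoxifen.", fake_llm))
```

Restricting the output to a fixed set of keys and mapping unparseable replies to empty values is one simple way to keep annotations consistent across studies. A real pipeline would likely also normalize the extracted values against controlled vocabularies.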
This poster was originally presented at the BioIT 2025 Conference.