08 January 2025

Harnessing Large Language Models (LLMs) for Metadata Annotation to Accelerate Biotech and Pharma Research

This blog post highlights how large language models (LLMs) can serve as AI Assistants to automate metadata annotation, helping expert teams in the curation of biological databases. We present a case study demonstrating the application of LLMs in annotating the NCBI’s Gene Expression Omnibus (GEO) repository to showcase the utility of AI-assisted workflows on translational research and drug discovery.

Table of Contents:

  1. Advantages of LLMs for metadata annotation
  2. Existing tools and frameworks
  3. Case study: Using AI Assistant to integrate and annotate the NCBI’s Gene Expression Omnibus (GEO)
  4. The future of metadata annotation: LLM-based AI assistants for experts

An AI Assistant tailored for metadata annotation is a valuable tool for collecting and unifying the context needed to analyze omics studies and other cohorts of bioinformatic datasets. The approach combines large language models (LLMs) and natural language processing (NLP) techniques to categorize, organize and standardize biological datasets. It improves the accuracy, completeness and usability of the data, and it speeds up annotation by reducing the time-consuming, repetitive workload of manual curation.

Biological metadata often lacks standardization, making it difficult to compare across studies because the relevant knowledge is scattered over multiple inconsistent, predominantly unstructured sources such as publications, abstracts and supplementary data. Metadata includes information such as the study’s purpose, experimental design, sample characteristics, protocols used, data processing methods and relevant ontologies or controlled vocabulary terms describing the dataset. Traditionally, this type of data is annotated manually by expert curators. However, the pace at which new genomic, transcriptomic and other clinically relevant datasets are generated has largely outpaced the capabilities of even the most proficient expert teams.

Utilizing AI assistants based on LLMs can significantly simplify and accelerate this process, allowing researchers to focus on the most important, scientifically relevant aspects of the data instead of repetitive and laborious work. Automation allows researchers to use the data more effectively, implement larger, aggregated datasets (such as ontologies, public data banks and atlases) and extract relevant insights to empower translational, early discovery and target identification research.

Advantages of LLMs for metadata annotation

The main advantage of AI Assistants based on LLMs is that they help curators process vast datasets much faster while maintaining high accuracy. For example, manually annotating a single gene expression study can take approximately 3 hours, while a tailored AI workflow can complete the same task in about 5 minutes, roughly a 36-fold speedup. An optimized LLM excels at identifying and extracting complex biological terminology, such as disease state, tissue and treatment type, from unstructured knowledge modalities such as publications, abstracts and free-text descriptions of experiments. These models are scalable and can be fine-tuned on domain-specific datasets to improve their relevance to specific tasks.

This approach can be effectively applied to entity recognition and classification. For example, a researcher may want to extract gene names, pathways, diseases, experimental conditions or sample types from free-text metadata. LLMs can identify and classify biological entities using contextual understanding, resolve ambiguities in gene and protein names (e.g., whether “BRCA1” refers to the gene, its protein product or the associated cancer syndrome) and suggest appropriate terms for unstructured descriptions, improving interoperability and data reuse. LLMs are also extremely helpful in ontology mapping and standardization: they can automatically map free-text annotations to controlled vocabularies such as the Gene Ontology (GO) and Disease Ontology (DOID). From our experience, LLMs achieve greater than 80% strict accuracy in automated annotation, enabling an efficient Assistant-Curator workflow.

Existing tools and frameworks

Although there are bio-specific LLMs such as BioGPT, PubMedBERT and SciBERT, which are fine-tuned for scientific and biomedical text, their performance and usability are limited. In practice, a stable and usable application can be achieved using one of the state-of-the-art general models, supplemented with the context of ontologies and databases such as UniProt or EFO and optimized within a Retrieval-Augmented Generation (RAG) framework. No less important is the ability to integrate the AI assistant with existing bioinformatic workflows and curation processes, so that it becomes an inherent element of the experts’ toolkit. This ensures the teams see the benefits right away and allows for validation and iterative improvement of the models.
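One way the retrieved ontology context can be spliced into the model prompt under a RAG setup is sketched below. The prompt wording and field names are assumptions for illustration, and the function only builds the prompt string; the actual model call depends on whichever LLM client is used:

```python
def build_annotation_prompt(study_text: str, retrieved_terms: list[str]) -> str:
    """Assemble a prompt that grounds the LLM in retrieved ontology entries."""
    context = "\n".join(f"- {t}" for t in retrieved_terms)
    return (
        "You are a biomedical metadata curator.\n"
        "Using ONLY the controlled vocabulary below, annotate the study.\n"
        f"Controlled vocabulary:\n{context}\n\n"
        f"Study description:\n{study_text}\n\n"
        "Return the fields: tissue, condition, drug, intervention."
    )

prompt = build_annotation_prompt(
    "RNA-seq of tumor biopsies from NSCLC patients treated with cisplatin.",
    ["non-small cell lung carcinoma", "lung", "cisplatin"],
)
print(prompt)
```

Constraining the model to retrieved vocabulary entries is what keeps free-text outputs mappable to controlled terms downstream.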

Case study: Using an AI Assistant to integrate and annotate the NCBI’s Gene Expression Omnibus (GEO)

The NCBI’s Gene Expression Omnibus (GEO) is a public repository that archives and freely distributes microarray, next-generation sequencing and other forms of high-throughput omics data submitted by the research community. In total, 35,000 clinically relevant human gene expression studies are available in GEO. However, a major limitation to leveraging the full richness of GEO is the lack of standardized, consistent annotation that would make this dataset searchable and usable across studies. Annotations such as experimental conditions and sample types are essential for ensuring reproducibility, facilitating accurate analysis and integrating data from multiple sources to derive replicable, high-confidence predictions. However, important features of the experiments are often hidden within paragraphs of the accompanying scientific publications.

At Ardigen, we built an LLM tool to automate the annotation process for GEO. This tool was based on enterprise-grade LLMs such as GPT-4 or Gemini, implemented within a Retrieval-Augmented Generation (RAG) framework and supported by an underlying vector database. We used it to automatically annotate key fields in GEO studies, such as tissue, condition, drug and intervention, demonstrating its utility.
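The vector-database lookup in such a RAG pipeline can be sketched with toy bag-of-words vectors standing in for real embeddings; the ontology snippets below are illustrative, and a production system would use an embedding model and a dedicated vector store:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real model)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Toy "vector store" of ontology snippets (illustrative entries).
docs = [
    "non-small cell lung carcinoma: a carcinoma of the lung",
    "hepatocellular carcinoma: a carcinoma of the liver",
    "colorectal cancer: a cancer of the colon or rectum",
]
vectors = [embed(d) for d in docs]

def retrieve(query: str, k: int = 1) -> list[str]:
    """Return the k stored snippets most similar to the query."""
    q = embed(query)
    scored = sorted(zip(docs, vectors), key=lambda dv: cosine(q, dv[1]), reverse=True)
    return [d for d, _ in scored[:k]]

print(retrieve("gene expression in liver tumor samples, hepatocellular"))
```

The retrieved snippets are what gets injected into the annotation prompt, so the model grounds its answers in curated ontology text rather than in its parametric memory alone.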

The annotation was verified by human experts in biology and evaluated against a pre-existing curated dataset of a few hundred studies. The LLM annotator was able to annotate datasets automatically with good accuracy, processing dozens of datasets in minutes.
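The strict-accuracy check against such a curated gold set can be sketched as follows; the study IDs and annotation values here are illustrative, not the actual evaluation data:

```python
def strict_accuracy(predictions: dict[str, str], gold: dict[str, str]) -> float:
    """Fraction of studies whose predicted annotation exactly matches the curated one
    (case- and whitespace-insensitive)."""
    matched = sum(
        1
        for study_id, expected in gold.items()
        if predictions.get(study_id, "").strip().lower() == expected.strip().lower()
    )
    return matched / len(gold)

# Illustrative curated annotations vs. model output for the "tissue" field.
gold = {"GSE_A": "lung", "GSE_B": "blood leukocytes", "GSE_C": "liver"}
pred = {"GSE_A": "lung", "GSE_B": "blood", "GSE_C": "Liver"}

print(f"strict accuracy: {strict_accuracy(pred, gold):.2f}")
```

As Table 1 below illustrates, strict matching undercounts correct alternative phrasings (e.g., "blood" for "blood leukocytes"), which is one reason expert review complements the automated score.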

Figure 1: Technical overview of the pipeline

To maximize practical value, we connected the AI tool with the Annotator Team’s existing annotation pipelines and internal processes, including live connectors and APIs to facilitate data sharing among the client’s existing services. The solution was fully integrated and deployed within the client’s cloud environment (Figure 1). Ultimately, a cascade of AI models was used to improve the efficiency and performance of the workflow, decreasing the time needed for primary annotation of a single study from 3 hours to 5 minutes, a roughly 36-fold improvement. The expert-assessed accuracy of the solution exceeded 80% on average (Table 1).

The implementation of this custom LLM annotator delivered a significant boost to the productivity and scale of the client’s Annotator Experts Team. It now enables the team to process and standardize tens of thousands of studies, which is critical for their drug discovery research.

Target / Expected            | Predicted by Model
-----------------------------|-----------------------------------
blood leukocytes             | blood
respiratory epithelium       | airway
lung                         | lung
non-small cell lung carcinoma | non-small cell lung cancer
hepatocellular carcinoma     | liver cancer
esophageal cancer            | esophageal cancer
prostate carcinoma           | prostate cancer
cervical cancer              | cervical squamous epithelial cancer
breast cancer                | breast invasive ductal carcinoma
colorectal cancer            | colorectal adenocarcinoma
colorectal cancer            | gastroenteritis
coronary artery disease      | hyperlipidemia

Table 1: Example annotations for the fields “tissue” and “condition”. Note that the LLM frequently returns alternative, correct answers.

The future of metadata annotation: LLM-based AI assistants for experts

In addition to the projects described above, there are several other applications for LLMs. For example, comprehensively annotating a single single-cell atlas can take 4-5 months, while utilizing an LLM can reduce this time to a few weeks. Public data repositories are immense and contain omics modalities beyond gene expression, as well as data from other species. Considering that there are over 7.5 million samples across 240,000 experiments, manually annotating all of them is simply not feasible. With LLM-assisted automation followed by expert curation, such an ambitious vision becomes possible, enabling potentially ground-breaking studies.

In the future, LLMs will become more sophisticated, incorporating human-expert feedback to improve annotation accuracy and expand the capabilities of AI-assisted metadata annotation tools. This includes growing the number of classifiable fields from dozens to hundreds or even thousands, including cell types, treatment conditions, ranges and much more. This will significantly increase the productivity of biotech and pharma research teams and let experts focus on the big-picture vision that empowers scientific progress.

Are you interested in AI in drug discovery and would like more details? Get in touch!
