From Single to Multi-Agent Reasoning: Advancing GeneGPT for Genomics QA

Abstract

Comprehending genomic information is essential for biomedical research, yet extracting data from complex distributed databases remains challenging. Large language models (LLMs) offer potential for genomic Question Answering (QA) but face limitations due to restricted access to domain-specific databases. GeneGPT is the current state-of-the-art system that enhances LLMs by utilizing specialized API calls, though it is constrained by rigid API dependencies and limited adaptability. We replicate GeneGPT and propose GenomAgent, a multi-agent framework that efficiently coordinates specialized agents for complex genomics queries. Evaluated on nine tasks from the GeneTuring benchmark, GenomAgent outperforms GeneGPT by 12% on average, and its flexible architecture extends beyond genomics to various scientific domains needing expert knowledge extraction.

Task & Motivation

The objective is to enable LLMs to comprehend and extract expert knowledge from complex, distributed genomic databases to answer natural language queries.

The work is motivated by several limitations inherent in standard LLMs and in the current state-of-the-art system, GeneGPT:

1. Complexity of Genomic Data: Extracting data from distributed biomedical databases remains a significant challenge for researchers. Standard LLMs struggle with this because they have restricted access to domain-specific databases.
2. Fragility of Existing Systems (GeneGPT): While GeneGPT is effective, it relies on a single-agent architecture with rigid dependencies on specific API formats. This makes the system fragile when interfacing with evolving tools.
3. Context and Focus Issues: GeneGPT relies on extensive context windows, which can lead to "attention dilution" where the model loses focus on the original query.

Experiments

Here, we report the performance and cost of GenomAgent on the GeneTuring tasks compared with GeneGPT's main results. GenomAgent improves both performance and computational efficiency: it attains an average score of 0.93, exceeding the best-performing GeneGPT model (0.83).

In simple tasks (nomenclature and genomic location), our system achieves near-perfect performance with a score of 0.98, surpassing GeneGPT-slim's scores of 0.92 for nomenclature and 0.88 for genomic location. Most notably, in alignment tasks, which are the most challenging tasks for GeneGPT, we achieve a 28.8% improvement.

Computational cost analysis reveals even larger gains. GenomAgent costs only $2.11 in total across all tasks, a 79.0% reduction from the best-performing GeneGPT model ($10.06).
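The headline percentages can be checked directly from the reported numbers. The short sketch below assumes that both the 12% average gain and the 79.0% cost reduction are relative changes with respect to the best-performing GeneGPT model; the helper name `relative_change` is ours and is not part of either system.

```python
def relative_change(new: float, old: float) -> float:
    """Return the change from old to new as a fraction of old."""
    return (new - old) / old


# Figures reported above: average GeneTuring score and total cost in USD.
genomagent_score, genegpt_score = 0.93, 0.83
genomagent_cost, genegpt_cost = 2.11, 10.06

score_gain = relative_change(genomagent_score, genegpt_score)
cost_change = relative_change(genomagent_cost, genegpt_cost)

print(f"relative score gain: {score_gain:.1%}")  # about 12%
print(f"cost reduction: {-cost_change:.1%}")     # about 79%
```

Under this reading, the ten-point absolute gap in average score (0.93 versus 0.83) corresponds to the roughly 12% relative improvement quoted in the abstract.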

Performance-cost tradeoff on GeneTuring.

In addition, we show that GenomAgent is the best choice on this tradeoff, as it achieves a high score at minimal computational expense. In the figure, bubble size indicates normalized cost, and the High Value Region marks high performance at minimal cost.
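One simple way to summarize this tradeoff numerically is score per dollar. This metric is an illustrative assumption of ours, not one reported in the paper; the sketch uses only the average scores and total costs stated above.

```python
# (average score, total cost in USD) as reported in the Experiments section.
systems = {
    "GenomAgent": (0.93, 2.11),
    "GeneGPT (best)": (0.83, 10.06),
}

# Hypothetical efficiency metric: average score obtained per dollar spent.
score_per_dollar = {
    name: score / cost for name, (score, cost) in systems.items()
}

best = max(score_per_dollar, key=score_per_dollar.get)
advantage = score_per_dollar["GenomAgent"] / score_per_dollar["GeneGPT (best)"]

for name, value in score_per_dollar.items():
    print(f"{name}: {value:.3f} score per dollar")
print(f"most cost-efficient: {best} ({advantage:.1f}x)")
```

Other summaries, such as a Pareto comparison across all evaluated systems, would need the per-system numbers from the paper itself.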

BibTeX

@misc{abedini2026singlemultiagentreasoningadvancing,
      title={From Single to Multi-Agent Reasoning: Advancing GeneGPT for Genomics QA}, 
      author={Kimia Abedini and Farzad Shami and Gianmaria Silvello},
      year={2026},
      eprint={2601.10581},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2601.10581}, 
}