SyRACT: Zero-shot Biomedical Document level Relation Extraction with Synergistic RAG and CoT

Bioinformatics. 2025 Jun 19:btaf356. doi: 10.1093/bioinformatics/btaf356. Online ahead of print.

Abstract

Motivation: Advances in large language models (LLMs) have opened new opportunities for biomedical document-level relation extraction (BioDocRE). However, LLMs applied to relation extraction often suffer from hallucinated output, insufficient reasoning capability, and a lack of interpretability.

Results: To address these issues, we propose the SyRACT (Synergistic Retrieval Augmented Generation and Chain of Thought) framework for high-precision relation extraction in biomedical documents. The framework is built around three core strategies: (i) reframing the relation extraction task as a question answering problem to better align with the processing logic of LLMs; (ii) leveraging an external database constructed from PubMed to provide LLMs with rich and reliable contextual information, thus mitigating hallucination; and (iii) constructing a Chain of Thought specific to BioDocRE tasks, thereby enhancing the model's reasoning ability and the interpretability of its output. We validated this approach on three biomedical relation extraction datasets: CDR, GDA, and ADE. Experimental results show that SyRACT improves F1 scores by 11.04%, 9.10%, and 41.00% on the three datasets, respectively, compared to the DocRE method, which uses standard prompts for LLMs.
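
As an illustration only, the three strategies above can be sketched as a single prompt-construction step. This is a minimal sketch under stated assumptions, not the SyRACT implementation: the `Passage` type, the `retrieve` and `build_prompt` functions, the token-overlap retriever, the in-memory corpus, and the prompt wording are all hypothetical stand-ins. In particular, the retriever below is a toy substitute for the PubMed-derived database described in the abstract, and no LLM is called.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Passage:
    """A retrieved snippet (hypothetical type; stands in for a PubMed-derived record)."""
    source_id: str
    text: str


def _tokens(text: str) -> set:
    """Lowercase alphanumeric tokens, ignoring punctuation."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, corpus: List[Passage], k: int = 2) -> List[Passage]:
    """Toy retriever (assumption): rank passages by token overlap with the query."""
    q = _tokens(query)
    ranked = sorted(corpus, key=lambda p: len(q & _tokens(p.text)), reverse=True)
    return ranked[:k]


def build_prompt(document: str, head: str, tail: str, relation: str,
                 corpus: List[Passage], k: int = 2) -> str:
    """Assemble a QA-style prompt with retrieved context and step-by-step instructions."""
    # (i) Relation extraction reframed as a yes/no question.
    question = f"Does the document state that {head} {relation} {tail}?"
    # (ii) External context retrieved for the entity pair.
    evidence = retrieve(f"{head} {tail}", corpus, k)
    context = "\n".join(f"[{p.source_id}] {p.text}" for p in evidence)
    # (iii) Explicit reasoning steps (illustrative wording, not the paper's template).
    steps = (
        "1. Locate every mention of both entities in the document.\n"
        "2. Check whether any sentence or sentence chain links them.\n"
        "3. Compare that link with the retrieved context.\n"
        "4. Explain the reasoning, then give the final answer."
    )
    return (
        f"Document:\n{document}\n\n"
        f"Retrieved context:\n{context}\n\n"
        f"Question: {question}\n\n"
        f"Reason step by step:\n{steps}\n\n"
        "Answer: yes or no."
    )
```

A call such as `build_prompt(doc_text, "aspirin", "bleeding", "induces", corpus)` returns one string that could be sent to any LLM; the relation phrase `"induces"` is likewise an illustrative choice rather than a label taken from the datasets.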

Availability: Our source code and data are available at https://github.com/donggggxin/SyRACT.

Supplementary information: Supplementary data are available at Bioinformatics online.

Keywords: Chain of thought; Document level relation extraction; Retrieval augmented generation.