PMC text mining subset in BioC: about three million full-text articles and growing

Bioinformatics. 2019 Sep 15;35(18):3533-3535. doi: 10.1093/bioinformatics/btz070.

Abstract

Motivation: Interest in text mining full-text biomedical research articles is growing. To facilitate automated processing of nearly 3 million full-text articles (in the PubMed Central® Open Access and Author Manuscript subsets) and to improve interoperability, we convert these articles to BioC, a simple, community-driven data structure, available in either XML or JavaScript Object Notation (JSON) format, for conveniently sharing text and annotations.
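
As an illustration of the data structure described above, the following is a minimal sketch of walking a BioC collection that has already been parsed from JSON. It assumes the usual BioC nesting (a collection holds documents, each document holds passages, and each passage carries an offset, a text string, and an "infons" key-value map). The "section_type" infon key, the helper name iter_passages, and the sample record are assumptions made for this example, not taken from the abstract.

```python
def iter_passages(collection):
    """Yield (document id, section type, text) for every passage in a BioC collection.

    `collection` is a dict as produced by json.loads on a BioC JSON file.
    Missing keys are tolerated so that partial records do not raise.
    """
    for document in collection.get("documents", []):
        for passage in document.get("passages", []):
            infons = passage.get("infons", {})
            # "section_type" is an assumed infon key used here for illustration.
            yield document.get("id"), infons.get("section_type"), passage.get("text", "")


# Hypothetical, hand-written record that mimics the BioC nesting.
sample = {
    "source": "PMC",
    "documents": [
        {
            "id": "PMC0000000",
            "passages": [
                {"offset": 0, "infons": {"section_type": "TITLE"},
                 "text": "An example title", "annotations": []},
                {"offset": 17, "infons": {"section_type": "ABSTRACT"},
                 "text": "An example abstract.", "annotations": []},
            ],
        }
    ],
}

for doc_id, section, text in iter_passages(sample):
    print(doc_id, section, text)
```

Because annotations are stored alongside each passage with offsets into the text, a consumer of this kind can filter by section before running any downstream text mining step.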

Results: The converted articles can be downloaded either via File Transfer Protocol (FTP) for bulk access or via a Web API for updates or for retrieving a more focused collection. Since the Web API became available in 2017, our BioC collection has been widely used by the research community.
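
A focused retrieval through the Web API might look like the sketch below. The abstract does not spell out the request format, so the endpoint pattern used here (a base URL followed by BioC_xml or BioC_json, an article identifier, and an encoding of unicode or ascii) is an assumption that should be checked against the BioC-PMC page before use. The function names build_bioc_url and fetch_bioc are hypothetical.

```python
import urllib.request
from urllib.parse import quote

# Assumed endpoint pattern; verify against the BioC-PMC documentation page.
BASE_URL = "https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi"


def build_bioc_url(article_id, fmt="json", encoding="unicode"):
    """Build the request URL for one article (e.g. a PMC ID or PubMed ID)."""
    if fmt not in ("xml", "json"):
        raise ValueError("fmt must be 'xml' or 'json'")
    if encoding not in ("unicode", "ascii"):
        raise ValueError("encoding must be 'unicode' or 'ascii'")
    # Percent-encode the identifier so it is safe as a single path segment.
    return f"{BASE_URL}/BioC_{fmt}/{quote(str(article_id), safe='')}/{encoding}"


def fetch_bioc(article_id, fmt="json", encoding="unicode", timeout=30):
    """Download one article as BioC text. Requires network access."""
    url = build_bioc_url(article_id, fmt, encoding)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


print(build_bioc_url("PMC1790863"))
```

For whole-collection processing, the FTP bulk download is the better fit; a per-article request of this kind suits updates or small, targeted sets.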

Availability and implementation: https://www.ncbi.nlm.nih.gov/research/bionlp/APIs/BioC-PMC/.

Publication types

  • Research Support, N.I.H., Intramural

MeSH terms

  • Algorithms
  • Biomedical Research
  • Data Mining*
  • PubMed