omicsNLP/Auto-CORPus
Auto-CORPus is a pipeline developed by a University of Nottingham and Imperial College London collaboration to standardise text and table data extracted from full-text publications. See the Open Access publication at: https://doi.org/10.3389/fdgth.2022.788124.
This tool helps biomedical researchers and text analysts standardize information extracted from scientific publications. It takes raw HTML or PDF files of articles, including associated tables, and converts them into structured BioC-formatted text, JSON tables, and a list of abbreviations. This makes it easier to analyze large volumes of research data programmatically.
Use this if you need to systematically extract and standardize data from a collection of biomedical research papers (HTML or PDF) for text analytics or database population.
Not ideal if you need to retrieve publication files directly from publishers or if your primary data source is not scientific articles in HTML or PDF format.
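To make the "structured BioC-formatted text" output more concrete, here is a minimal sketch of a BioC-style JSON document built from section text. It is an illustration only and does not call Auto-CORPus. The function name `to_bioc_like`, the `section_title` infon key, and the single-character separator used for offsets are all assumptions made for this example; the actual output of the tool may differ.

```python
import json


def to_bioc_like(doc_id, sections):
    """Build a BioC-style dict from (section_title, text) pairs.

    Hypothetical helper for illustration; not part of Auto-CORPus.
    """
    passages = []
    offset = 0
    for title, text in sections:
        passages.append({
            "offset": offset,                    # character offset of this passage
            "infons": {"section_title": title},  # assumed key name
            "text": text,
            "annotations": [],
            "relations": [],
        })
        # Assumption: passages are separated by one character in the source text.
        offset += len(text) + 1
    return {
        "source": "example",
        "date": "",
        "key": "",
        "infons": {},
        "documents": [
            {"id": doc_id, "infons": {}, "passages": passages, "relations": []}
        ],
    }


if __name__ == "__main__":
    doc = to_bioc_like("doc1", [("Abstract", "Short abstract."), ("Methods", "Short methods.")])
    print(json.dumps(doc, indent=2))
```

Keeping each section as a passage with its own offset is what lets downstream text-analytics code map annotations back to positions in the original article.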
Stars: 22
Forks: 11
Language: HTML
License: GPL-3.0
Category:
Last pushed: Mar 16, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/omicsNLP/Auto-CORPus"
Open to everyone: 100 requests/day with no key needed, or 1,000 requests/day with a free key.
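The same request can be sketched in Python using only the standard library. This assumes the endpoint follows the pattern `/quality/<category>/<owner>/<repo>` (inferred from the single URL above) and that the response body is JSON; neither is confirmed by this page, and the function names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(category, owner, repo):
    """Build the request URL. The path pattern is an assumption inferred
    from the one example URL on this page."""
    return f"{BASE_URL}/{category}/{owner}/{repo}"


def fetch_quality(category, owner, repo, timeout=10.0):
    """Fetch the record for one repository.

    Assumes the body is JSON; raises urllib.error.URLError or
    json.JSONDecodeError if the request or the parsing fails.
    """
    url = quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Unauthenticated use is limited to 100 requests/day, so avoid tight loops.
    print(fetch_quality("nlp", "omicsNLP", "Auto-CORPus"))
```

No field names are read from the response here because the page does not document its schema.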
Related tools
Helsinki-NLP/OpusFilter
OpusFilter - Parallel corpus processing toolkit
natasha/corus
Links to Russian corpora + Python functions for loading and parsing
darija-open-dataset/dataset
Darija <-> English dataset
SergeyShk/ruTS
Library for extracting statistics from Russian-language texts.
texttechnologylab/GerParCor
German Parliamentary Corpus (GerParCor)