omicsNLP/Auto-CORPus

Auto-CORPus is a pipeline developed by a University of Nottingham and Imperial College London collaboration to standardise text and table data extracted from full-text publications. See the Open Access publication: https://doi.org/10.3389/fdgth.2022.788124

52 / 100 (Established)

This tool helps biomedical researchers and text analysts standardize information extracted from scientific publications. It takes raw HTML or PDF files of articles, including associated tables, and converts them into structured BioC-formatted text, JSON tables, and a list of abbreviations. This makes it easier to analyze large volumes of research data programmatically.

Use this if you need to systematically extract and standardize data from a collection of biomedical research papers (HTML or PDF) for text analytics or database population.

Not ideal if you need to retrieve publication files directly from publishers or if your primary data source is not scientific articles in HTML or PDF format.
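
To make the BioC-formatted text output mentioned above more concrete, here is a minimal sketch of a BioC-style JSON collection built from plain Python dictionaries. The top-level keys follow the general BioC JSON convention (a collection of documents, each with passages carrying an offset, infons, and text). The helper name `make_bioc_collection` and the `section_title` infon key are hypothetical and chosen for illustration; this is not Auto-CORPus code, and its actual output may use different keys.

```python
import json


def make_bioc_collection(doc_id, sections):
    """Build a BioC-style collection from (section_title, text) pairs.

    Hypothetical helper for illustration only, not part of Auto-CORPus.
    """
    passages = []
    offset = 0
    for title, text in sections:
        passages.append({
            "offset": offset,                     # character offset of this passage
            "infons": {"section_title": title},   # assumed infon key
            "text": text,
            "annotations": [],
            "relations": [],
        })
        # Next passage starts after this text plus one separator character.
        offset += len(text) + 1
    return {
        "source": "example",
        "date": "",
        "key": "",
        "infons": {},
        "documents": [{"id": doc_id, "infons": {}, "passages": passages}],
    }


collection = make_bioc_collection(
    "doc-1",
    [("Abstract", "Short abstract."), ("Methods", "We did things.")],
)
bioc_json = json.dumps(collection, indent=2)
```

Keeping every passage in one collection with running offsets is what lets downstream text-analytics tools address spans of the article consistently.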

biomedical-research scientific-publishing text-analytics data-extraction research-data-management
No Package · No Dependents
Maintenance 13 / 25
Adoption 6 / 25
Maturity 16 / 25
Community 17 / 25

Stars: 22
Forks: 11
Language: HTML
License: GPL-3.0
Last pushed: Mar 16, 2026
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/nlp/omicsNLP/Auto-CORPus"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
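
The same request can be made from a script. The sketch below uses only the Python standard library and mirrors the curl call above. It assumes the URL follows a `category/owner/repo` pattern (inferred from the single example shown) and that the response is JSON, which the page does not state, so it falls back to returning the raw text. The function names are hypothetical.

```python
import json
import urllib.request

# Base path taken from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(category, owner, repo):
    """Build the endpoint URL; the category/owner/repo pattern is an assumption."""
    return f"{BASE_URL}/{category}/{owner}/{repo}"


def fetch_quality(category, owner, repo, timeout=10.0):
    """GET the quality data; return parsed JSON if possible, else the raw body."""
    url = quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        body = resp.read().decode("utf-8")
    try:
        return json.loads(body)
    except json.JSONDecodeError:
        # The response format is not documented here, so keep the raw text.
        return body


# Example call (performs a network request, counts toward the daily limit):
# data = fetch_quality("nlp", "omicsNLP", "Auto-CORPus")
```

Because unauthenticated access is limited to 100 requests/day, it is worth caching the result locally when polling many repositories.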