omicsNLP/Auto-CORPus

The Auto-CORPus pipeline was developed by a University of Nottingham and Imperial College London collaboration to standardise text and table data extracted from full-text publications. See the Open Access publication at: https://doi.org/10.3389/fdgth.2022.788124.

This tool helps biomedical researchers and text analysts standardize information extracted from scientific publications. It takes raw HTML or PDF files of articles, including associated tables, and converts them into structured BioC-formatted text, JSON tables, and a list of abbreviations. This makes it easier to analyze large volumes of research data programmatically.
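To make the "structured BioC-formatted text" output more concrete, here is a minimal sketch of a BioC-style JSON collection built with the Python standard library. It does not use or reproduce the Auto-CORPus API: the function name `to_bioc_collection`, the `section_title` infon key, the `source` value, and the document ID `PMC0000000` are all hypothetical placeholders, and the real pipeline's output may carry different or additional fields.

```python
import json


def to_bioc_collection(doc_id, sections):
    """Build a BioC-style JSON collection from (section_title, text) pairs.

    Illustrative only: this is not how Auto-CORPus is implemented.
    """
    passages = []
    offset = 0
    for title, text in sections:
        passages.append({
            "offset": offset,                      # character offset of this passage
            "infons": {"section_title": title},    # hypothetical infon key
            "text": text,
            "annotations": [],
            "relations": [],
        })
        # Assumption: passages are separated by one character in the full text.
        offset += len(text) + 1

    return {
        "source": "example",                       # hypothetical source label
        "date": "",
        "key": "",
        "infons": {},
        "documents": [
            {"id": doc_id, "infons": {}, "passages": passages, "relations": []}
        ],
    }


collection = to_bioc_collection(
    "PMC0000000",  # hypothetical document ID
    [("Abstract", "We study X."), ("Methods", "We did Y.")],
)
print(json.dumps(collection, indent=2))
```

Keeping each passage's character offset alongside its section label is what lets downstream text-mining tools map annotations back to positions in the article.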

Use this if you need to systematically extract and standardize data from a collection of biomedical research papers (HTML or PDF) for text analytics or database population.

Not ideal if you need to retrieve publication files directly from publishers or if your primary data source is not scientific articles in HTML or PDF format.

biomedical-research scientific-publishing text-analytics data-extraction research-data-management

No Package No Dependents

Maintenance 13 / 25

Adoption 6 / 25

Maturity 16 / 25

Community 17 / 25

Language: HTML

License: GPL-3.0

Related tools

Helsinki-NLP/OpusFilter

OpusFilter - Parallel corpus processing toolkit

natasha/corus

Links to Russian corpora + Python functions for loading and parsing

darija-open-dataset/dataset

darija <-> english dataset

SergeyShk/ruTS

A library for extracting statistics from Russian-language texts.

texttechnologylab/GerParCor

German Parliamentary Corpus (GerParCor)
