worldbank/wb-nlp-tools

Natural language processing tools developed by the World Bank's DECAT unit. A suite of text preprocessing and cleaning algorithms for NLP analysis and modeling.

Emerging

This suite of tools helps researchers and analysts efficiently prepare large volumes of text from documents like PDFs for natural language processing. It takes raw text or PDFs, cleans them by correcting spelling, expanding acronyms, and identifying key phrases, and outputs high-quality, structured text ready for analysis or modeling. This is ideal for economists, social scientists, or policy researchers working with extensive textual data.
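The cleaning steps described above can be illustrated with a minimal standard-library sketch. This is not the wb-nlp-tools API: the function names, the `ACRONYMS` table, and the bigram-counting heuristic for key phrases are all assumptions made for illustration, and spelling correction is omitted because it needs a dictionary resource.

```python
import re
from collections import Counter

# Hypothetical acronym table; a real pipeline would load or learn its own.
ACRONYMS = {
    "GDP": "gross domestic product",
    "NLP": "natural language processing",
}


def clean_text(raw: str) -> str:
    """Rejoin words hyphenated across line breaks, then collapse whitespace."""
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", raw)
    return re.sub(r"\s+", " ", text).strip()


def expand_acronyms(text: str, table: dict[str, str] = ACRONYMS) -> str:
    """Replace whole-word acronyms with their expansions."""
    if not table:
        return text
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, table)) + r")\b")
    return pattern.sub(lambda m: table[m.group(1)], text)


def frequent_bigrams(text: str, min_count: int = 2) -> list[str]:
    """Return word pairs seen at least min_count times (a crude phrase detector).

    Simplification: pairs are counted across sentence boundaries.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(zip(tokens, tokens[1:]))
    return [" ".join(pair) for pair, n in counts.most_common() if n >= min_count]
```

Chaining the three functions, for example `frequent_bigrams(expand_acronyms(clean_text(raw)))`, mirrors the order in the description: clean first, expand acronyms next, then look for recurring phrases.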

No commits in the last 6 months.

Use this if you need to transform messy, real-world documents into clean, consistent text for tasks like topic modeling, sentiment analysis, or information extraction.

Not ideal if you primarily work with structured data, require only basic text search, or need highly specialized linguistic analysis not covered by standard cleaning and phrase detection.

policy-research social-science-research economic-analysis development-studies document-analysis

Stale (6 months), no package, no dependents

Maintenance 0 / 25

Adoption 5 / 25

Maturity 16 / 25

Community 16 / 25

Language: Python

License: MIT

Higher-rated alternatives

sloria/TextBlob

Simple, Pythonic, text processing--Sentiment analysis, part-of-speech tagging, noun phrase...

chrismattmann/tika-python

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called...

cltk/cltk

The Classical Language Toolkit

allenai/scispacy

A full spaCy pipeline and models for scientific/biomedical documents.

wi2trier/cbrkit

Customizable Case-Based Reasoning (CBR) toolkit for Python with a built-in API and CLI.
