worldbank/wb-nlp-tools
Natural language processing tools developed by the World Bank's DECAT unit. A suite of text preprocessing and cleaning algorithms for NLP analysis and modeling.
This suite of tools helps researchers and analysts prepare large volumes of text, including text extracted from PDFs, for natural language processing. It takes raw text or PDFs, cleans them by correcting spelling, expanding acronyms, and identifying key phrases, and outputs structured text ready for analysis or modeling. It is aimed at economists, social scientists, and policy researchers working with large text collections.
No commits in the last 6 months.
Use this if you need to transform messy, real-world documents into clean, consistent text for tasks like topic modeling, sentiment analysis, or information extraction.
Not ideal if you primarily work with structured data, require only basic text search, or need highly specialized linguistic analysis not covered by standard cleaning and phrase detection.
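The cleaning steps described above (whitespace repair, acronym expansion, and key-phrase detection) can be sketched with the Python standard library alone. This is an illustrative sketch, not the wb-nlp-tools API: every function name below is hypothetical, the acronym rule assumes definitions appear inline as "Long Form (LF)", phrase detection is a plain bigram frequency count, and spelling correction is left out because it needs a dictionary.

```python
import re
from collections import Counter

# Hypothetical helpers for illustration only; not the wb-nlp-tools API.

# Matches "Gross Domestic Product (GDP)": capitalized words followed by an acronym in parentheses.
ACRONYM_DEF = re.compile(r"((?:[A-Z][a-z]+\s+){1,5}[A-Z][a-z]+)\s+\(([A-Z]{2,6})\)")


def normalize(text):
    """Rejoin words hyphenated across line breaks, then collapse whitespace."""
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    return re.sub(r"\s+", " ", text).strip()


def find_acronyms(text):
    """Map each acronym to its long form when the initials agree."""
    found = {}
    for match in ACRONYM_DEF.finditer(text):
        words, acronym = match.group(1).split(), match.group(2)
        candidate = words[-len(acronym):]  # keep only as many words as the acronym has letters
        if "".join(word[0] for word in candidate) == acronym:
            found[acronym] = " ".join(candidate)
    return found


def expand_acronyms(text, acronyms):
    """Drop the inline "(ACR)" definitions and replace standalone acronyms with the long form."""
    for acronym, long_form in acronyms.items():
        text = text.replace(f" ({acronym})", "")
        text = re.sub(rf"\b{re.escape(acronym)}\b", long_form, text)
    return text


def detect_phrases(tokens, min_count=2):
    """Return the adjacent word pairs that occur at least min_count times."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    return {pair for pair, count in bigrams.items() if count >= min_count}


def clean(raw):
    """Run the three steps and return lowercase tokens with detected phrases joined by "_"."""
    text = normalize(raw)
    text = expand_acronyms(text, find_acronyms(text))
    tokens = re.findall(r"[a-z]+", text.lower())
    phrases = detect_phrases(tokens)
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged
```

Calling `clean()` on a short passage that defines and then reuses an acronym returns a token list in which the acronym is spelled out and any repeated word pair is merged into a single token. A production pipeline would typically use a statistical phrase model and a spell checker rather than these simple rules.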
Stars: 10
Forks: 7
Language: Python
License: MIT
Category: (none listed)
Last pushed: Jun 11, 2022
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/worldbank/wb-nlp-tools"
Open to everyone: 100 requests/day, no key needed. A free key raises the limit to 1,000/day.
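The same request can be made from Python with the standard library. This is a minimal sketch under two assumptions that the page does not confirm: that the path follows the pattern `quality/<category>/<owner>/<repo>`, as the single curl example suggests, and that the endpoint returns JSON. The helper names `build_url` and `fetch_quality` are hypothetical.

```python
import json
import urllib.request

# Base path taken from the curl example; the segment layout after it is an assumption.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_url(category, owner, repo):
    """Build the request URL, assuming the pattern <category>/<owner>/<repo>."""
    return f"{BASE_URL}/{category}/{owner}/{repo}"


def fetch_quality(category, owner, repo, timeout=10):
    """Fetch the record for one repository; assumes the response body is JSON."""
    url = build_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

For this repository the call would be `fetch_quality("nlp", "worldbank", "wb-nlp-tools")`. With the keyless limit of 100 requests/day, it is worth caching the result rather than fetching it on every run.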
Higher-rated alternatives
sloria/TextBlob
Simple, Pythonic, text processing--Sentiment analysis, part-of-speech tagging, noun phrase...
chrismattmann/tika-python
Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called...
cltk/cltk
The Classical Language Toolkit
allenai/scispacy
A full spaCy pipeline and models for scientific/biomedical documents.
wi2trier/cbrkit
Customizable Case-Based Reasoning (CBR) toolkit for Python with a built-in API and CLI.