alinapetukhova/textcl
Text preprocessing package for use in NLP tasks https://pypi.org/project/textcl/
This tool helps data scientists and NLP practitioners prepare messy text data for analysis or model training. It takes raw text, often from sources like OCR or web scrapes, and cleans it by filtering out irrelevant languages, fixing garbled sentences, removing duplicates, and identifying off-topic content. The output is a refined dataset ready for use in building predictive models, classifiers, or text generation systems.
No commits in the last 6 months. Available on PyPI.
Use this if you need to thoroughly clean large volumes of text data that are noisy, multilingual, repetitive, or contain irrelevant sections before performing any Natural Language Processing tasks.
Not ideal if your text data is already perfectly clean, consistently formatted, and free of extraneous content.
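As an illustration of one cleaning step mentioned above (removing duplicates), here is a minimal standard-library sketch. It does not use textcl's API. The function names (`tokenize`, `jaccard`, `drop_near_duplicates`), the Jaccard similarity measure, and the 0.8 threshold are all assumptions chosen for the example. The package's own method and defaults may differ.

```python
from __future__ import annotations

import re


def tokenize(sentence: str) -> set[str]:
    """Lowercase a sentence and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9']+", sentence.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity: shared tokens divided by all distinct tokens."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def drop_near_duplicates(sentences: list[str], threshold: float = 0.8) -> list[str]:
    """Keep the first occurrence of each sentence and skip later ones whose
    similarity to any kept sentence is at least `threshold` (hypothetical default)."""
    kept: list[str] = []
    kept_tokens: list[set[str]] = []
    for sentence in sentences:
        tokens = tokenize(sentence)
        if not tokens:
            continue  # no word tokens at all, e.g. OCR debris
        if any(jaccard(tokens, seen) >= threshold for seen in kept_tokens):
            continue  # near-duplicate of something already kept
        kept.append(sentence)
        kept_tokens.append(tokens)
    return kept


raw = [
    "The invoice was sent on Monday.",
    "The invoice was sent on monday",  # same words, different case/punctuation
    "Payment is due within 30 days.",
    "~~~ ###",                         # contains no word tokens
]
cleaned = drop_near_duplicates(raw)
print(cleaned)
```

Comparing every sentence against every kept sentence is quadratic in the worst case, which is fine for a sketch but would need indexing or hashing for large corpora.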
Stars: 11
Forks: 4
Language: Python
License: MIT
Category:
Last pushed: Aug 09, 2024
Commits (30d): 0
Dependencies: 10
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/alinapetukhova/textcl"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
Higher-rated alternatives
chartbeat-labs/textacy: NLP, before and after spaCy
nltk/nltk_data: NLTK Data
brightertiger/pygarble: Python Package to detect garbled, gibberish text for EN
jfilter/clean-text: 🧹 Python package for text cleaning
prasanthg3/cleantext: An open-source package for python to clean raw text data