citiususc/pyplexity
Cleaning tool for web-scraped text
This tool helps researchers, data analysts, and content strategists clean large volumes of web-scraped text. It takes raw HTML or WARC files, strips HTML tags and unwanted 'boilerplate' sentences, and outputs cleaner, more focused text or documents. It suits anyone who needs to prepare web data for analysis or further processing.
No commits in the last 6 months. Available on PyPI.
Use this if you need to systematically remove noise, like HTML tags or irrelevant sentences, from large collections of web-scraped documents to improve their quality for tasks like content analysis or information extraction.
Not ideal if you're dealing with very small, manually curated text sets or if your primary need is complex linguistic annotation beyond simple cleaning.
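The two cleaning steps described above (dropping HTML tags, then dropping boilerplate sentences) can be illustrated with a minimal standard-library sketch. This is not pyplexity's API or its method: the names `extract_blocks`, `looks_like_boilerplate`, and `clean_document`, the phrase list, and the word-count threshold are all hypothetical, and the boilerplate test is a simple heuristic chosen only to keep the example self-contained.

```python
import re
from html.parser import HTMLParser

# Hypothetical settings for this sketch only; not taken from pyplexity.
BLOCK_TAGS = {"p", "div", "li", "nav", "br", "title", "section", "article",
              "header", "footer", "h1", "h2", "h3", "h4", "h5", "h6"}
SKIP_TAGS = {"script", "style"}
BOILERPLATE_PHRASES = ("cookie", "all rights reserved", "subscribe", "sign in")


class _BlockExtractor(HTMLParser):
    """Collects visible text and marks block-level tag boundaries with newlines."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        # Ignore script/style contents; collapse whitespace so that only
        # block-level tags introduce newlines.
        if self._skip_depth == 0:
            self.parts.append(re.sub(r"\s+", " ", data))


def extract_blocks(html):
    """Return the visible text of an HTML string as a list of text blocks."""
    parser = _BlockExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    return [block.strip() for block in text.split("\n") if block.strip()]


def looks_like_boilerplate(sentence, min_words=6):
    """Heuristic stand-in: very short fragments or known site phrases."""
    if len(sentence.split()) < min_words:
        return True
    lowered = sentence.lower()
    return any(phrase in lowered for phrase in BOILERPLATE_PHRASES)


def clean_document(html):
    """Strip tags, split each block into sentences, drop boilerplate ones."""
    kept = []
    for block in extract_blocks(html):
        for sentence in re.split(r"(?<=[.!?])\s+", block):
            if sentence and not looks_like_boilerplate(sentence):
                kept.append(sentence)
    return " ".join(kept)
```

Splitting on block-level tags before splitting into sentences keeps navigation bars and page titles from merging into neighbouring paragraphs, so they can be judged and dropped on their own. A real pipeline would replace `looks_like_boilerplate` with a stronger scoring function.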
Stars: 38
Forks: 3
Language: Python
License: GPL-3.0
Category:
Last pushed: Jun 07, 2023
Commits (30d): 0
Dependencies: 10
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/citiususc/pyplexity"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
Higher-rated alternatives
chartbeat-labs/textacy
NLP, before and after spaCy
nltk/nltk_data
NLTK Data
brightertiger/pygarble
Python Package to detect garbled, gibberish text for EN
jfilter/clean-text
🧹 Python package for text cleaning
prasanthg3/cleantext
An open-source package for python to clean raw text data