citiususc/pyplexity

Cleaning tool for web scraped text

/ 100

Emerging

This tool helps researchers, data analysts, and content strategists clean large volumes of web-scraped text. It takes raw HTML or WARC files, strips HTML tags and unwanted boilerplate sentences, and outputs cleaner, more focused text or documents. It suits anyone who needs to prepare web data for analysis or further processing.
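To make the two cleaning steps concrete, here is a minimal standard-library sketch of tag stripping followed by sentence-level boilerplate removal. It does not use pyplexity's own API or reproduce its filtering method. The names `strip_tags`, `drop_boilerplate`, `clean`, and the `BOILERPLATE_PHRASES` list are all hypothetical, and the keyword heuristic is only a stand-in for whatever a real tool would apply.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    _SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self._SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join text nodes with spaces and collapse runs of whitespace.
        return re.sub(r"\s+", " ", " ".join(self._chunks)).strip()


def strip_tags(html):
    """Step 1: reduce an HTML string to its visible text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Hypothetical markers of boilerplate sentences (an assumption for this sketch).
BOILERPLATE_PHRASES = ("cookie", "subscribe", "all rights reserved", "click here")


def drop_boilerplate(text, min_words=4):
    """Step 2: drop very short sentences and those containing a marker phrase."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    kept = []
    for sentence in sentences:
        if len(sentence.split()) < min_words:
            continue
        if any(phrase in sentence.lower() for phrase in BOILERPLATE_PHRASES):
            continue
        kept.append(sentence)
    return " ".join(kept)


def clean(html):
    """Run both steps: tag removal, then sentence filtering."""
    return drop_boilerplate(strip_tags(html))
```

Calling `clean(page_html)` on a fetched page returns only the sentences that survive both filters. A fixed phrase list is brittle on real data, so treat it as a placeholder for a stronger sentence-scoring step.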

No commits in the last 6 months. Available on PyPI.

Use this if you need to systematically remove noise, like HTML tags or irrelevant sentences, from large collections of web-scraped documents to improve their quality for tasks like content analysis or information extraction.

Not ideal if you're dealing with very small, manually curated text sets or if your primary need is complex linguistic annotation beyond simple cleaning.

web-scraping text-cleaning content-analysis data-preparation information-extraction

Stale 6m

Maintenance 0 / 25

Adoption 7 / 25

Maturity 25 / 25

Community 8 / 25

Language: Python

License: GPL-3.0

Higher-rated alternatives

chartbeat-labs/textacy

NLP, before and after spaCy

nltk/nltk_data

NLTK Data

brightertiger/pygarble

Python Package to detect garbled, gibberish text for EN

jfilter/clean-text

🧹 Python package for text cleaning

prasanthg3/cleantext

An open-source package for python to clean raw text data
