citiususc/pyplexity

Cleaning tool for web scraped text

40 / 100 (Emerging)

This tool helps researchers, data analysts, or content strategists clean up large amounts of web-scraped text. It takes raw HTML or WARC files, removes messy HTML tags and unwanted 'boilerplate' sentences, then outputs cleaner, more focused text or documents. This is ideal for anyone working with web data who needs to prepare it for analysis or further processing.
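To make the tag-removal step concrete, here is a minimal sketch using only Python's standard library. It is not pyplexity's own implementation or API; the names `TextExtractor` and `strip_tags` are hypothetical and chosen for illustration. It shows only the simpler half of the job (dropping markup); deciding which sentences are boilerplate is not covered.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self.parts.append(data)


def strip_tags(html):
    """Return the text content of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace left behind by the removed markup.
    return " ".join("".join(parser.parts).split())
```

Calling `strip_tags` on a small page fragment returns just its readable text, which is the kind of input a later boilerplate filter would work on.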

No commits in the last 6 months. Available on PyPI.

Use this if you need to systematically remove noise, like HTML tags or irrelevant sentences, from large collections of web-scraped documents to improve their quality for tasks like content analysis or information extraction.

Not ideal if you're dealing with very small, manually curated text sets or if your primary need is complex linguistic annotation beyond simple cleaning.

web-scraping text-cleaning content-analysis data-preparation information-extraction
Maintenance 0 / 25
Adoption 7 / 25
Maturity 25 / 25
Community 8 / 25


Stars: 38
Forks: 3
Language: Python
License: GPL-3.0
Last pushed: Jun 07, 2023
Commits (30d): 0
Dependencies: 10

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/nlp/citiususc/pyplexity"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
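The same request can be made from Python. The sketch below uses only the standard library and rests on two assumptions not stated on this page: that the path segments after `/quality/` are a category, an owner, and a repository name, and that the endpoint returns a JSON body. The function names are hypothetical, and the response schema is not documented here, so the result is returned unparsed beyond JSON decoding.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_quality_url(category, owner, repo):
    """Build the endpoint URL; the segment meaning is an assumption."""
    # Percent-encode each path segment so unusual names stay valid.
    segments = (quote(part, safe="") for part in (category, owner, repo))
    return BASE_URL + "/" + "/".join(segments)


def fetch_quality(category, owner, repo, timeout=10):
    """Fetch the quality record, assuming the response body is JSON."""
    url = build_quality_url(category, owner, repo)
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_quality("nlp", "citiususc", "pyplexity")` issues the same request as the curl command above. With the keyless limit of 100 requests/day, caching the result locally is advisable when polling many repositories.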