LoLei/redditcleaner
Cleans Reddit Text Data 📜 🧹
Text data from Reddit often contains special formatting, such as bolding, links, and code blocks, that interferes with analysis. This tool takes raw Reddit comments or submission self-texts, which can be full of Markdown and HTML entities, and outputs plain, readable text with the Reddit-specific formatting removed. It is aimed at data scientists and researchers working with social media data.
No commits in the last 6 months. Available on PyPI.
Use this if you need to prepare Reddit text data for natural language processing or other data science tasks by stripping away Reddit-specific formatting.
Not ideal if you need to remove common punctuation, numbers, or emojis, as this tool specifically targets Reddit's unique formatting.
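To make the kind of cleaning described above concrete, here is a minimal sketch using only the Python standard library. It is not redditcleaner's implementation or API: the function name `strip_reddit_markup` and the regular expressions are assumptions chosen for illustration, and they cover only a few common cases (HTML entities, inline code, links, bold/italic/strikethrough markers, and quote markers). Punctuation, numbers, and emojis are deliberately left in place, matching the scope described above.

```python
import html
import re


def strip_reddit_markup(text: str) -> str:
    """Illustrative sketch only; NOT the redditcleaner implementation."""
    # Decode HTML entities such as &amp; and &gt;
    text = html.unescape(text)
    # Unwrap inline code spans: `code` -> code
    text = re.sub(r"`([^`]*)`", r"\1", text)
    # Replace Markdown links with their label: [label](url) -> label
    text = re.sub(r"\[([^\]]+)\]\([^)]+\)", r"\1", text)
    # Remove paired bold and strikethrough markers: **x**, __x__, ~~x~~
    text = re.sub(r"(\*\*|__|~~)(.+?)\1", r"\2", text)
    # Remove single-character emphasis markers: *x* or _x_
    # (the lookarounds leave identifiers like snake_case untouched)
    text = re.sub(r"(?<!\w)[*_](.+?)[*_](?!\w)", r"\1", text)
    # Drop quote markers at the start of a line
    text = re.sub(r"^\s*>\s?", "", text, flags=re.MULTILINE)
    # Collapse runs of whitespace into single spaces
    return re.sub(r"\s+", " ", text).strip()


raw = "**Bold** claim, see [the docs](https://example.com) &amp; `code` 100% 🙂"
print(strip_reddit_markup(raw))
```

A real cleaner has to handle many more cases (tables, superscripts, spoilers, nested formatting), which is the reason to use a dedicated package rather than a handful of regular expressions.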
Stars: 83
Forks: 2
Language: Python
License: MIT
Category:
Last pushed: Apr 14, 2020
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/LoLei/redditcleaner"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
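The same request can be made from Python. The sketch below uses only the standard library and rests on two assumptions: that the three trailing path segments of the URL above are a category, a repository owner, and a repository name (the helper names `quality_url` and `fetch_quality` are hypothetical), and that nothing is known about the response format, so the body is returned as raw text rather than parsed.

```python
import urllib.request
from urllib.parse import quote

# Base path taken from the curl example; the segment meanings are assumed.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(category: str, owner: str, repo: str) -> str:
    """Build the endpoint URL (hypothetical helper, assumed path layout)."""
    parts = [quote(part, safe="") for part in (category, owner, repo)]
    return BASE_URL + "/" + "/".join(parts)


def fetch_quality(category: str, owner: str, repo: str, timeout: float = 10.0) -> str:
    """Fetch the endpoint and return the response body as text, unparsed."""
    url = quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)


# Example call (performs a network request, so it is left commented out):
# print(fetch_quality("nlp", "LoLei", "redditcleaner"))
```

Given the stated limit of 100 keyless requests per day, it is worth caching responses locally instead of re-fetching the same repository.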
Higher-rated alternatives
chartbeat-labs/textacy: NLP, before and after spaCy
nltk/nltk_data: NLTK Data
brightertiger/pygarble: Python Package to detect garbled, gibberish text for EN
jfilter/clean-text: 🧹 Python package for text cleaning
prasanthg3/cleantext: An open-source package for python to clean raw text data