lgomezt/tidyX
Python package to clean raw tweets for ML applications.
This tool helps researchers, marketers, and analysts turn messy raw text, especially Spanish-language posts from social media platforms like Twitter, into clean, structured data ready for analysis. It takes tweets and other short-form text and outputs a streamlined version with noise such as URLs, hashtags, and emojis removed, which suits natural language processing applications. It is aimed at anyone who needs to prepare social media data for sentiment analysis, topic modeling, or other text-based analysis.
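To make the kind of cleaning described above concrete, here is a minimal standard-library sketch. It is not tidyX's API: the function name `clean_tweet`, the regular expressions, and the choice to also drop @-mentions are assumptions made for illustration, and the emoji ranges are approximate rather than exhaustive.

```python
import re

# Hypothetical patterns for illustration; they do not come from tidyX itself.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
TAG_RE = re.compile(r"[#@]\w+")  # hashtags, plus @-mentions (an assumption)
EMOJI_RE = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, extended symbols
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\uFE0F\u200D"           # variation selector and zero-width joiner
    "]+"
)
WHITESPACE_RE = re.compile(r"\s+")


def clean_tweet(text: str) -> str:
    """Strip URLs, hashtags, mentions, and common emojis, then tidy whitespace."""
    text = URL_RE.sub(" ", text)
    text = TAG_RE.sub(" ", text)
    text = EMOJI_RE.sub(" ", text)
    # Collapse the gaps left by the removals above.
    return WHITESPACE_RE.sub(" ", text).strip()


print(clean_tweet("¡Qué gran día! 😀 #feliz @amigo https://t.co/abc123"))
```

Accented characters and inverted punctuation are left untouched here, since they carry meaning in Spanish text. Whether to also lowercase or strip accents depends on the downstream model.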
No commits in the last 6 months.
Use this if you need to quickly and efficiently clean social media text, especially Spanish tweets, to prepare it for machine learning or other analytical tasks.
Not ideal if your primary need is deep linguistic analysis or the processing of highly structured, formal text from outside social media.
Stars: 26
Forks: 1
Language: Python
License: MIT
Category:
Last pushed: Feb 20, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/lgomezt/tidyX"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
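The same request can be made from Python. The sketch below uses only the standard library and rests on two assumptions not stated on this page: that the endpoint returns JSON, and that the URL pattern generalizes to other `owner/repo` pairs. The helper names `quality_url` and `fetch_quality` are hypothetical.

```python
import json
import urllib.request

# Base path copied from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality/nlp"


def quality_url(owner: str, repo: str) -> str:
    """Build the endpoint URL for a repository (pattern assumed from the one example)."""
    return f"{BASE_URL}/{owner}/{repo}"


def fetch_quality(owner: str, repo: str, timeout: float = 10.0):
    """Fetch and decode the response, assuming the body is JSON."""
    with urllib.request.urlopen(quality_url(owner, repo), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `fetch_quality("lgomezt", "tidyX")` would issue the request shown in the curl line. With the keyless limit of 100 requests/day, caching the result locally is a sensible precaution.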
Higher-rated alternatives
chartbeat-labs/textacy: NLP, before and after spaCy
nltk/nltk_data: NLTK Data
brightertiger/pygarble: Python Package to detect garbled, gibberish text for EN
jfilter/clean-text: 🧹 Python package for text cleaning
prasanthg3/cleantext: An open-source package for python to clean raw text data