The-Data-Dilemma/ParquetToHuggingFace
ParquetToHuggingFace processes raw audio data, converts it into Parquet files, and uploads them to Hugging Face. The README explains how to set up the environment, configure paths, and run the scripts to generate and upload the data.
This tool helps researchers and data scientists prepare and share audio datasets for machine learning. It takes raw audio recordings and their transcriptions, converts them into a standardized Parquet format, and uploads the result to Hugging Face, where the dataset can be accessed and shared by the wider ML community. It is best suited to audio work on speech-to-text or translation tasks.
No commits in the last 6 months.
Use this if you need to convert your raw audio data and its corresponding text into a structured Parquet format and then publish it as a dataset on Hugging Face.
Not ideal if you are looking to analyze audio directly without preparing it for a machine learning dataset, or if you prefer not to use Hugging Face for data sharing.
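The flow described above (pair audio with transcriptions, write Parquet, upload to Hugging Face) can be sketched roughly as follows. This is a minimal illustration, not the repository's actual scripts: the function names, the `.wav` extension, the `audio`/`transcription` column names, and the bytes-plus-path layout of the audio column are all assumptions. Writing Parquet assumes `pandas` with `pyarrow` is installed, and uploading assumes `huggingface_hub` is installed and a token is already configured.

```python
from pathlib import Path


def build_records(audio_dir, transcripts):
    """Pair each .wav file in audio_dir with its transcription.

    `transcripts` maps a file name (e.g. "clip1.wav") to its text.
    Files without a transcription are skipped.
    """
    records = []
    for path in sorted(Path(audio_dir).glob("*.wav")):
        text = transcripts.get(path.name)
        if text is None:
            continue
        records.append({
            # Assumed layout: raw bytes plus the original file name.
            "audio": {"bytes": path.read_bytes(), "path": path.name},
            "transcription": text,
        })
    return records


def write_parquet(records, out_path):
    """Write the records to a single Parquet file (needs pandas + pyarrow)."""
    import pandas as pd  # third-party, imported lazily

    pd.DataFrame(records).to_parquet(out_path, index=False)


def upload_parquet(out_path, repo_id):
    """Upload the file to a Hugging Face dataset repo (needs huggingface_hub)."""
    from huggingface_hub import HfApi  # third-party, imported lazily

    api = HfApi()  # uses a token from the environment or a prior login
    api.create_repo(repo_id, repo_type="dataset", exist_ok=True)
    api.upload_file(
        path_or_fileobj=str(out_path),
        path_in_repo=Path(out_path).name,
        repo_id=repo_id,
        repo_type="dataset",
    )
```

With hypothetical paths and a hypothetical repo id, a run might look like `records = build_records("data/wav", {"clip1.wav": "hello"})`, then `write_parquet(records, "train.parquet")`, then `upload_parquet("train.parquet", "your-username/your-dataset")`. Keeping the three steps separate lets the pairing logic be checked without touching the network.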
Stars: 9
Forks: 4
Language: Python
License: MIT
Category:
Last pushed: May 16, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/voice-ai/The-Data-Dilemma/ParquetToHuggingFace"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
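The same request can be made from Python with only the standard library. This is a sketch under two assumptions that the page does not confirm: that other repositories follow the same `owner/repo` URL pattern as the curl example, and that the endpoint returns JSON. The helper names `quality_url` and `fetch_quality` are hypothetical.

```python
import json
import urllib.request

# Base path copied from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality/voice-ai"


def quality_url(owner, repo):
    """Build the request URL (assumes the owner/repo pattern generalizes)."""
    return f"{BASE_URL}/{owner}/{repo}"


def fetch_quality(owner, repo, timeout=10):
    """Fetch the record; assumes a JSON response body (not confirmed)."""
    with urllib.request.urlopen(quality_url(owner, repo), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `fetch_quality("The-Data-Dilemma", "ParquetToHuggingFace")` issues the same GET request as the curl command, and unauthenticated calls count against the 100 requests/day limit.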
Higher-rated alternatives
Picovoice/rhino: On-device Speech-to-Intent engine powered by deep learning
yandexdataschool/speech_course: YSDA course in Speech Processing.
MycroftAI/adapt: Adapt Intent Parser
Picovoice/speech-to-intent-benchmark: benchmark for Speech-to-Intent engines
IBM/BigLittleNet: Official repository for Big-Little Net