The-Data-Dilemma/ParquetToHuggingFace

ParquetToHuggingFace processes raw audio data, converts it into Parquet files, and uploads them to Hugging Face. The README explains how to set up the environment, configure paths, and run the scripts to generate and upload the data.

Score: 37 / 100 (Emerging)

This tool helps researchers and data scientists prepare and share audio datasets for machine learning. It takes raw audio recordings and their transcriptions, converts them into a standardized Parquet format, and uploads the result to Hugging Face, where the dataset can be accessed and shared by the broader ML community. It is best suited to speech-to-text and speech translation work.

No commits in the last 6 months.

Use this if you need to convert your raw audio data and its corresponding text into a structured Parquet format and then publish it as a dataset on Hugging Face.

Not ideal if you are looking to analyze audio directly without preparing it for a machine learning dataset, or if you prefer not to use Hugging Face for data sharing.
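The workflow described above (pair audio with transcriptions, write Parquet, upload to Hugging Face) can be sketched roughly as follows. This is a minimal illustration, not the repository's actual scripts: the function names `build_records` and `write_and_upload`, the column names, and the assumed layout (one `.wav` file per recording with a same-named `.txt` transcript) are all hypothetical. The upload step assumes `pyarrow` and `huggingface_hub` are installed and that a Hugging Face token is already configured.

```python
from pathlib import Path


def build_records(audio_dir, transcript_dir):
    """Pair each .wav file with a same-named .txt transcript (assumed layout)."""
    records = []
    for audio_path in sorted(Path(audio_dir).glob("*.wav")):
        transcript_path = Path(transcript_dir) / (audio_path.stem + ".txt")
        if not transcript_path.exists():
            continue  # skip recordings that have no transcript
        records.append({
            "file_name": audio_path.name,
            "audio": audio_path.read_bytes(),  # raw file bytes
            "transcription": transcript_path.read_text(encoding="utf-8").strip(),
        })
    return records


def write_and_upload(records, parquet_path, repo_id):
    """Write records to a Parquet file and upload it to a dataset repo."""
    # Third-party imports stay local so build_records works without them.
    import pyarrow as pa
    import pyarrow.parquet as pq
    from huggingface_hub import HfApi

    table = pa.Table.from_pylist(records)
    pq.write_table(table, parquet_path)

    api = HfApi()  # picks up the locally stored Hugging Face token
    api.create_repo(repo_id, repo_type="dataset", exist_ok=True)
    api.upload_file(
        path_or_fileobj=str(parquet_path),
        path_in_repo=Path(parquet_path).name,
        repo_id=repo_id,
        repo_type="dataset",
    )
```

A call might look like `write_and_upload(build_records("audio", "transcripts"), "train.parquet", "your-username/your-dataset")`, where the directory names and the repo id are placeholders.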

Tags: audio-processing, speech-recognition, dataset-creation, natural-language-processing, machine-learning-research
Status: Stale (6m), No Package, No Dependents
Maintenance 2 / 25
Adoption 5 / 25
Maturity 15 / 25
Community 15 / 25

Stars: 9
Forks: 4
Language: Python
License: MIT
Last pushed: May 16, 2025
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/voice-ai/The-Data-Dilemma/ParquetToHuggingFace"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
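The same request can be made from Python with only the standard library. The sketch below assumes the URL follows the pattern seen in the curl command (category, then owner, then repository); the helper names `quality_url` and `fetch_quality` are hypothetical, and because the response format is not documented here, the body is returned as plain text rather than parsed.

```python
import urllib.request
from urllib.parse import quote

BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(category, owner, repo):
    """Build the endpoint URL, percent-encoding each path segment."""
    parts = [quote(part, safe="") for part in (category, owner, repo)]
    return BASE_URL + "/" + "/".join(parts)


def fetch_quality(category, owner, repo, timeout=10):
    """Fetch the raw response body as text (format assumed, not documented)."""
    url = quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

For this repository the call would be `fetch_quality("voice-ai", "The-Data-Dilemma", "ParquetToHuggingFace")`.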