hollygrimm/voice-dataset-creation

Tools to create your own voice dataset for TTS training

30 / 100 Emerging

This project helps Indigenous communities and language preservationists create high-quality voice datasets for training text-to-speech (TTS) AI models. It takes raw or existing audio recordings of speech and processes them into a structured dataset in the LJSpeech format, complete with transcripts and metadata. The intended users are community members, linguists, and cultural heritage managers working on language revitalization.
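For readers unfamiliar with the LJSpeech layout mentioned above: the public LJSpeech dataset stores clips in a `wavs/` folder alongside a pipe-delimited `metadata.csv` with three fields per line (clip ID, transcript, normalized transcript). The Python sketch below writes such a file. It is an illustration only: the function name `write_ljspeech_metadata` and the sample clip are hypothetical and are not taken from this repository's notebooks.

```python
from pathlib import Path


def write_ljspeech_metadata(dataset_dir, clips):
    """Write an LJSpeech-style metadata.csv (hypothetical helper).

    clips: iterable of (clip_id, transcript, normalized_transcript) strings.
    Each clip's audio is expected at <dataset_dir>/wavs/<clip_id>.wav.
    """
    root = Path(dataset_dir)
    # LJSpeech keeps the audio clips in a "wavs" subfolder.
    (root / "wavs").mkdir(parents=True, exist_ok=True)

    metadata_path = root / "metadata.csv"
    with metadata_path.open("w", encoding="utf-8", newline="\n") as f:
        for clip_id, text, normalized in clips:
            fields = (clip_id, text, normalized)
            # The file has no quoting or escaping, so a pipe or newline
            # inside a field would corrupt the row.
            if any("|" in field or "\n" in field for field in fields):
                raise ValueError(f"field contains '|' or newline: {clip_id}")
            f.write("|".join(fields) + "\n")
    return metadata_path
```

Calling `write_ljspeech_metadata("my_dataset", [("clip_0001", "Dr. Smith arrived.", "Doctor Smith arrived.")])` would produce a single row, `clip_0001|Dr. Smith arrived.|Doctor Smith arrived.`, matching an audio file expected at `my_dataset/wavs/clip_0001.wav`.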

No commits in the last 6 months.

Use this if you need to build a speech dataset for an endangered language, ensuring proper data governance and community consent.

Not ideal if you are looking for a pre-built, off-the-shelf dataset for a major global language or if your primary goal is general speech recognition model training rather than TTS.

language-revitalization Indigenous-data-governance cultural-heritage linguistic-documentation speech-technology
No License, Stale 6m, No Package, No Dependents
Maintenance 0 / 25
Adoption 9 / 25
Maturity 8 / 25
Community 13 / 25


Stars: 71
Forks: 8
Language: Jupyter Notebook
License: None
Last pushed: Oct 26, 2020
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/voice-ai/hollygrimm/voice-dataset-creation"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
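The same request as the curl command above can be made from Python with the standard library. This is a minimal sketch under two assumptions not confirmed by this page: that the path follows a `category/owner/repo` pattern (inferred from the single example URL), and that the endpoint returns JSON. The helper names `quality_url` and `fetch_quality` are hypothetical.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(category, owner, repo):
    """Build the endpoint URL (path pattern assumed from the one example)."""
    parts = [quote(part, safe="") for part in (category, owner, repo)]
    return f"{BASE_URL}/" + "/".join(parts)


def fetch_quality(category, owner, repo, timeout=10):
    """Fetch the quality record; assumes the response body is JSON."""
    url = quality_url(category, owner, repo)
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

For this repository the call would be `fetch_quality("voice-ai", "hollygrimm", "voice-dataset-creation")`. The fields in the returned record are not documented here, so inspect the result before relying on any particular key.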