hollygrimm/voice-dataset-creation

Tools to create your own voice dataset for TTS training

Emerging

This project helps Indigenous communities and language preservationists create high-quality voice datasets for training text-to-speech (TTS) AI models. It takes raw or existing audio recordings of speech and processes them into a structured dataset in the LJSpeech format, complete with transcripts and metadata. The intended users are community members, linguists, and cultural heritage managers working on language revitalization.
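As a rough illustration of the target layout, the sketch below writes an LJSpeech-style `metadata.csv`: one pipe-delimited line per clip (clip ID, transcript, normalized transcript), next to a `wavs/` folder holding `<clip_id>.wav` files. The function name `write_ljspeech_metadata` and the clip IDs are hypothetical and chosen for illustration; this is not code from the repository, only a minimal sketch of the format it targets.

```python
from pathlib import Path


def write_ljspeech_metadata(dataset_dir, clips):
    """Write an LJSpeech-style metadata.csv (hypothetical helper).

    clips: iterable of (clip_id, transcript, normalized_transcript) tuples.
    Each clip_id is assumed to correspond to a file wavs/<clip_id>.wav.
    """
    dataset_dir = Path(dataset_dir)
    # LJSpeech keeps audio clips in a "wavs" folder beside metadata.csv.
    (dataset_dir / "wavs").mkdir(parents=True, exist_ok=True)

    metadata_path = dataset_dir / "metadata.csv"
    with metadata_path.open("w", encoding="utf-8", newline="") as f:
        for clip_id, text, normalized in clips:
            for field in (clip_id, text, normalized):
                # The format has no quoting, so a pipe or newline inside
                # a field would corrupt the row.
                if "|" in field or "\n" in field:
                    raise ValueError(f"invalid character in field: {field!r}")
            f.write(f"{clip_id}|{text}|{normalized}\n")
    return metadata_path


if __name__ == "__main__":
    path = write_ljspeech_metadata(
        "my_dataset",
        [("clip_0001", "Chapter 1", "Chapter one")],
    )
    print(path.read_text(encoding="utf-8"), end="")
```

The file is written as UTF-8 so that transcripts in languages with non-ASCII orthographies survive unchanged. Fields are rejected rather than escaped when they contain a pipe, because the format has no quoting mechanism.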

No commits in the last 6 months.

Use this if you need to build a speech dataset for an endangered language, ensuring proper data governance and community consent.

Not ideal if you are looking for a pre-built, off-the-shelf dataset for a major global language or if your primary goal is general speech recognition model training rather than TTS.

language-revitalization, Indigenous-data-governance, cultural-heritage, linguistic-documentation, speech-technology

No License, Stale 6m, No Package, No Dependents

Maintenance 0 / 25

Adoption 9 / 25

Maturity 8 / 25

Community 13 / 25

Language: Jupyter Notebook

License: none

Higher-rated alternatives

hetpandya/youtube_tts_data_generator

A Python library to generate speech datasets from YouTube videos

IS2AI/Kazakh_TTS

An expanded version of the previously released Kazakh text-to-speech (KazakhTTS) synthesis...

taresh18/TTSizer

🎙️ Automatically transcribe audio/video into high-quality, speaker-specific Text-To-Speech datasets ✨

Hecate2/sukasuka-vocal-dataset-builder

SukaSuka anime vocal dataset. 1st anime vocal dataset. Extract audio (vocal) files from video based on .ass...

youmebangbang/TTS-dataset-tools

Automatically generates TTS dataset using audio and associated text. Make cuts under a custom...
