Speech Corpora Datasets Voice AI Tools

Collections and catalogs of annotated speech audio data for training ASR, TTS, and voice AI models. This category does not include tools for processing or cleaning datasets, annotation pipelines, or model implementations.

This category tracks 72 speech corpus and dataset projects. Four score 50 or higher (Established tier). The highest-rated is ynop/audiomate at 54/100 with 138 stars.

Get all 72 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=voice-ai&subcategory=speech-corpora-datasets&limit=20"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
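
The same request can be made from Python. The following is a minimal sketch using only the standard library. The function names `build_quality_url` and `fetch_projects` are hypothetical helpers introduced for illustration, and the shape of the JSON response is not documented on this page, so the sketch only parses it without assuming any fields. The curl command above uses limit=20; whether the API accepts a larger value to return all 72 projects in one call is an assumption that would need checking.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the curl command above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_quality_url(domain="voice-ai",
                      subcategory="speech-corpora-datasets",
                      limit=20):
    """Rebuild the query URL used in the curl command (hypothetical helper)."""
    query = urlencode({"domain": domain,
                       "subcategory": subcategory,
                       "limit": limit})
    return f"{BASE_URL}?{query}"


def fetch_projects(url, timeout=30):
    """Fetch the URL and parse the body as JSON.

    The response schema is an unknown here, so the parsed value is
    returned as-is without assuming any keys.
    """
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    url = build_quality_url()
    print(url)
    # Uncomment to send a real request; each call counts against the
    # 100 requests/day keyless limit mentioned above.
    # data = fetch_projects(url)
    # print(type(data))
```

Keeping URL construction separate from the network call makes it easy to check the query string without spending any of the daily request quota.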

| # | Tool | Description | Score | Tier |
|---|---|---|---|---|
| 1 | ynop/audiomate | Python library for handling audio datasets. | 54 | Established |
| 2 | reazon-research/ReazonSpeech | Massive open Japanese speech corpus | 52 | Established |
| 3 | common-voice/cv-dataset | Metadata and versioning details for the Common Voice dataset | 50 | Established |
| 4 | davidmartinrius/speech-dataset-generator | 🔊 Create labeled datasets, enhance audio quality, identify speakers, support... | 50 | Established |
| 5 | EgorLakomkin/KTSpeechCrawler | Automatically constructing corpus for automatic speech recognition from... | 47 | Emerging |
| 6 | coqui-ai/open-speech-corpora | 💎 A list of accessible speech corpora for ASR, TTS, and other Speech Technologies | 46 | Emerging |
| 7 | yc9701/pansori | Tools for ASR Corpus Generation from Online Video | 46 | Emerging |
| 8 | Niger-Volta-LTI/yoruba-text | Yorùbá language training text for NLP, ASR and TTS tasks | 45 | Emerging |
| 9 | jim-schwoebel/download_audioset | 📁 This repo makes it easy to download the raw audio files from AudioSet... | 44 | Emerging |
| 10 | Appen/UHV-OTS-Speech | A data annotation pipeline to generate high-quality, large-scale speech... | 43 | Emerging |
| 11 | candlewill/Speech-Corpus-Collection | A Collection of Speech Corpus for ASR and TTS | 43 | Emerging |
| 12 | dsfsi/dsfsi-datasets | Official DSFSI Public Datasets Registry - Comprehensive catalog of 50+... | 41 | Emerging |
| 13 | Umbaji/NMTMD | Official repository for the Opensource Textdataset for NMT for local languages... | 40 | Emerging |
| 14 | robmsmt/ASR-Audio-Data-Links | A list of publicly available audio data that anyone can download for ASR... | 40 | Emerging |
| 15 | unza-speech-lab/zambezi-voice | Repository for multilingual speech data resources for native languages of Zambia. | 39 | Emerging |
| 16 | wspr-ncsu/robocall-audio-dataset | A dataset of real-world robocall audio recordings | 38 | Emerging |
| 17 | AsoSoft/AsoSoft-TTS-Speech-Corpus-for-Central-Kurdish | AsoSoft Speech Corpus for Central-Kurdish Text-To-Speech | 37 | Emerging |
| 18 | silenterus/deepspeech-cleaner | Multi-Language Dataset Cleaner/Creator for Mozilla's DeepSpeech Framework | 36 | Emerging |
| 19 | yc9701/pansori-tedxkr-corpus | Korean ASR Corpus generated from TEDx talks | 35 | Emerging |
| 20 | IS2AI/ISSAI_SAIDA_Kazakh_ASR | The first industrial-scale open-source Kazakh speech corpus. KSC2 corpus... | 35 | Emerging |
| 21 | khuangaf/ITRI-speech-recognition-dataset-generation | Automatic Speech Recognition Dataset Generation | 34 | Emerging |
| 22 | xinjli/ucla-phonetic-corpus | Dataset of ICASSP 2021 MULTILINGUAL PHONETIC DATASET FOR LOW RESOURCE SPEECH... | 33 | Emerging |
| 23 | egorsmkv/asr-corpus-creator | This app is intended to automatically create a corpus for ASR systems using... | 33 | Emerging |
| 24 | AI-TOOLKIT/VoiceData | Automatic Speech Recognition (ASR) Data Generator Toolkit | 33 | Emerging |
| 25 | swarms/mozilla-common-voice | Swarms supports the Common Voice Project from Mozilla! This repo contains... | 33 | Emerging |
| 26 | PranavMishra17/VoicePersona-Dataset | A comprehensive voice persona dataset for character consistency in voice... | 33 | Emerging |
| 27 | cyrta/broadcast-news-videos-dataset | Collection of broadcast news video clips | 32 | Emerging |
| 28 | turinaf/Sagalee | Automatic Speech Recognition Dataset for Oromo Language | 32 | Emerging |
| 29 | csikasote/bigc | This repository contains the data resources for the LacunaFund supported... | 32 | Emerging |
| 30 | BYO-UPM/Neurovoz_Dababase | Neurovoz corpus of parkinsonian speech | 31 | Emerging |
| 31 | goodmike31/pl-asr-speech-data-survey | Survey of available speech datasets for Polish ASR development | 31 | Emerging |
| 32 | skit-ai/speech-to-intent-dataset | Dataset Release for Intent Classification from Speech | 31 | Emerging |
| 33 | Niger-Volta-LTI/urhobo-asr-spoken-digits | URH-DIGITS is a connected digits speech recognition task | 31 | Emerging |
| 34 | jhdeov/armenian-intonation | Repository of question-answer dialogues of Armenian, for an intonation study. | 31 | Emerging |
| 35 | zhongyuchen/DSPSpeech-20 | A speech dataset of 20 isolated words each with 680 recordings from 34 individuals | 30 | Emerging |
| 36 | 97jamie/public-police-footage | Code for Constructing Datasets From Public Police Body Camera Footage (ICASSP 2025) | 30 | Emerging |
| 37 | r9y9/jsut-lab | HTS-style full-context labels for JSUT v1.1 | 29 | Experimental |
| 38 | german-asr/megs | A merged version of multiple open-source German speech datasets. | 29 | Experimental |
| 39 | rusiaaman/PCPM | Presenting Collection of Pretrained Models. Links to pretrained models in... | 29 | Experimental |
| 40 | Prem-kumar27/Fast-KTSpeechCrawler | Parallelized automatic corpus collection for ASR. Forked from... | 29 | Experimental |
| 41 | Pogayo/african-voices-web | Website that hosts the African Voices projects. Users can download datasets... | 27 | Experimental |
| 42 | qcri/Arabic_speech_code_switching | The first Dialectal Arabic Code Switching - DACS corpus from broadcast... | 27 | Experimental |
| 43 | bunyaminergen/awesome-speech-dataset | Awesome Speech Dataset, including download links and a brief explanation for... | 25 | Experimental |
| 44 | apluka34/audio-crawler | A tool for crawling and creating audio dataset | 23 | Experimental |
| 45 | Anwarvic/mTEDx_auxiliary | These are different files I created to do different tasks when I was working... | 22 | Experimental |
| 46 | motazsaad/jsc-news-broadcast | JSC news broadcast (speech corpus) | 22 | Experimental |
| 47 | antouanbg/Bulgarian_Linguistic | Collection and resources for Bulgarian Corpus, Datasets and Models used in... | 22 | Experimental |
| 48 | labsensacional/ASMRDataset | Recordings and transcriptions of ASMR artists compiled for the purpose of... | 21 | Experimental |
| 49 | czyzi0/the-mc-speech-dataset | Free speech dataset consisting of 24018 short audio clips of a single... | 21 | Experimental |
| 50 | jp1924/HF_builders | A repo collecting builder scripts for 🤗 Datasets | 21 | Experimental |
| 51 | vislupus/Bulgarian-TTS-dataset | LibriVox dataset for Bulgarian language TTS | 20 | Experimental |
| 52 | harveenchadha/Speech-Learning-Resources | Repo containing resources to learn about various verticals of speech. ASR, TTS | 20 | Experimental |
| 53 | egorsmkv/asr-datasets-cleaner | A pipeline to make ASR datasets better | 19 | Experimental |
| 54 | weimeng23/audio-speech-datasets | 📜 A list of various Audio/Speech datasets about Speech Recognition,... | 19 | Experimental |
| 55 | ubisoft/ubisoft-laforge-french-homograph-dataset | Dataset for La Forge Speech Synthesis System Submission to the Blizzard... | 19 | Experimental |
| 56 | Rumeysakeskin/Speech-Datasets-for-ASR | Download speech datasets (English and non-English) for Automatic Speech Recognition | 19 | Experimental |
| 57 | mrcraked/WordAudio | A massive collection of high-quality MP3 word pronunciations. Download,... | 18 | Experimental |
| 58 | navalnica/be_nlp_speech_resources | Links to Belarusian NLP and Speech resources | 18 | Experimental |
| 59 | Aditya-ds-1806/Alar-voice-corpus | Voice corpus for the Alar Kannada-English Dictionary | 17 | Experimental |
| 60 | Umbaji/Yodi | This is the official repository for Yodi, the speech recognition model for 8... | 17 | Experimental |
| 61 | speakingofdata/80_Excerpts | 4 voices x 80 transcripts = 320 audio recordings | 15 | Experimental |
| 62 | Giuseppe-Della-Corte/IESTAC | A corpus that can be used to train English-to-Italian End-to-End... | 13 | Experimental |
| 63 | carlfm01/my-speech-datasets | My public domain speech index | 13 | Experimental |
| 64 | nafiuny/voice_conversion_dataset | Top dataset for voice conversion models | 11 | Experimental |
| 65 | speakingofdata/LJ2_Corpus | Single speaker, 26,200 transcribed audio recordings, 48 hours total | 11 | Experimental |
| 66 | Mormolykos/bedvibe-datasets | Multilingual emotional speech datasets for TTS training | 11 | Experimental |
| 67 | kan-bayashi/VCTKCorpusFullContextLabel | Full context label for VCTK Corpus. | 11 | Experimental |
| 68 | Litee/tts-asr-corpora | Catalogue of TTS and ASR corpora that can be used for machine learning | 11 | Experimental |
| 69 | Umbaji/umini_speech | This is the official repository for the training of Yodi V1, the first speech... | 10 | Experimental |
| 70 | IrinaKipyatkova/AnKaS | AnKaS: Database of Annotations of Karelian Speech | 10 | Experimental |
| 71 | mandeebot/Berom_Speech_Dataset | This repo is a work in progress, building a Speech corpus for Berom, a low... | 10 | Experimental |
| 72 | jasminsternkopf/corpus_design_with_greedy_algorithm | Building a corpus whose unit distribution is approximately the same as a... | 10 | Experimental |
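
The tier labels in the listing appear to follow fixed score cut-offs. The sketch below encodes the boundaries as they can be inferred from the rows above: the lowest Established score listed is 50, Emerging scores run from 47 down to 30, and the highest Experimental score is 29. These thresholds are an inference from the listed data, not a documented rule, and because no project scores 48 or 49 the exact Established cut-off is an assumption. The function name `tier_for_score` is hypothetical.

```python
def tier_for_score(score: int) -> str:
    """Map a 0-100 score to a tier using cut-offs inferred from the listing.

    Assumed thresholds (not confirmed by the source):
      50 and up  -> Established
      30 to 49   -> Emerging
      below 30   -> Experimental
    """
    if score >= 50:
        return "Established"
    if score >= 30:
        return "Emerging"
    return "Experimental"


if __name__ == "__main__":
    # Scores taken from rows in the listing above.
    for score in (54, 50, 47, 30, 29, 10):
        print(score, tier_for_score(score))
```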