Multilingual Speech Datasets Voice AI Tools

Curated speech corpora and audio datasets across multiple languages for training ASR and speech processing models. Does NOT include text-to-speech synthesis, voice cloning, or speech recognition inference tools.

17 multilingual speech dataset tools are tracked. The highest-rated is qianchang/zici, which scores 43/100 and has 31 stars.

Get all 17 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=voice-ai&subcategory=multilingual-speech-datasets&limit=20"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
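The same request can be made from a script. The following is a minimal Python sketch using only the standard library; it rebuilds the URL shown in the curl command above and decodes the response as JSON. The helper names `build_url` and `fetch_projects` are hypothetical, and because the response schema is not documented on this page, the sketch returns the decoded payload without assuming any field names.

```python
import json
import urllib.parse
import urllib.request

# Endpoint taken from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="voice-ai",
              subcategory="multilingual-speech-datasets",
              limit=20):
    """Build the query URL (hypothetical helper, not part of any official client)."""
    query = urllib.parse.urlencode(
        {"domain": domain, "subcategory": subcategory, "limit": limit}
    )
    return f"{BASE_URL}?{query}"


def fetch_projects(url, timeout=30):
    """Fetch the URL and decode the body as JSON.

    The structure of the payload is an unknown here, so it is returned
    as-is (a dict or a list, whatever the server sends).
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_projects(build_url())` performs the network request. With no key, the limit of 100 requests/day stated above applies, so it is worth caching the result locally rather than re-fetching on every run.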

| # | Tool | Description | Score | Tier |
|---|------|-------------|-------|------|
| 1 | qianchang/zici | Zici (characters and words): collects resources related to Chinese classics and Chinese character/word pinyin | 43 | Emerging |
| 2 | gheyret/UQSpeechDataset | Uyghur Single Speaker Speech Dataset (Uyghur-language speech dataset) | 40 | Emerging |
| 3 | speechio/BigCiDian | Pronunciation lexicon covering both English and Chinese languages for... | 40 | Emerging |
| 4 | apluka34/Bud500 | Bud500: A Comprehensive Vietnamese ASR Dataset | 40 | Emerging |
| 5 | harisbinzia/PronouncUR | PronouncUR: An Urdu Pronunciation Lexicon Generator | 39 | Emerging |
| 6 | jonsafari/buckeye_dict | Buckeye Pronunciation Dictionary | 31 | Emerging |
| 7 | gheyret/thuyg20_scripts | Script files of THUYG-20 (A free Uyghur speech database Released by... | 22 | Experimental |
| 8 | skit-ai/phone-number-entity-dataset | Dataset Release for Phone Number Entity capture task | 21 | Experimental |
| 9 | Nexdata-AI/100-Hours-Thai-Children-Spontaneous-Speech-Data | Thai Child's Spontaneous Speech Data | 21 | Experimental |
| 10 | Dragon745/urdu-roman-dictionary | A growing open-source Urdu → Roman Urdu dictionary and lexicon for... | 19 | Experimental |
| 11 | Nexdata-AI/650-Hours-Uyghur-Spontaneous-Speech-Data | 650-Hours-Uyghur-Spontaneous-Speech-Data | 14 | Experimental |
| 12 | Nexdata-AI/347-Hours-Italian-Speech-Data-Collected-by-Mobile-Phone | Italian Speech Dataset | 11 | Experimental |
| 13 | Nexdata-AI/310-Hours-Turkish-Scripted-Monologue-Smartphone-Speech-Dataset | 310-Hours-Turkish-Scripted-Monologue-Smartphone-Speech-Dataset | 10 | Experimental |
| 14 | nakhunchumpolsathien/Thai-ASR-OutOfTheBox-Test-Set | Out-of-the-box test sets for validating Thai automatic speech recognition systems | 10 | Experimental |
| 15 | xx205/switchboard_training_in_minutes | PyTorch with horovod setup for distributed training of Switchboard-1 Phase 1... | 10 | Experimental |
| 16 | Nexdata-AI/233-Hours-Finnish-Spontaneous-Speech-Data | Finnish Spontaneous Speech Data | 10 | Experimental |
| 17 | Nexdata-AI/225-Hours-Swedish-Spontaneous-Speech-Data | Swedish Spontaneous Speech Data | 10 | Experimental |