yc9701/pansori-tedxkr-corpus
Korean ASR Corpus generated from TEDx talks
This is a collection of Korean speech audio clips and their corresponding text transcripts, sourced from TEDx talks given in Korea between 2010 and 2014. It provides high-quality Korean speech data, about 3 hours in total from 41 speakers, as paired FLAC audio files and text transcripts. Language researchers, AI developers, and speech technology engineers can use it to train or evaluate Korean speech recognition systems.
No commits in the last 6 months.
Use this if you need a pre-compiled, high-quality dataset of spoken Korean and its text for developing or testing speech recognition models.
Not ideal if you need a very large-scale corpus (this is about 3 hours) or require speech data from different domains or time periods beyond TEDx talks from 2010-2014.
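Because the corpus is described as FLAC audio files paired with text, a loader typically starts by matching each clip to its transcript. The sketch below is only an illustration: the function name `iter_pairs` and the layout it assumes (each `clip.flac` has a sibling `clip.txt` holding its transcript) are hypothetical and not taken from the repository, so check the actual directory structure before relying on it.

```python
from pathlib import Path
from typing import Iterator, Tuple


def iter_pairs(root: str) -> Iterator[Tuple[Path, str]]:
    """Yield (audio_path, transcript) for every FLAC clip under `root`.

    ASSUMPTION: each `<name>.flac` has a sibling `<name>.txt` transcript.
    This layout is hypothetical; the real corpus may be organized differently.
    """
    for flac_path in sorted(Path(root).rglob("*.flac")):
        txt_path = flac_path.with_suffix(".txt")
        if not txt_path.is_file():
            # Skip clips without a matching transcript instead of failing.
            continue
        # Korean text: read explicitly as UTF-8 and drop surrounding whitespace.
        yield flac_path, txt_path.read_text(encoding="utf-8").strip()
```

Clips that lack a transcript are skipped silently here; for evaluation work it may be preferable to log or count them so that missing data does not go unnoticed.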
Stars: 27
Forks: 4
Language: —
License: —
Category:
Last pushed: Jan 11, 2019
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/voice-ai/yc9701/pansori-tedxkr-corpus"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
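The same request can be made from a script. The following is a minimal sketch using only the Python standard library. The helper names `quality_url` and `fetch_quality` are made up for illustration, the `category/owner/name` path pattern is inferred from the single URL shown above, and the JSON response format is an assumption, since the page does not document the response.

```python
import json
import urllib.request
from urllib.parse import quote

# Base path taken from the curl example on this page.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(category: str, repo: str) -> str:
    """Build the endpoint URL for a repository given as 'owner/name'.

    ASSUMPTION: the path is <category>/<owner>/<name>, as in the one
    example URL on this page; other categories are not confirmed.
    """
    owner, name = repo.split("/", 1)
    return f"{BASE_URL}/{quote(category)}/{quote(owner)}/{quote(name)}"


def fetch_quality(category: str, repo: str, timeout: float = 10.0):
    """Fetch the record for `repo`.

    ASSUMPTION: the endpoint returns JSON; this is not documented here.
    """
    url = quality_url(category, repo)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

Calling `fetch_quality("voice-ai", "yc9701/pansori-tedxkr-corpus")` issues the same GET request as the curl command. With the keyless limit of 100 requests/day, caching the result locally is a sensible precaution when polling many repositories.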
Higher-rated alternatives
ynop/audiomate
Python library for handling audio datasets.
reazon-research/ReazonSpeech
Massive open Japanese speech corpus
common-voice/cv-dataset
Metadata and versioning details for the Common Voice dataset
davidmartinrius/speech-dataset-generator
🔊 Create labeled datasets, enhance audio quality, identify speakers, support diverse dataset...
EgorLakomkin/KTSpeechCrawler
Automatically constructing corpus for automatic speech recognition from YouTube videos