yc9701/pansori-tedxkr-corpus

Korean ASR Corpus generated from TEDx talks


Emerging

This is a collection of Korean speech audio clips and their corresponding text transcripts, sourced from TEDx talks given in Korea between 2010 and 2014. It provides about 3 hours of high-quality Korean speech from 41 speakers, as paired FLAC audio files and text transcripts. Language researchers, AI developers, and speech technology engineers can use it to train or evaluate Korean speech recognition systems.

No commits in the last 6 months.

Use this if you need a pre-compiled, high-quality dataset of spoken Korean and its text for developing or testing speech recognition models.

Not ideal if you need a large-scale corpus (this one is about 3 hours) or speech from domains or time periods other than TEDx talks from 2010 to 2014.

Korean speech recognition, ASR data, linguistic research, voice technology development, machine learning datasets

Stale 6m, No Package, No Dependents

Maintenance 0 / 25

Adoption 7 / 25

Maturity 16 / 25

Community 12 / 25


Stars

Forks

Language —

License —

Higher-rated alternatives

ynop/audiomate

Python library for handling audio datasets.

reazon-research/ReazonSpeech

Massive open Japanese speech corpus

common-voice/cv-dataset

Metadata and versioning details for the Common Voice dataset

davidmartinrius/speech-dataset-generator

🔊 Create labeled datasets, enhance audio quality, identify speakers, support diverse dataset...

EgorLakomkin/KTSpeechCrawler

Automatically constructing corpus for automatic speech recognition from YouTube videos
