scarletcho/KoLM

Korean text normalization and language preparation package for LM in Kaldi-based ASR system

Quality score: 52 / 100 (Established)

This package helps prepare Korean text data for building speech recognition systems. It takes raw Korean text files and cleans them by normalizing the text, removing unwanted characters, and splitting it into sentences. The output is a refined text corpus and a pronunciation dictionary, ready for language model development in Kaldi-based automatic speech recognition (ASR). The tool is designed for speech scientists, linguists, and ASR engineers working with Korean language models.
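The cleaning steps described above (normalize, remove unwanted characters, split into sentences) can be illustrated with a minimal sketch. This is not KoLM's own API: the function name `clean_korean_text` and the regular expressions are hypothetical, and the sketch assumes that "unwanted characters" means everything other than precomposed Hangul syllables and spaces. It uses only the Python standard library.

```python
import re
import unicodedata
from typing import List

# Hypothetical patterns for illustration; KoLM's actual rules may differ.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")   # split after ., ! or ? followed by whitespace
_UNWANTED = re.compile(r"[^\uAC00-\uD7A3 ]")   # anything except Hangul syllables and spaces
_SPACES = re.compile(r" +")


def clean_korean_text(raw: str) -> List[str]:
    """Return a list of cleaned sentences containing only Hangul and spaces."""
    # NFC composes decomposed jamo sequences into precomposed syllables.
    text = unicodedata.normalize("NFC", raw)
    sentences = []
    for chunk in _SENTENCE_END.split(text):
        # Replace punctuation, digits, Latin letters, newlines, etc. with spaces.
        chunk = _UNWANTED.sub(" ", chunk)
        # Collapse the runs of spaces left behind and trim the ends.
        chunk = _SPACES.sub(" ", chunk).strip()
        if chunk:
            sentences.append(chunk)
    return sentences


if __name__ == "__main__":
    for sentence in clean_korean_text("안녕하세요!  오늘은 날씨가 좋네요 (맑음).\n감사합니다"):
        print(sentence)
```

Note that this sketch simply drops digits and Latin letters; a full normalization pipeline would more likely convert them to their spoken Korean forms before building a pronunciation dictionary.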

No commits in the last 6 months. Available on PyPI.

Use this if you need to meticulously clean, normalize, and preprocess large Korean text corpora to create a robust language model and pronunciation dictionary for an ASR system.

Not ideal if your primary goal is general natural language processing tasks that do not require detailed grapheme-to-phone conversion or specific language model file formats for ASR.

speech-recognition korean-linguistics language-modeling text-preprocessing phonetics
Stale 6m No Dependents
Maintenance 0 / 25
Adoption 8 / 25
Maturity 25 / 25
Community 19 / 25

Stars: 63
Forks: 21
Language: Python
License: GPL-3.0
Last pushed: Apr 23, 2020
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/voice-ai/scarletcho/KoLM"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.