Sovichea/khmer_segmenter
A zero-dependency, high-performance Khmer word segmenter using the Viterbi algorithm. Optimized for dictionary accuracy, ultra-low memory footprint, and edge deployment.
This project helps anyone working with Khmer text by breaking sentences into individual words. Given raw Khmer text, it returns the text segmented into its constituent words and highlights any unknown terms. It is suited to linguists, content creators, and data analysts who need precise word boundaries for further analysis or application development.
Use this if you need a reliable, deterministic, and fast way to segment Khmer text into words without relying on inconsistent manual annotations or complex machine learning setups.
Not ideal if your primary need is for a system that learns word boundaries from highly diverse, uncurated, and inconsistent text data, as this tool prioritizes dictionary accuracy.
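To make the dictionary-plus-Viterbi idea above concrete, here is a minimal sketch of how such a segmenter can work. It is not the project's actual code: the toy dictionary, the frequency numbers, the unknown-character penalty, and the function names `build_costs` and `segment` are all assumptions made for illustration. It also works on raw code points, whereas a real Khmer segmenter would likely need to respect consonant-cluster boundaries.

```python
import math

# Hypothetical toy dictionary: word -> made-up relative frequency.
TOY_DICT = {
    "ខ្ញុំ": 50,
    "ស្រឡាញ់": 20,
    "ភាសា": 30,
    "ខ្មែរ": 40,
}

# Assumed flat penalty for each character not covered by a dictionary word.
UNKNOWN_COST = 20.0


def build_costs(freqs):
    """Turn frequencies into costs (negative log probabilities)."""
    total = sum(freqs.values())
    return {word: -math.log(f / total) for word, f in freqs.items()}


def segment(text, costs, unknown_cost=UNKNOWN_COST):
    """Return [(token, is_known), ...] along the lowest-cost path."""
    n = len(text)
    max_len = max(map(len, costs), default=1)
    best = [math.inf] * (n + 1)    # best[i]: cheapest cost to cover text[:i]
    back = [(0, False)] * (n + 1)  # back[i]: (start of last token, is_known)
    best[0] = 0.0

    for end in range(1, n + 1):
        # Fallback: treat the single character before `end` as unknown.
        best[end] = best[end - 1] + unknown_cost
        back[end] = (end - 1, False)
        # Try every dictionary word that ends exactly at `end`.
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in costs and best[start] + costs[word] < best[end]:
                best[end] = best[start] + costs[word]
                back[end] = (start, True)

    # Walk the back-pointers from the end to recover the tokens.
    tokens = []
    end = n
    while end > 0:
        start, known = back[end]
        tokens.append((text[start:end], known))
        end = start
    tokens.reverse()

    # Merge runs of adjacent unknown characters into one unknown token.
    merged = []
    for word, known in tokens:
        if not known and merged and not merged[-1][1]:
            merged[-1] = (merged[-1][0] + word, False)
        else:
            merged.append((word, known))
    return merged
```

Calling `segment(text, build_costs(TOY_DICT))` returns a list of `(token, is_known)` pairs, so unknown terms can be highlighted by checking the second element. Because the result depends only on the dictionary and the costs, the same input always yields the same segmentation.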
Stars: 34
Forks: 4
Language: Python
License: MIT
Category:
Last pushed: Jan 08, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/Sovichea/khmer_segmenter"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
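The same request can be made from Python using only the standard library. This is a sketch under assumptions: the `category/owner/repo` path pattern is inferred from the single URL above, the response is assumed to be JSON, and the helper names `quality_url` and `fetch_quality` are hypothetical. How an API key is passed is not documented here, so the sketch covers only the keyless tier.

```python
import json
import urllib.request

# Base of the endpoint shown in the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(category, owner, repo):
    """Build the endpoint URL (path pattern assumed from the one example)."""
    return f"{BASE_URL}/{category}/{owner}/{repo}"


def fetch_quality(category, owner, repo, timeout=10.0):
    """Fetch the listing data; assumes the endpoint returns JSON."""
    request = urllib.request.Request(
        quality_url(category, owner, repo),
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

For this repository the call would be `fetch_quality("nlp", "Sovichea", "khmer_segmenter")`. The fields of the returned object are not described on this page, so inspect the result before relying on any key.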
Higher-rated alternatives
VietHoang1512/khmer-nltk
Khmer language processing toolkit
PyThaiNLP/attacut
A Fast and Accurate Neural Thai Word Segmenter
UlugbekSalaev/UzTransliterator
UzTransliterator | State-of-the-art machine transliteration tool for Uzbek language
seanghay/KhmerOCR
A Fast Khmer Optical Character Recognition (KhmerOCR)
seanghay/khmerphonemizer
A Free, Standalone and Open-Source Khmer Grapheme-to-Phonemes.