AragonerUA/SampoNLP
A corpus-free toolkit for morphological lexicon creation and tokenizer evaluation using MDL-inspired atomicity scoring for Uralic languages
This project helps linguists and language researchers automatically break down words into their basic meaning units (morphemes) without needing pre-labeled data. You feed it large amounts of raw text in languages like Finnish, Estonian, or Hungarian, and it outputs a lexicon of discovered morphemes and how words are composed of them. It's designed for computational linguists or researchers working with morphologically complex languages.
Available on PyPI.
Use this if you need to perform unsupervised morphological analysis and build a morpheme lexicon for Uralic languages from raw text.
Not ideal if you are working with languages that are not morphologically rich, or if you need a ready-made, rule-based morphological analyzer for a specific task.
Stars: 8
Forks: not listed
Language: Python
License: not listed
Category: not listed
Last pushed: Dec 11, 2025
Commits (30d): 0
Dependencies: 2
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/AragonerUA/SampoNLP"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
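The same request can be made from Python. The following is a minimal sketch using only the standard library. The helper names `build_quality_url` and `fetch_quality` are hypothetical, and the assumption that the endpoint returns JSON is not confirmed by this page; adjust the parsing if the response turns out to be in another format.

```python
import json
import urllib.request
from urllib.parse import quote

# Base path copied from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality/nlp"


def build_quality_url(owner: str, repo: str) -> str:
    """Build the endpoint URL for one repository (hypothetical helper).

    Each path segment is percent-encoded so unusual names stay valid.
    """
    return f"{BASE_URL}/{quote(owner, safe='')}/{quote(repo, safe='')}"


def fetch_quality(owner: str, repo: str, timeout: float = 10.0):
    """Fetch the record for one repository (hypothetical helper).

    Assumption: the response body is JSON. No API key is sent, so the
    keyless daily request limit mentioned above applies.
    """
    url = build_quality_url(owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Example call (performs a network request, so it is left commented out):
# data = fetch_quality("AragonerUA", "SampoNLP")
# print(data)
```

Keeping URL construction separate from the network call makes it easy to check the URL without spending any of the daily request quota.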
Higher-rated alternatives
mikahama/uralicNLP
An NLP library for Uralic languages such as Finnish, Skolt Sami, Moksha and so on. Also...
SkyworkAI/Skywork
Skywork series models are pre-trained on 3.2TB of high-quality multilingual (mainly Chinese and...
gia-uh/lingo
A Python library for context engineering.
shamspias/lexsublm-lite
A laptop‑friendly toolkit for context‑aware single‑word paraphrasing and lexical‑substitution...
jiangnanboy/llm_corpus_quality
Chinese corpus cleaning and quality assessment for large-model pre-training