AragonerUA/SampoNLP

A corpus-free toolkit for morphological lexicon creation and tokenizer evaluation using MDL-inspired atomicity scoring for Uralic languages

Score: 32 / 100 (Emerging)

This project helps linguists and language researchers automatically break down words into their basic meaning units (morphemes) without needing pre-labeled data. You feed it large amounts of raw text in languages like Finnish, Estonian, or Hungarian, and it outputs a lexicon of discovered morphemes and how words are composed of them. It's designed for computational linguists or researchers working with morphologically complex languages.
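To make the minimum description length (MDL) idea concrete, here is a toy sketch in Python. It is not SampoNLP's algorithm or API: the cost function, the `BITS_PER_CHAR` constant, and the hand-made Finnish segmentations are all assumptions chosen for illustration. It only shows why reusing morphemes can give a shorter overall description than storing every word form whole.

```python
import math
from collections import Counter

# Assumed flat cost, in bits, of storing one character in the lexicon.
BITS_PER_CHAR = 5.0


def description_length(segmented_words):
    """Two-part cost in bits: lexicon cost plus cost of encoding the words.

    segmented_words is a list of words, each given as a list of morphs.
    """
    tokens = [morph for word in segmented_words for morph in word]
    counts = Counter(tokens)
    total = len(tokens)
    # Lexicon part: every distinct morph is stored once (+1 for an end marker).
    lexicon_bits = sum((len(morph) + 1) * BITS_PER_CHAR for morph in counts)
    # Data part: each token costs -log2 of its relative frequency.
    data_bits = -sum(c * math.log2(c / total) for c in counts.values())
    return lexicon_bits + data_bits


# Hypothetical toy data: six Finnish word forms, unsegmented vs. hand-segmented.
whole = [["talo"], ["talossa"], ["talolla"], ["auto"], ["autossa"], ["autolla"]]
split = [
    ["talo"], ["talo", "ssa"], ["talo", "lla"],
    ["auto"], ["auto", "ssa"], ["auto", "lla"],
]

whole_bits = description_length(whole)
split_bits = description_length(split)
```

On this toy data the segmented analysis comes out cheaper, because four short morphs replace six longer lexicon entries. A real system would search over candidate segmentations rather than compare two fixed ones.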

Available on PyPI.

Use this if you need to perform unsupervised morphological analysis and build a morpheme lexicon for Uralic languages from raw text.

Not ideal if you are working with languages that are not morphologically rich or if you require a pre-trained, rule-based morphological analyzer for highly specific tasks.

computational-linguistics morphological-analysis natural-language-processing uralic-languages lexicography
Maintenance 6 / 25
Adoption 4 / 25
Maturity 22 / 25
Community 0 / 25

Stars: 8
Forks: (not listed)
Language: Python
License: (not listed)
Last pushed: Dec 11, 2025
Commits (30d): 0
Dependencies: 2

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/nlp/AragonerUA/SampoNLP"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
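
The same request can be made from Python using only the standard library. This is a minimal sketch under two assumptions: the path layout `category/owner/repo` is inferred from the single curl example above, and the endpoint is assumed to return JSON, which the page does not state. The helper names `quality_url` and `fetch_quality` are hypothetical.

```python
import json
import urllib.request
from urllib.parse import quote

# Base path taken from the curl example; everything after it is inferred.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(category: str, owner: str, repo: str) -> str:
    """Build the request URL, percent-encoding each path segment."""
    parts = [quote(part, safe="") for part in (category, owner, repo)]
    return BASE_URL + "/" + "/".join(parts)


def fetch_quality(category: str, owner: str, repo: str, timeout: float = 10.0):
    """Fetch the record for one repository.

    Assumes a JSON response body; raises urllib.error.URLError on
    network failure and json.JSONDecodeError if the body is not JSON.
    """
    request = urllib.request.Request(
        quality_url(category, owner, repo),
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_quality("nlp", "AragonerUA", "SampoNLP")` issues the same GET request as the curl command. With the keyless limit of 100 requests/day, it is worth caching the result rather than refetching on every run.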