AragonerUA/SampoNLP

A corpus-free toolkit for morphological lexicon creation and tokenizer evaluation using MDL-inspired atomicity scoring for Uralic languages

Emerging

This project helps linguists and language researchers automatically break down words into their basic meaning units (morphemes) without needing pre-labeled data. You feed it large amounts of raw text in languages like Finnish, Estonian, or Hungarian, and it outputs a lexicon of discovered morphemes and how words are composed of them. It's designed for computational linguists or researchers working with morphologically complex languages.

Available on PyPI.

Use this if you need to perform unsupervised morphological analysis and build a morpheme lexicon for Uralic languages from raw text.

Not ideal if you are working with languages that are not morphologically rich or if you require a pre-trained, rule-based morphological analyzer for highly specific tasks.

computational-linguistics morphological-analysis natural-language-processing uralic-languages lexicography

Maintenance 6 / 25

Adoption 4 / 25

Maturity 22 / 25

Community 0 / 25

Language

Python

License

—

Higher-rated alternatives

mikahama/uralicNLP

An NLP library for Uralic languages such as Finnish, Skolt Sami, Moksha and so on. Also...

SkyworkAI/Skywork

Skywork series models are pre-trained on 3.2TB of high-quality multilingual (mainly Chinese and...

gia-uh/lingo

A Python library for context engineering.

shamspias/lexsublm-lite

A laptop‑friendly toolkit for context‑aware single‑word paraphrasing and lexical‑substitution...

jiangnanboy/llm_corpus_quality

Chinese corpus cleaning and quality evaluation for large-model pre-training
