fnl/syntok
Text tokenization and sentence segmentation (segtok v2)
This library automatically splits long text into individual sentences and words. It takes raw text documents, such as reports or articles, and outputs a structured list of sentences, with each sentence further broken into its constituent words and punctuation. It is useful for anyone working with text data, such as researchers, data analysts, or content strategists, who needs to prepare text for further analysis.
209 stars. No commits in the last 6 months.
Use this if you need to precisely split text written in English, Spanish, or German into clean sentences and individual words for natural language processing or text analysis.
Not ideal if your primary need is to process text in languages other than English, Spanish, or German, since its segmentation rules are tuned specifically to those three languages.
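To illustrate the kind of structured output such a segmenter produces, here is a minimal, stdlib-only sketch. This is not syntok's actual implementation; it is a naive regex splitter written to show the sentence-then-token nesting, and its failure on abbreviations hints at why a dedicated tool is worth using.

```python
import re

def segment(text):
    """Naive sentence splitter: breaks on ., !, or ? followed by
    whitespace and an uppercase letter, then tokenizes each sentence
    into words and punctuation. Real segmenters like syntok also
    handle abbreviations, ellipses, dates, and quoted sentences."""
    sentences = re.split(r'(?<=[.!?])\s+(?=[A-Z])', text.strip())
    return [re.findall(r"\w+|[^\w\s]", s) for s in sentences]

doc = "Dr. Smith arrived. He spoke briefly!"
for tokens in segment(doc):
    print(tokens)
# → ['Dr', '.']
#   ['Smith', 'arrived', '.']
#   ['He', 'spoke', 'briefly', '!']
```

Note how the naive rule wrongly splits after the abbreviation "Dr." — correctly handling such cases in English, Spanish, and German is exactly the problem syntok specializes in.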
Stars
209
Forks
35
Language
Python
License
MIT
Category
NLP
Last pushed
Mar 12, 2022
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/fnl/syntok"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
EmilStenstrom/conllu
A CoNLL-U parser that takes a CoNLL-U formatted string and turns it into a nested python dictionary.
OpenPecha/Botok
🏷 བོད་ཏོག [pʰøtɔk̚] Tibetan word tokenizer in Python
zaemyung/sentsplit
A flexible sentence segmentation library using CRF model and regex rules
taishi-i/nagisa
A Japanese tokenizer based on recurrent neural networks
natasha/razdel
Rule-based token, sentence segmentation for Russian language