fnl/syntok
Text tokenization and sentence segmentation (segtok v2)
This library automatically splits long text into individual sentences and words. It takes raw text documents, such as reports or articles, and outputs a structured list of sentences, with each sentence further broken into its constituent words and punctuation. It is useful for anyone working with text data, such as researchers, data analysts, or content strategists, who needs to prepare text for further analysis.
209 stars. No commits in the last 6 months.
Use this if you need to precisely split text written in English, Spanish, or German into clean sentences and individual words for natural language processing or text analysis.
Not ideal if your primary need is to process text in languages other than English, Spanish, or German, since its segmentation rules are tuned specifically to those three languages.
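To illustrate the kind of structured output such a segmenter produces, here is a minimal, stdlib-only sketch. This is not syntok's actual implementation; it is a naive regex splitter written to show the sentence-then-token nesting, and its failure on abbreviations hints at why a dedicated tool is worth using.

```python
import re

def segment(text):
    """Naive sentence splitter: breaks on ., !, or ? followed by
    whitespace and an uppercase letter, then tokenizes each sentence
    into words and punctuation. Real segmenters like syntok also
    handle abbreviations, ellipses, dates, and quoted sentences."""
    sentences = re.split(r'(?<=[.!?])\s+(?=[A-Z])', text.strip())
    return [re.findall(r"\w+|[^\w\s]", s) for s in sentences]

doc = "Dr. Smith arrived. He spoke briefly!"
for tokens in segment(doc):
    print(tokens)
# → ['Dr', '.']
#   ['Smith', 'arrived', '.']
#   ['He', 'spoke', 'briefly', '!']
```

Note how the naive rule wrongly splits after the abbreviation "Dr." — correctly handling such cases in English, Spanish, and German is exactly the problem syntok specializes in.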
Stars
209
Forks
35
Language
Python
License
MIT
Category
NLP
Last pushed
Mar 12, 2022
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/fnl/syntok"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
EmilStenstrom/conllu
A CoNLL-U parser that takes a CoNLL-U formatted string and turns it into a nested python dictionary.
OpenPecha/Botok
🏷 བོད་ཏོག [pʰøtɔk̚] Tibetan word tokenizer in Python
zaemyung/sentsplit
A flexible sentence segmentation library using CRF model and regex rules
taishi-i/nagisa
A Japanese tokenizer based on recurrent neural networks
natasha/razdel
Rule-based token, sentence segmentation for Russian language