zencephalon/Tactful_Tokenizer
Accurate Bayesian sentence tokenizer in Ruby.
This tool helps developers split raw text into individual sentences accurately, even when the text contains question marks, exclamation points, and abbreviations whose periods do not end a sentence. It takes a block of text, optionally with some HTML formatting, and returns a list of separated sentences. A Ruby developer working on natural language processing tasks would find it useful for text preparation.
No commits in the last 6 months.
Use this if you are a Ruby developer who needs to break unstructured text, including Unicode text or text with simple HTML tags, into discrete sentences for further analysis.
Not ideal if you work outside Ruby or need robust HTML parsing beyond basic tag recognition.
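To make the abbreviation problem mentioned above concrete, here is a minimal Ruby sketch. It is not the library's Bayesian method and does not use the library's API: `naive_split`, `heuristic_split`, and the `ABBREVIATIONS` list are hypothetical names invented for this illustration. It only shows why splitting on punctuation alone goes wrong.

```ruby
# Naive approach: split on whitespace that follows ".", "?" or "!".
def naive_split(text)
  text.split(/(?<=[.?!])\s+/)
end

# Hypothetical abbreviation list, for illustration only.
ABBREVIATIONS = %w[U.S. Dr. Mr. Mrs. e.g. i.e.].freeze

# Heuristic: if the previous piece ends in a known abbreviation,
# glue the current piece back onto it instead of starting a new sentence.
def heuristic_split(text)
  naive_split(text).each_with_object([]) do |piece, sentences|
    previous = sentences.last
    if previous && ABBREVIATIONS.any? { |abbr| previous.end_with?(abbr) }
      sentences[-1] = "#{previous} #{piece}"
    else
      sentences << piece
    end
  end
end

text = "Here in the U.S. Senate we vote. Is it easy? Yes."

p naive_split(text)
# => ["Here in the U.S.", "Senate we vote.", "Is it easy?", "Yes."]

p heuristic_split(text)
# => ["Here in the U.S. Senate we vote.", "Is it easy?", "Yes."]
```

The fixed list breaks down as soon as a sentence genuinely ends with an abbreviation, or an abbreviation is missing from the list. That is the kind of ambiguity a statistical tokenizer is meant to resolve.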
Stars: 80
Forks: 13
Language: Ruby
License: not specified
Category: (not shown)
Last pushed: Apr 30, 2014
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/zencephalon/Tactful_Tokenizer"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
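For Ruby users, the same request can be sketched with the standard library. This is a hedged sketch: the base URL is copied from the curl command above, but the helper names `quality_uri` and `fetch_quality` are made up here. Treating the path segment after `quality` as a topic is an assumption, and the response is assumed to be JSON because the page does not show its schema.

```ruby
require "net/http"
require "json"
require "uri"

# Base URL copied from the curl command; everything else is illustrative.
API_BASE = "https://pt-edge.onrender.com/api/v1/quality"

# Build the endpoint URI from a topic segment (assumed meaning) and an
# "owner/name" repository slug.
def quality_uri(topic, repo)
  URI("#{API_BASE}/#{topic}/#{repo}")
end

# Perform the GET request and decode the body.
# Assumption: the endpoint returns JSON; the parsed value is returned as-is.
def fetch_quality(topic, repo)
  response = Net::HTTP.get_response(quality_uri(topic, repo))
  raise "Request failed: HTTP #{response.code}" unless response.is_a?(Net::HTTPSuccess)

  JSON.parse(response.body)
end

# Example call (performs a real network request, so it is left commented out):
# data = fetch_quality("nlp", "zencephalon/Tactful_Tokenizer")
# puts JSON.pretty_generate(data)
```

Keeping URL construction separate from the network call makes it possible to check the request target without spending any of the daily request quota.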
Higher-rated alternatives
google/sentencepiece
Unsupervised text tokenizer for Neural Network-based text generation.
daac-tools/vibrato
🎤 vibrato: Viterbi-based accelerated tokenizer
OpenNMT/Tokenizer
Fast and customizable text tokenization library with BPE and SentencePiece support
Systemcluster/kitoken
Fast and versatile tokenizer for language models, compatible with SentencePiece, Tokenizers,...
soaxelbrooke/python-bpe
Byte Pair Encoding for Python!