thisiscetin/textoken
Simple and customizable text tokenization gem.
Textoken helps Ruby developers break natural language text into individual words, or tokens, so that the text is easier to analyze. You provide a string, and it returns a list of words or phrases, with options to include or exclude specific types such as punctuation, numbers, or dates. This is useful for applications that need to process and understand text data, such as web crawlers or text analysis tools.
No commits in the last 6 months.
Use this if you are a Ruby developer building an application that needs to extract and categorize specific words or elements from text based on patterns or length.
Not ideal if you need a solution for a programming language other than Ruby or require advanced linguistic analysis like part-of-speech tagging or sentiment analysis.
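The behavior described above (a string goes in, a filtered list of tokens comes out) can be sketched in plain Ruby. This is only an illustration of the idea, not Textoken's own code or API: the `simple_tokens` method, the `exclude:` and `more_than:` options, the type names, and the regular expressions are all assumptions made for this example.

```ruby
# Hypothetical illustration only; none of these names come from the Textoken gem.

# Matches, in priority order: a date such as 12/12/2012, a number,
# a run of letters, or a single punctuation character.
TOKEN_PATTERN = /\d{1,2}[\/.-]\d{1,2}[\/.-]\d{2,4}|\d+(?:\.\d+)?|[[:alpha:]]+|[[:punct:]]/

# Patterns used to classify a whole token by type.
TOKEN_TYPES = {
  dates:        /\A\d{1,2}[\/.-]\d{1,2}[\/.-]\d{2,4}\z/,
  numerics:     /\A\d+(?:\.\d+)?\z/,
  punctuations: /\A[[:punct:]]+\z/
}.freeze

# Split text into tokens, drop any token whose type is listed in `exclude`,
# and, if `more_than` is given, keep only tokens longer than that many characters.
def simple_tokens(text, exclude: [], more_than: nil)
  tokens = text.scan(TOKEN_PATTERN)
  tokens = tokens.reject do |token|
    exclude.any? { |type| TOKEN_TYPES.fetch(type).match?(token) }
  end
  tokens = tokens.select { |token| token.length > more_than } if more_than
  tokens
end

text = "Oh, no! Alfa 12/12/2012 costs 42."
simple_tokens(text, exclude: %i[dates numerics punctuations])
# => ["Oh", "no", "Alfa", "costs"]
```

Filtering by type and filtering by length are kept as separate steps here so that each option can be used on its own; the gem itself may combine or name these options differently.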
Stars: 31
Forks: 3
Language: Ruby
License: MIT
Category:
Last pushed: Sep 28, 2021
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/thisiscetin/textoken"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
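For Ruby callers, the same request can be made with the standard library. The sketch below is a minimal example under stated assumptions: the `quality_uri` and `fetch_quality` helper names are hypothetical, and treating `nlp` as a swappable category segment of the path is a guess based only on the URL shown above. The response format is not described on this page, so the body is returned as a raw string rather than parsed.

```ruby
require "net/http"
require "uri"

# Build the endpoint URL for an owner/repo pair.
# Assumption: the "nlp" path segment is a category that other listings may vary.
def quality_uri(owner, repo, category: "nlp")
  URI("https://pt-edge.onrender.com/api/v1/quality/#{category}/#{owner}/#{repo}")
end

# Perform the GET request and return the raw response body.
# Raises if the server does not answer with a 2xx status.
def fetch_quality(owner, repo)
  response = Net::HTTP.get_response(quality_uri(owner, repo))
  raise "request failed with status #{response.code}" unless response.is_a?(Net::HTTPSuccess)

  response.body
end

# Usage (makes a network call, counted against the daily request limit):
# puts fetch_quality("thisiscetin", "textoken")
```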
Higher-rated alternatives
google/sentencepiece
Unsupervised text tokenizer for Neural Network-based text generation.
daac-tools/vibrato
🎤 vibrato: Viterbi-based accelerated tokenizer
OpenNMT/Tokenizer
Fast and customizable text tokenization library with BPE and SentencePiece support
Systemcluster/kitoken
Fast and versatile tokenizer for language models, compatible with SentencePiece, Tokenizers,...
daac-tools/vaporetto
🛥 Vaporetto: Very accelerated pointwise prediction based tokenizer