ropensci/tokenizers
Fast, Consistent Tokenization of Natural Language Text
When you have raw natural language text that needs to be broken into meaningful units for analysis, this R package helps. It takes your documents, social media posts, or any other collection of text and consistently splits them into words, sentences, paragraphs, or even smaller character chunks. Anyone doing text analysis, content classification, or digital humanities research will find it useful.
187 stars. No commits in the last 6 months.
Use this if you need a fast and reliable way to prepare text data by breaking it into standard tokens for further linguistic or quantitative analysis in R.
Not ideal if your primary goal is real-time processing of massive, streaming text data or if you need highly specialized domain-specific tokenization rules not covered by general linguistic principles.
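A minimal sketch of typical usage, assuming the tokenizers package is installed from CRAN (the sample text here is illustrative):

```r
library(tokenizers)

text <- "Tokenization splits text into units. Words, sentences, or n-grams."

# Word tokens: lowercased with punctuation stripped by default
tokenize_words(text)

# Sentence tokens: one string per sentence
tokenize_sentences(text)

# Overlapping bigrams of the word tokens
tokenize_ngrams(text, n = 2)
```

Each function returns a list with one element per input document, so the same calls work unchanged on a character vector of many documents.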
Stars: 187
Forks: 24
Language: R
License: —
Category:
Last pushed: Mar 27, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/ropensci/tokenizers"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
google/sentencepiece
Unsupervised text tokenizer for Neural Network-based text generation.
daac-tools/vibrato
🎤 vibrato: Viterbi-based accelerated tokenizer
OpenNMT/Tokenizer
Fast and customizable text tokenization library with BPE and SentencePiece support
Systemcluster/kitoken
Fast and versatile tokenizer for language models, compatible with SentencePiece, Tokenizers,...
soaxelbrooke/python-bpe
Byte Pair Encoding for Python!