ropensci/tokenizers
Fast, Consistent Tokenization of Natural Language Text
When you have raw natural language text that needs to be broken into meaningful units for analysis, this R package helps. It takes your documents, social media posts, or any other collection of text and consistently splits them into words, sentences, paragraphs, or even smaller character chunks. Anyone doing text analysis, content classification, or digital humanities research will find it useful.
187 stars. No commits in the last 6 months.
Use this if you need a fast and reliable way to prepare text data by breaking it into standard tokens for further linguistic or quantitative analysis in R.
Not ideal if your primary goal is real-time processing of massive, streaming text data or if you need highly specialized domain-specific tokenization rules not covered by general linguistic principles.
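A minimal sketch of typical usage, assuming the tokenizers package is installed from CRAN (the sample text here is illustrative):

```r
library(tokenizers)

text <- "Tokenization splits text into units. Words, sentences, or n-grams."

# Word tokens: lowercased with punctuation stripped by default
tokenize_words(text)

# Sentence tokens: one string per sentence
tokenize_sentences(text)

# Overlapping bigrams of the word tokens
tokenize_ngrams(text, n = 2)
```

Each function returns a list with one element per input document, so the same calls work unchanged on a character vector of many documents.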
Stars: 187
Forks: 24
Language: R
License: —
Category:
Last pushed: Mar 27, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/ropensci/tokenizers"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
google/sentencepiece
Unsupervised text tokenizer for Neural Network-based text generation.
daac-tools/vibrato
🎤 vibrato: Viterbi-based accelerated tokenizer
OpenNMT/Tokenizer
Fast and customizable text tokenization library with BPE and SentencePiece support
Systemcluster/kitoken
Fast and versatile tokenizer for language models, compatible with SentencePiece, Tokenizers,...
soaxelbrooke/python-bpe
Byte Pair Encoding for Python!