bnosac/sentencepiece

R package for Byte Pair Encoding / Unigram modelling based on Sentencepiece


Emerging

This R package prepares text for downstream analysis by splitting sentences into smaller units such as subwords or characters, using Byte Pair Encoding or Unigram models built on SentencePiece. It takes raw text as input and outputs tokenized text or numerical IDs that can be fed into machine learning models. It is aimed at anyone working with text data in R, such as data scientists, computational linguists, and researchers doing natural language processing.

Use this if you need to precisely control how text is broken down into tokens for natural language processing tasks within R, especially for languages with complex word structures.

Not ideal if you only need basic word tokenization or if your primary work isn't done in the R programming environment.
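
To make the subword idea in the description concrete, here is a minimal toy sketch of Byte Pair Encoding, written in Python for illustration. It is not this package's R API and not SentencePiece's actual implementation; the function names (train_bpe, merge_pair, encode, build_vocab) and the tiny corpus are hypothetical. It only shows how frequent adjacent symbol pairs get merged into subword pieces and how pieces map to numerical IDs.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy BPE)."""
    # Each word starts as a tuple of single characters, weighted by its frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the corpus.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken by the pair itself
        # so that the result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Split `word` into subword pieces by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


def build_vocab(words, merges):
    """Assign an integer ID to every single character and every merged piece."""
    chars = sorted({ch for w in words for ch in w})
    merged = [a + b for a, b in merges]
    pieces = list(dict.fromkeys(chars + merged))  # de-duplicate, keep order
    return {piece: i for i, piece in enumerate(pieces)}


corpus = ["low", "low", "lower", "lowest", "newest", "newest"]
merges = train_bpe(corpus, num_merges=3)
piece_to_id = build_vocab(corpus, merges)

pieces = encode("lowest", merges)
ids = [piece_to_id[p] for p in pieces]
print(pieces, ids)
```

The sketch works on pre-split words and fails on characters it has never seen. SentencePiece itself operates on raw text and also offers a Unigram model, so treat this only as an aid to intuition.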

text-analysis natural-language-processing data-preparation computational-linguistics

No Package No Dependents

Maintenance 10 / 25

Adoption 7 / 25

Maturity 16 / 25

Community 16 / 25

Language: C++

License: MPL-2.0

Higher-rated alternatives

google/sentencepiece

Unsupervised text tokenizer for Neural Network-based text generation.

daac-tools/vibrato

🎤 vibrato: Viterbi-based accelerated tokenizer

OpenNMT/Tokenizer

Fast and customizable text tokenization library with BPE and SentencePiece support

Systemcluster/kitoken

Fast and versatile tokenizer for language models, compatible with SentencePiece, Tokenizers,...

soaxelbrooke/python-bpe

Byte Pair Encoding for Python!
