bnosac/sentencepiece
R package for Byte Pair Encoding / Unigram modelling based on SentencePiece
This tool helps you prepare text data for advanced analysis by breaking sentences down into smaller, meaningful units like subwords or characters. It takes raw text as input and outputs tokenized text or numerical IDs, which can then be fed into machine learning models. Anyone working with text data in R, such as data scientists, computational linguists, or researchers, would find this useful for natural language processing tasks.
Use this if you need to precisely control how text is broken down into tokens for natural language processing tasks within R, especially for languages with complex word structures.
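To make the idea of subword tokenization concrete, here is a toy Byte Pair Encoding sketch in plain Python. It is not the package's API and not how SentencePiece is implemented; the function names (`learn_bpe`, `encode`, `build_vocab`, `to_ids`) and the tie-breaking rule are assumptions chosen for illustration. It only shows the general flow described above: raw text in, subword pieces or numerical IDs out.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the joined symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy BPE)."""
    # Each distinct word starts as a tuple of single characters, with its frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties are broken lexicographically (an assumption
        # made here only to keep the result deterministic).
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)
        merged_vocab = Counter()
        for symbols, freq in vocab.items():
            merged_vocab[tuple(apply_merge(list(symbols), best))] += freq
        vocab = merged_vocab
    return merges


def encode(word, merges):
    """Split a word into subword pieces by replaying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols


def build_vocab(words, merges):
    """Map every known piece to an integer ID; ID 0 is reserved for unknown pieces."""
    pieces = ["<unk>"] + sorted({ch for w in words for ch in w})
    for a, b in merges:
        if a + b not in pieces:
            pieces.append(a + b)
    return {piece: idx for idx, piece in enumerate(pieces)}


def to_ids(pieces, vocab):
    """Convert subword pieces to IDs, falling back to the <unk> ID."""
    return [vocab.get(p, vocab["<unk>"]) for p in pieces]


corpus = ["low", "lower", "lowest", "low"]
merges = learn_bpe(corpus, num_merges=2)
vocab = build_vocab(corpus, merges)
pieces = encode("lowest", merges)
ids = to_ids(pieces, vocab)
```

With this tiny corpus the two learned merges build the piece `low`, so `lowest` splits into `low`, `e`, `s`, `t`. A word never seen during training, such as `slow`, still tokenizes into known pieces, which is the main practical appeal of subword models for languages with complex word structures.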
Not ideal if you only need basic word tokenization or if your primary work isn't done in the R programming environment.
Stars: 28
Forks: 6
Language: C++
License: MPL-2.0
Category:
Last pushed: Feb 09, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/bnosac/sentencepiece"
Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000/day.
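The same request can be sketched in Python using only the standard library. Two things here are assumptions rather than documented facts: that the three path segments mean category, owner, and repository (inferred from the single URL shown), and that the endpoint returns JSON. The helper names `quality_url` and `fetch_quality` are hypothetical.

```python
import json
import urllib.request
from urllib.parse import quote

# Base path taken from the curl example; everything after it is appended per request.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(category, owner, repo):
    """Build the endpoint URL, assuming a /<category>/<owner>/<repo> layout."""
    parts = [quote(part, safe="") for part in (category, owner, repo)]
    return "/".join([BASE_URL] + parts)


def fetch_quality(category, owner, repo, timeout=10):
    """Fetch and parse the response, assuming the body is JSON (not confirmed by the page)."""
    url = quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_quality("nlp", "bnosac", "sentencepiece")` would issue the same GET request as the curl command and count against the keyless daily limit.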
Higher-rated alternatives
google/sentencepiece
Unsupervised text tokenizer for Neural Network-based text generation.
daac-tools/vibrato
🎤 vibrato: Viterbi-based accelerated tokenizer
OpenNMT/Tokenizer
Fast and customizable text tokenization library with BPE and SentencePiece support
Systemcluster/kitoken
Fast and versatile tokenizer for language models, compatible with SentencePiece, Tokenizers,...
soaxelbrooke/python-bpe
Byte Pair Encoding for Python!