bnosac/sentencepiece

R package for Byte Pair Encoding / Unigram modelling based on Sentencepiece

Score: 49 / 100 (Emerging)

This package prepares text for natural language processing by splitting sentences into smaller units such as subwords or characters. It takes raw text as input and outputs either tokenized text or numerical IDs, which can then be fed into machine learning models. It is aimed at anyone working with text data in R, such as data scientists, computational linguists, and researchers.

Use this if you need to precisely control how text is broken down into tokens for natural language processing tasks within R, especially for languages with complex word structures.

Not ideal if you only need basic word tokenization or if you do not work primarily in R.
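
To make the idea of subword units concrete, here is a toy sketch of byte pair encoding in Python. It is not this package's implementation or its R API; the function names `learn_bpe`, `apply_merge`, and `encode`, the tie-breaking rule, and the tiny word list are all assumptions chosen for illustration. It only shows the general technique: repeatedly merge the most frequent adjacent pair of symbols, then reuse the learned merges to split new words and map the pieces to integer IDs.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace each adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy BPE)."""
    # Start with every word split into single characters.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties are broken by the pair itself
        # (an arbitrary but deterministic choice made for this sketch).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[tuple(apply_merge(symbols, best))] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Split `word` into subword pieces by replaying the merges in learned order."""
    symbols = list(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols


if __name__ == "__main__":
    corpus = ["low"] * 5 + ["lower"] * 2  # hypothetical training words
    merges = learn_bpe(corpus, num_merges=2)
    pieces = encode("lowest", merges)
    # Assign each distinct piece an integer ID, as a tokenizer's vocabulary would.
    piece_to_id = {p: i for i, p in enumerate(sorted(set(pieces)))}
    ids = [piece_to_id[p] for p in pieces]
    print(pieces, ids)
```

A production tokenizer handles whitespace, unknown characters, and vocabulary size limits that this sketch leaves out, which is the kind of control the package is described as providing within R.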

text-analysis natural-language-processing data-preparation computational-linguistics
No Package · No Dependents
Maintenance 10 / 25
Adoption 7 / 25
Maturity 16 / 25
Community 16 / 25


Stars: 28
Forks: 6
Language: C++
License: MPL-2.0
Last pushed: Feb 09, 2026
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/nlp/bnosac/sentencepiece"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.