dongjinleekr/beanpiece
A Java binding to Google SentencePiece
This is a tool for developers who work with text processing in Java. It helps integrate Google's SentencePiece tokenizer into Java applications. Developers can use it to break down raw text into meaningful subword units and reconstruct text from those units within their Java projects.
No commits in the last 6 months.
Use this if you are a Java developer needing to implement robust subword tokenization and detokenization in your applications.
Not ideal if you work outside Java, or if you need SentencePiece on Windows or macOS and do not want to compile the native libraries yourself.
Stars: 7
Forks: not listed
Language: C++
License: Apache-2.0
Category: not listed
Last pushed: Jun 28, 2018
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/dongjinleekr/beanpiece"
Open to everyone: 100 requests/day, no key needed. A free key raises the limit to 1,000/day.
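For Java callers, the same request can be sketched with the JDK's built-in HTTP client (Java 11+). This is a minimal sketch, not a documented client: the class name `QualityApiExample` and the helper `qualityUri` are hypothetical, and the path layout (`nlp`, then owner, then repository) is inferred only from the single curl example above. The response format is not described on this page, so the body is returned as an uninterpreted string.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class QualityApiExample {
    // Base URL copied from the curl example; everything after it is an assumed layout.
    static final String BASE = "https://pt-edge.onrender.com/api/v1/quality";

    // Hypothetical helper: joins the three path segments seen in the curl example.
    static URI qualityUri(String category, String owner, String repo) {
        return URI.create(BASE + "/" + category + "/" + owner + "/" + repo);
    }

    // Sends a plain GET with no API key and returns the raw response body.
    static String fetch(URI uri) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected HTTP status: " + response.statusCode());
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        // Each run counts against the keyless daily request limit.
        System.out.println(fetch(qualityUri("nlp", "dongjinleekr", "beanpiece")));
    }
}
```

Keeping URL construction separate from the network call makes the assumed path layout easy to adjust if the real API differs.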
Higher-rated alternatives
google/sentencepiece
Unsupervised text tokenizer for Neural Network-based text generation.
daac-tools/vibrato
🎤 vibrato: Viterbi-based accelerated tokenizer
OpenNMT/Tokenizer
Fast and customizable text tokenization library with BPE and SentencePiece support
Systemcluster/kitoken
Fast and versatile tokenizer for language models, compatible with SentencePiece, Tokenizers,...
daac-tools/vaporetto
🛥 Vaporetto: Very accelerated pointwise prediction based tokenizer