levyfan/sentencepiece-jni
Java JNI wrapper for SentencePiece: unsupervised text tokenizer for Neural Network-based text generation.
This project helps Java developers integrate SentencePiece, an unsupervised text tokenizer, into their applications. It takes raw text and converts it into numerical IDs or subword pieces, which are then used as input for neural network-based text generation models. AI/ML engineers and data scientists working with Java will find this useful for preparing text data.
No commits in the last 6 months.
Use this if you are a Java developer building applications that require high-performance text tokenization for machine learning models, particularly for text generation.
Not ideal if you work outside Java, or if your text processing does not involve machine learning models and subword tokenization.
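To make "raw text in, subword pieces or numerical IDs out" concrete, here is a minimal Java sketch of the idea. It is not this project's API and not SentencePiece's actual algorithm (SentencePiece learns its vocabulary from data and segments with a unigram language model or BPE). The class `ToySubwordTokenizer`, its methods, and the hand-written vocabulary are all hypothetical; greedy longest-match is swapped in only to keep the example short. The one convention borrowed from SentencePiece is marking whitespace with the U+2581 character.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy illustration only: hypothetical class, not the sentencepiece-jni API. */
public class ToySubwordTokenizer {
    private static final String UNK = "<unk>";
    private static final char SPACE_MARK = '\u2581'; // whitespace marker, as in SentencePiece

    // Piece -> ID. ID 0 is reserved for unknown input.
    private final Map<String, Integer> vocab = new LinkedHashMap<>();

    public ToySubwordTokenizer(List<String> pieces) {
        vocab.put(UNK, 0);
        for (String piece : pieces) {
            vocab.putIfAbsent(piece, vocab.size());
        }
    }

    /** Splits text into subword pieces by greedy longest-match against the vocabulary. */
    public List<String> toPieces(String text) {
        String s = SPACE_MARK + text.trim().replace(' ', SPACE_MARK);
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            int end = s.length();
            // Shrink the candidate until it is a known piece or becomes empty.
            while (end > i && !vocab.containsKey(s.substring(i, end))) {
                end--;
            }
            if (end == i) {
                out.add(UNK); // no piece starts here: emit <unk> for one char
                i++;
            } else {
                out.add(s.substring(i, end));
                i = end;
            }
        }
        return out;
    }

    /** Maps each piece to its integer ID, the form a model would consume. */
    public int[] toIds(String text) {
        List<String> pieces = toPieces(text);
        int[] ids = new int[pieces.size()];
        for (int k = 0; k < ids.length; k++) {
            ids[k] = vocab.get(pieces.get(k));
        }
        return ids;
    }

    public static void main(String[] args) {
        ToySubwordTokenizer tok = new ToySubwordTokenizer(
                List.of("\u2581hello", "\u2581wor", "ld", "\u2581"));
        System.out.println(tok.toPieces("hello world"));
        System.out.println(java.util.Arrays.toString(tok.toIds("hello world"))); // [1, 2, 3]
    }
}
```

With a real SentencePiece model, the vocabulary and segmentation come from a trained model file rather than a hand-written list, and the native library does the work through JNI.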
Stars: 38
Forks: 14
Language: C++
License: MIT
Category:
Last pushed: Jan 16, 2023
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/levyfan/sentencepiece-jni"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
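For Java callers, the same request can be sketched with the standard `java.net.http` client (Java 11+). The class name `QualityApiClient` and its methods are hypothetical, and because the response format is not described here, the sketch returns the body as a raw string instead of parsing it.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

/** Hypothetical helper: issues the same GET request as the curl command. */
public class QualityApiClient {
    // URL copied from the curl example; no API key is attached.
    static final String URL =
            "https://pt-edge.onrender.com/api/v1/quality/nlp/levyfan/sentencepiece-jni";

    /** Builds the GET request without sending it. */
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(10))
                .GET()
                .build();
    }

    /** Sends the request and returns the raw response body. */
    static String fetch(String url) throws IOException, InterruptedException {
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(buildRequest(url), HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            // Assumption: non-200 covers errors such as exceeding the daily limit.
            throw new IOException("Unexpected HTTP status: " + response.statusCode());
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(fetch(URL));
    }
}
```

Keeping request construction separate from sending makes the URL and method easy to check without a network call.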
Higher-rated alternatives
google/sentencepiece
Unsupervised text tokenizer for Neural Network-based text generation.
daac-tools/vibrato
🎤 vibrato: Viterbi-based accelerated tokenizer
OpenNMT/Tokenizer
Fast and customizable text tokenization library with BPE and SentencePiece support
Systemcluster/kitoken
Fast and versatile tokenizer for language models, compatible with SentencePiece, Tokenizers,...
daac-tools/vaporetto
🛥 Vaporetto: Very accelerated pointwise prediction based tokenizer