hscspring/bytepiece-rs
The Bytepiece Tokenizer Implemented in Rust.
This tool helps you convert text into numerical tokens for machine learning models, ensuring consistent processing across different languages. You input raw text in any language and receive a sequence of numbers representing that text. It's ideal for data scientists, natural language processing engineers, or researchers building multilingual text analysis systems.
No commits in the last 6 months.
Use this if you need a language-agnostic way to prepare text data for AI models, especially when dealing with diverse languages or code.
Not ideal if you require a tokenizer specifically optimized for a single language with advanced linguistic features or morphology.
Stars
14
Forks
—
Language
Rust
License
—
Category
Last pushed
Nov 28, 2023
Monthly downloads
33
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/hscspring/bytepiece-rs"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
google/sentencepiece
Unsupervised text tokenizer for Neural Network-based text generation.
daac-tools/vibrato
🎤 vibrato: Viterbi-based accelerated tokenizer
OpenNMT/Tokenizer
Fast and customizable text tokenization library with BPE and SentencePiece support
Systemcluster/kitoken
Fast and versatile tokenizer for language models, compatible with SentencePiece, Tokenizers,...
daac-tools/vaporetto
🛥 Vaporetto: Very accelerated pointwise prediction based tokenizer