Systemcluster/kitoken
Fast and versatile tokenizer for language models, compatible with SentencePiece, Tokenizers, Tiktoken and more. Supports BPE, Unigram and WordPiece tokenization in JavaScript, Python and Rust.
This tool helps language model developers and researchers convert and use tokenizers from popular formats such as SentencePiece, HuggingFace Tokenizers, and OpenAI Tiktoken. It loads existing tokenizer model files and uses them to encode text into tokens for language models and to decode tokens back into text. It is aimed at users who build or experiment with custom language models and need fast, compatible tokenization across different environments.
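To make the encode and decode round trip described above concrete, here is a minimal, self-contained sketch of BPE-style tokenization in Python. It does not use Kitoken's API or reflect its implementation; the merge table `MERGES`, the helper names, and the id assignment are all hypothetical and chosen only for illustration.

```python
# Toy BPE round trip. MERGES and the resulting vocabulary are hypothetical,
# not taken from any real tokenizer model file.
MERGES = [("l", "o"), ("lo", "w"), ("e", "r")]  # ordered by priority


def build_vocab(merges):
    """Assign ids: single characters first (sorted), then merged symbols."""
    symbols = sorted({ch for pair in merges for part in pair for ch in part})
    symbols += [a + b for a, b in merges]
    return {sym: idx for idx, sym in enumerate(symbols)}


def encode(text, merges, vocab):
    """Split text into characters, then apply each merge rule in order."""
    symbols = list(text)
    for a, b in merges:
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                merged.append(a + b)  # fuse the adjacent pair into one symbol
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    # Characters outside the toy vocabulary raise KeyError in this sketch.
    return [vocab[s] for s in symbols]


def decode(ids, vocab):
    """Map ids back to symbols and concatenate them."""
    inverse = {idx: sym for sym, idx in vocab.items()}
    return "".join(inverse[i] for i in ids)


VOCAB = build_vocab(MERGES)
ids = encode("lower", MERGES, VOCAB)
assert decode(ids, VOCAB) == "lower"  # decoding restores the input text
```

A real tokenizer works on bytes or normalized text and loads its merges and vocabulary from a model file; the sketch only shows the shape of the text-to-ids and ids-to-text mapping.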
46 stars and 5,322 monthly downloads. Available on PyPI and npm.
Use this if you need a high-performance, versatile tokenizer for your language models that is compatible with widely used formats and can run in Python, JavaScript, or Rust.
Not ideal if you are an end-user of an existing language model and do not need to customize or integrate tokenization at a programmatic level.
Stars: 46
Forks: 1
Language: Rust
License: BSD-2-Clause
Category:
Last pushed: Mar 10, 2026
Monthly downloads: 5,322
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/Systemcluster/kitoken"
Open to everyone: 100 requests/day with no key needed, or 1,000 requests/day with a free key.
Related tools
google/sentencepiece
Unsupervised text tokenizer for Neural Network-based text generation.
daac-tools/vibrato
🎤 vibrato: Viterbi-based accelerated tokenizer
OpenNMT/Tokenizer
Fast and customizable text tokenization library with BPE and SentencePiece support
daac-tools/vaporetto
🛥 Vaporetto: Very accelerated pointwise prediction based tokenizer
soaxelbrooke/python-bpe
Byte Pair Encoding for Python!