Systemcluster/kitoken

Fast and versatile tokenizer for language models, compatible with SentencePiece, Tokenizers, Tiktoken and more. Supports BPE, Unigram and WordPiece tokenization in JavaScript, Python and Rust.

Score: 55 / 100 (Established)

This tool helps language model developers and researchers convert and use tokenizers from popular formats such as SentencePiece, HuggingFace Tokenizers, and OpenAI Tiktoken. It loads existing tokenizer model files, encodes text into tokens for language models, and decodes tokens back into text. It is useful for anyone who builds or experiments with custom language models and needs fast, compatible tokenization across different environments.

46 stars and 5,322 monthly downloads. Available on PyPI and npm.

Use this if you need a high-performance, versatile tokenizer for your language models that is compatible with widely used formats and can run in Python, JavaScript, or Rust.

Not ideal if you are an end user of an existing language model and do not need to customize or integrate tokenization in your own code.

Tags: language-model-development, NLP-research, text-processing, AI-engineering, tokenizer-compatibility
No Dependents
Maintenance 10 / 25
Adoption 17 / 25
Maturity 25 / 25
Community 3 / 25
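
The four category scores above, each out of 25, appear to add up to the overall 55 / 100. The page does not state the formula, so the plain sum in this sketch is an assumption, checked only against the numbers shown here; the names `category_scores` and `overall_score` are illustrative.

```python
# Category scores as listed on this page (each out of 25).
category_scores = {
    "Maintenance": 10,
    "Adoption": 17,
    "Maturity": 25,
    "Community": 3,
}


def overall_score(scores):
    """Assumed aggregation: a plain, unweighted sum of the category scores."""
    return sum(scores.values())


total = overall_score(category_scores)
maximum = 25 * len(category_scores)
print(f"{total} / {maximum}")  # prints "55 / 100"
```

If the real scoring uses weights or rounding, this sum would only coincide with the published total for this particular repository.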


Stars: 46
Forks: 1
Language: Rust
License: BSD-2-Clause
Last pushed: Mar 10, 2026
Monthly downloads: 5,322
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/nlp/Systemcluster/kitoken"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
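
The same request can be made from Python with only the standard library. This is a minimal sketch: the `category/owner/repo` path layout is inferred from the single URL above, and the response is assumed to be JSON, which the page does not confirm. The helper names `quality_url` and `fetch_quality` are hypothetical.

```python
import json
import urllib.request

# Base of the endpoint shown in the curl command above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(category: str, repo: str) -> str:
    """Build the endpoint URL (path layout inferred from the one example)."""
    return f"{BASE_URL}/{category}/{repo}"


def fetch_quality(category: str, repo: str, timeout: float = 10.0):
    """Fetch the record. Assumes a JSON body, which is not confirmed."""
    url = quality_url(category, repo)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Building the URL needs no network access.
print(quality_url("nlp", "Systemcluster/kitoken"))
```

Calling `fetch_quality("nlp", "Systemcluster/kitoken")` performs the actual request; without a key it counts against the 100 requests/day limit.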