Systemcluster/kitoken

Fast and versatile tokenizer for language models, compatible with SentencePiece, Tokenizers, Tiktoken and more. Supports BPE, Unigram and WordPiece tokenization in JavaScript, Python and Rust.

Established

This tool helps language model developers and researchers convert and use tokenizers from popular formats such as SentencePiece, HuggingFace Tokenizers, and OpenAI Tiktoken. It loads existing tokenizer model files, encodes text into tokens for language models, and decodes tokens back into text. It is useful for anyone who builds or experiments with custom language models and needs fast, compatible tokenization across different environments.
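To make the text-to-tokens-and-back round trip concrete, here is a minimal sketch of a simplified BPE encoder and decoder in plain Python. It does not use kitoken's API: the merge table, the vocabulary, and the function names (`build_vocab`, `encode`, `decode`) are all hypothetical and chosen for illustration only. Real tokenizers load these tables from a model file.

```python
from typing import Dict, List, Tuple

# Hypothetical merge rules in priority order (earlier rules are applied first).
# These are made up for illustration and do not come from any real model.
MERGES: List[Tuple[str, str]] = [("l", "o"), ("lo", "w"), ("e", "r")]


def build_vocab(alphabet: str, merges: List[Tuple[str, str]]) -> Dict[str, int]:
    """Assign ids to single characters first, then to merged symbols."""
    symbols = sorted(set(alphabet)) + [left + right for left, right in merges]
    return {symbol: index for index, symbol in enumerate(symbols)}


def encode(text: str, merges: List[Tuple[str, str]], vocab: Dict[str, int]) -> List[int]:
    """Split text into characters, apply each merge rule in order, map to ids."""
    symbols = list(text)
    for left, right in merges:
        merged: List[str] = []
        i = 0
        while i < len(symbols):
            # Join an adjacent (left, right) pair into a single symbol.
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return [vocab[symbol] for symbol in symbols]


def decode(ids: List[int], vocab: Dict[str, int]) -> str:
    """Map ids back to symbols and concatenate them."""
    inverse = {index: symbol for symbol, index in vocab.items()}
    return "".join(inverse[i] for i in ids)


VOCAB = build_vocab("lower ", MERGES)
ids = encode("lower low", MERGES, VOCAB)
text = decode(ids, VOCAB)
print(ids, repr(text))
```

Decoding is lossless here because every id maps to exactly one string of characters. Applying each merge rule once in priority order is a simplification of how production BPE implementations select merges.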

46 stars and 5,322 monthly downloads. Available on PyPI and npm.

Use this if you need a high-performance, versatile tokenizer for your language models that is compatible with widely used formats and can run in Python, JavaScript, or Rust.

Not ideal if you are an end-user of an existing language model and do not need to customize or integrate tokenization at a programmatic level.

language-model-development NLP-research text-processing AI-engineering tokenizer-compatibility

No Dependents

Maintenance 10 / 25

Adoption 17 / 25

Maturity 25 / 25

Community 3 / 25

Language: Rust

License: BSD-2-Clause

Related tools

google/sentencepiece

Unsupervised text tokenizer for Neural Network-based text generation.

daac-tools/vibrato

🎤 vibrato: Viterbi-based accelerated tokenizer

OpenNMT/Tokenizer

Fast and customizable text tokenization library with BPE and SentencePiece support

daac-tools/vaporetto

🛥 Vaporetto: Very accelerated pointwise prediction based tokenizer

soaxelbrooke/python-bpe

Byte Pair Encoding for Python!
