yenniejun/tokenizers-languages

Comparing LLM tokenizers in multiple languages

Experimental

This tool helps researchers, linguists, and AI practitioners see how the tokenizers of large language models (LLMs) split text into tokens across different languages. You enter text in several languages, and it shows how each tokenizer processes that text, highlighting how many tokens each language requires. This matters for anyone working with multilingual LLMs: a language that needs more tokens for the same content costs more to process and fits less text into a fixed context window.
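The comparison described above can be sketched in a few lines. This is a minimal illustration, not the repository's code: the function names, the sample strings, and the two stand-in tokenizers (one token per UTF-8 byte, one token per character) are all assumptions chosen so the example runs without downloading any model.

```python
from typing import Callable, Dict, List

# A tokenizer here is any function that maps a string to a list of tokens.
Tokenizer = Callable[[str], List[str]]


def utf8_byte_tokenizer(text: str) -> List[str]:
    """Stand-in tokenizer (assumption): one token per UTF-8 byte."""
    return [f"{byte:02x}" for byte in text.encode("utf-8")]


def character_tokenizer(text: str) -> List[str]:
    """Stand-in tokenizer (assumption): one token per Unicode code point."""
    return list(text)


def count_tokens(
    samples: Dict[str, str], tokenizers: Dict[str, Tokenizer]
) -> Dict[str, Dict[str, int]]:
    """Return {language: {tokenizer_name: token_count}} for every pairing."""
    return {
        language: {name: len(tokenize(text)) for name, tokenize in tokenizers.items()}
        for language, text in samples.items()
    }


def relative_to(
    counts: Dict[str, Dict[str, int]], baseline: str
) -> Dict[str, Dict[str, float]]:
    """Express each token count as a multiple of the baseline language's count."""
    base = counts[baseline]
    return {
        language: {name: n / base[name] for name, n in per_tokenizer.items()}
        for language, per_tokenizer in counts.items()
    }


# Hypothetical sample data: the same greeting in three languages.
SAMPLES = {"English": "Hello", "Chinese": "你好", "Russian": "Привет"}
TOKENIZERS = {"utf8_bytes": utf8_byte_tokenizer, "characters": character_tokenizer}

if __name__ == "__main__":
    counts = count_tokens(SAMPLES, TOKENIZERS)
    ratios = relative_to(counts, baseline="English")
    for language in SAMPLES:
        print(language, counts[language], ratios[language])
```

To compare real LLM tokenizers, any callable with the same string-to-list shape could replace the stand-ins, for example the `tokenize` method of a tokenizer loaded with the Hugging Face `transformers` library, assuming that library and the model files are available.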

No commits in the last 6 months.

Use this if you are developing or evaluating large language models and need to understand how text is tokenized across diverse languages, especially non-English ones.

Not ideal if you are looking for a tool to translate text or analyze the grammatical structure of sentences, as its focus is specifically on tokenization efficiency.

natural-language-processing linguistics AI-model-evaluation multilingual-AI language-technology

Stale 6m, No Package, No Dependents

Maintenance 0 / 25

Adoption 4 / 25

Maturity 16 / 25

Community 0 / 25

Language

Python

Higher-rated alternatives

google/sentencepiece

Unsupervised text tokenizer for Neural Network-based text generation.

daac-tools/vibrato

🎤 vibrato: Viterbi-based accelerated tokenizer

OpenNMT/Tokenizer

Fast and customizable text tokenization library with BPE and SentencePiece support

Systemcluster/kitoken

Fast and versatile tokenizer for language models, compatible with SentencePiece, Tokenizers,...

soaxelbrooke/python-bpe

Byte Pair Encoding for Python!
