AndersonBY/deepseek-tokenizer

DeepSeek Tokenizer is an efficient, lightweight tokenization library with no third-party runtime dependencies.


Emerging

This library converts human-readable text into the numerical token IDs that language models consume. You pass in a string of text and get back a list of the corresponding token IDs. It is aimed at AI developers and machine learning engineers who are preparing text data for training or inference with large language models.
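To make the text-in, token-IDs-out idea concrete, here is a minimal sketch in plain Python. It is a toy byte-level scheme written only for illustration: the `encode` and `decode` functions below are hypothetical helpers, not this library's API, and the IDs they produce are raw UTF-8 byte values rather than entries in the DeepSeek vocabulary.

```python
# Toy illustration only: a byte-level "tokenizer" where each UTF-8 byte
# is its own token ID. Real tokenizers use a learned vocabulary in which
# one ID usually covers several characters.

def encode(text: str) -> list[int]:
    """Map a string to a list of integer token IDs (one per UTF-8 byte)."""
    return list(text.encode("utf-8"))


def decode(ids: list[int]) -> str:
    """Map a list of byte-level token IDs back to the original string."""
    return bytes(ids).decode("utf-8")


if __name__ == "__main__":
    ids = encode("Hi!")
    print(ids)          # [72, 105, 33]
    print(decode(ids))  # Hi!
```

A production tokenizer follows the same contract (string in, list of integers out, and a reversible decode), but the mapping comes from the model's trained vocabulary, so the IDs and the token count for a given string will differ from this sketch.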

Available on PyPI.

Use this if you are a developer working with AI models and need a fast, lightweight way to tokenize text without adding many other software dependencies to your project.

Not ideal if you are an end-user looking for a tool to analyze text or generate content, as this is a technical component for building AI systems.

Tags: AI-development, natural-language-processing, machine-learning-engineering, text-pre-processing, language-model-preparation

No License No Dependents

Maintenance 10 / 25

Adoption 8 / 25

Maturity 17 / 25

Community 14 / 25


Language: Python

License: none listed

Related tools

cahya-wirawan/rwkv-tokenizer

A fast RWKV Tokenizer written in Rust

dakofler/simple_tokenizers

Tokenizers is a collection of tokenization implementations focused on transparency and readability

SiriPrathikantam/custom-tokenize

A Python tokenizer project with regex, vocab building, and logging.
