tokenizers and language-tokenizer

These are competitors: Hugging Face's tokenizers library is a production-grade, widely adopted implementation that handles state-of-the-art tokenization across multiple languages, while language-tokenizer is a small alternative with similar goals but little adoption and no recent maintenance.

tokenizers: 90 (Verified)
Maintenance 20/25 · Adoption 25/25 · Maturity 25/25 · Community 20/25
Stars: 10,520 · Forks: 1,051 · Downloads: 1,504,044 · Commits (30d): 45
Language: Rust · License: Apache-2.0
No risk flags

language-tokenizer: 21 (Experimental)
Maintenance 10/25 · Adoption 0/25 · Maturity 11/25 · Community 0/25
Stars: · Forks: · Downloads: · Commits (30d): 0
Language: Rust · License: WTFPL
Risk flags: No Package, No Dependents

About tokenizers

huggingface/tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production

When working with large volumes of text for natural language processing, this tool converts raw text into a format that machine learning models can understand. It takes raw text documents as input and produces a vocabulary and token IDs: numerical representations of the words or sub-word units in the text. This step is essential for AI researchers and machine learning engineers building or fine-tuning language models.

natural-language-processing machine-learning-engineering text-pre-processing AI-model-training
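As a rough illustration of that workflow, the sketch below trains a tiny BPE tokenizer with the library's Python bindings and encodes a sentence into tokens and IDs. It assumes `pip install tokenizers`; the toy corpus and the `vocab_size` value are made up for the example, not taken from this page.

```python
# Minimal sketch: train a small BPE tokenizer and encode text.
# Assumes `pip install tokenizers`; corpus and vocab_size are illustrative.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Build a BPE model with an unknown-token fallback.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

# Learn a (very small) vocabulary from an in-memory corpus.
trainer = trainers.BpeTrainer(vocab_size=200, special_tokens=["[UNK]"])
corpus = [
    "Tokenizers convert raw text into numerical token ids.",
    "Language models consume token ids, not raw characters.",
]
tokenizer.train_from_iterator(corpus, trainer=trainer)

# Encode new text: tokens are sub-word strings, ids are their
# numerical positions in the learned vocabulary.
encoding = tokenizer.encode("raw text into token ids")
print(encoding.tokens)
print(encoding.ids)
```

In a real pipeline you would train on a large corpus (or load a pretrained tokenizer) and feed the resulting IDs to a model; the structure, train a model on text, then encode, stays the same.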

About language-tokenizer

mazebrr/language-tokenizer

🧩 Tokenize text efficiently across multiple languages using our robust library, combining Unicode and NLP techniques for accurate text analysis.

Scores updated daily from GitHub, PyPI, and npm data.