guillaume-be/rust-tokenizers

rust-tokenizers offers high-performance tokenizers for modern language models, including WordPiece, Byte-Pair Encoding (BPE), and Unigram (SentencePiece) models.

Quality score: 61 / 100 (Established)

This is a high-performance library that helps developers prepare text for use with large language models, such as BERT, GPT, and RoBERTa. It takes raw text input and converts it into numerical tokens, which are then fed into machine learning models. The primary users are developers building applications that process natural language, such as chatbots, sentiment analysis tools, or machine translation systems.

336 stars and 8,112 monthly downloads.

Use this if you are a developer working with Rust or Python and need to efficiently tokenize text for modern language models such as BERT or GPT, using schemes like WordPiece, BPE, or SentencePiece.

Not ideal if you are a non-developer looking for a ready-to-use text-analysis application, or if you don't work with language models.

natural-language-processing machine-learning-engineering text-pre-processing AI-development computational-linguistics
No package · No dependents
Maintenance 10 / 25
Adoption 19 / 25
Maturity 16 / 25
Community 16 / 25

How are scores calculated?

Stars

336

Forks

33

Language

Rust

License

Apache-2.0

Last pushed

Jan 22, 2026

Monthly downloads

8,112

Commits (30d)

0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/nlp/guillaume-be/rust-tokenizers"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
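The same endpoint can be queried from Python. A minimal sketch, assuming only the URL scheme shown in the curl command above (the shape of the JSON response is not documented here, so the example just fetches and prints the raw payload):

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

BASE = "https://pt-edge.onrender.com/api/v1/quality"

def quality_url(category: str, repo: str) -> str:
    # Build the quality endpoint URL for a repo, e.g.
    # category="nlp", repo="guillaume-be/rust-tokenizers".
    # quote() escapes unsafe characters; safe="/" keeps the
    # owner/name slash intact.
    return f"{BASE}/{quote(category)}/{quote(repo, safe='/')}"

url = quality_url("nlp", "guillaume-be/rust-tokenizers")
print(url)

# Uncomment to fetch live data (no API key needed, 100 requests/day):
# with urlopen(url) as resp:
#     data = json.load(resp)
#     print(json.dumps(data, indent=2))
```

The live request is left commented out so the snippet runs offline; swap in your own category and repo slug as needed.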