ml-rust/splintr
A high-performance tokenizer (BPE + SentencePiece) built in Rust with Python bindings, focused on speed, safety, and resource efficiency.
This tool helps AI engineers and machine learning practitioners quickly convert large volumes of text into tokens and back. It takes raw text inputs such as prompts, documents, or training data and outputs the numerical token IDs that large language models (LLMs) operate on. It is ideal for anyone working with LLMs who needs to prepare data efficiently or process model outputs in real time.
Use this if you are an AI engineer or ML practitioner building LLM applications, training models, or processing large text datasets and need significantly faster tokenization than existing Python-based solutions.
Not ideal if you work with very small, infrequent text inputs or if tokenization speed is not a bottleneck in your workflow.
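To make the description above concrete, here is a minimal, self-contained sketch of byte-level BPE training, the core idea behind tokenizers like this one. This is an illustrative toy, not splintr's actual implementation or API: the function names and the byte-level starting vocabulary are assumptions for the example.

```python
from collections import Counter

def most_frequent_pair(ids):
    """Count adjacent id pairs and return the most common one (None if too short)."""
    pairs = Counter(zip(ids, ids[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def bpe_train(text, num_merges):
    """Learn up to `num_merges` merge rules over the UTF-8 bytes of `text`.

    Starts from the 256 raw byte values and repeatedly fuses the most
    frequent adjacent pair into a new token id, shrinking the sequence.
    """
    ids = list(text.encode("utf-8"))
    merges = {}
    for new_id in range(256, 256 + num_merges):
        pair = most_frequent_pair(ids)
        if pair is None:
            break
        merges[pair] = new_id
        ids = merge(ids, pair, new_id)
    return ids, merges
```

For example, one merge pass over `"aaabdaaabac"` fuses the most frequent byte pair `(97, 97)` (i.e. `"aa"`) into a single new token, shrinking 11 bytes to 9 tokens. A production tokenizer applies thousands of such learned merges, in Rust, across many documents in parallel.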
Stars
57
Forks
5
Language
Python
License
MIT
Category
Last pushed
Mar 12, 2026
Monthly downloads
130
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/ml-rust/splintr"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
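The same endpoint can be queried from Python with the standard library. A minimal sketch, assuming the endpoint returns a JSON object; the field names used in `summarize` (`name`, `stars`) are assumptions for illustration, since the response schema is not documented here.

```python
import json
from urllib.request import urlopen

API_URL = "https://pt-edge.onrender.com/api/v1/quality/nlp/ml-rust/splintr"

def fetch_quality(url: str = API_URL) -> dict:
    """Fetch the quality record for the repo (100 requests/day without a key)."""
    with urlopen(url) as resp:
        return json.load(resp)

def summarize(record: dict) -> str:
    """Render a one-line summary; the JSON field names here are assumed."""
    return f"{record.get('name', '?')}: {record.get('stars', '?')} stars"
```

With a key (free, 1,000 requests/day), the same call would carry the key as an auth header or query parameter, per the API's own documentation.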
Higher-rated alternatives
georg-jung/FastBertTokenizer
Fast and memory-efficient library for WordPiece tokenization as it is used by BERT.
sanderland/script_tok
Code for the paper "BPE stays on SCRIPT"
ash-01xor/bpe.c
A simple byte-pair encoding mechanism for tokenization, written purely in C.
U4RASD/r-bpe
R-BPE: Improving BPE-Tokenizers with Token Reuse
jmaczan/bpe-tokenizer
Byte-Pair Encoding tokenizer for training large language models on huge datasets