gweidart/rs-bpe

A ridiculously fast Python BPE (Byte Pair Encoder) implementation written in Rust

Quality score: 44 / 100 (Emerging)

When working with Large Language Models, it's often critical to count and manage text precisely by tokens to fit within API limits or to process text efficiently. This tool accurately splits long texts into chunks of a specific token count, incrementally tracks token usage as text is appended, and quickly reports the token count for any section. It's designed for data scientists, machine learning engineers, and developers building LLM applications who need robust, fast text processing.
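To illustrate the chunking workflow described above — this is a minimal conceptual sketch, not rs-bpe's actual API (which isn't shown in this listing), with a whitespace split standing in for real BPE tokenization:

```python
# Sketch of token-boundary chunking. A whitespace split stands in for
# BPE tokenization; rs-bpe would produce BPE token ids instead.

def tokenize(text: str) -> list[str]:
    # Stand-in tokenizer (assumption): real BPE yields integer token ids.
    return text.split()

def chunk_by_tokens(text: str, max_tokens: int) -> list[str]:
    """Split text into chunks containing at most max_tokens tokens each."""
    tokens = tokenize(text)
    return [
        " ".join(tokens[i : i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]

chunks = chunk_by_tokens("one two three four five", max_tokens=2)
# → ["one two", "three four", "five"]
```

The same loop structure applies with a real tokenizer: encode once, slice the token sequence at fixed boundaries, then decode each slice back to text.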

No commits in the last 6 months. Available on PyPI.

Use this if you are building applications with Large Language Models and need a highly performant and accurate way to count, split, and manage text based on token boundaries, especially for long or incrementally growing texts.

Not ideal if your application doesn't involve tokenization for LLMs, or if you only need very basic, infrequent text processing where extreme speed and token boundary precision aren't critical.

Large-Language-Models NLP-Engineering Text-Chunking Tokenization AI-Application-Development
Stale (6 months) · No dependents
Maintenance 0 / 25
Adoption 7 / 25
Maturity 25 / 25
Community 12 / 25


Stars: 37
Forks: 5
Language: Python
License: MIT
Category: bpe-tokenizers
Last pushed: Mar 19, 2025
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/embeddings/gweidart/rs-bpe"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.