gweidart/rs-bpe
A ridiculously fast Python BPE (Byte Pair Encoder) implementation written in Rust
When working with Large Language Models, you often need to count and manage text in tokens precisely, whether to stay within API limits or to process text efficiently. This tool splits long texts into chunks of a specific token count, tracks token usage incrementally as you add more text, and quickly finds the token count for any section. It is designed for data scientists, machine learning engineers, and developers building LLM applications who need robust, fast text processing.
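To make the chunking and incremental-counting ideas concrete, here is a minimal sketch in plain Python. It does not use the rs-bpe API, which this page does not document. The names `toy_tokenize`, `chunk_by_tokens`, and `RunningTokenCount` are hypothetical, and the whitespace tokenizer is only a stand-in for a real BPE encoder.

```python
from typing import Callable, List

Tokenizer = Callable[[str], List[str]]


def toy_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in tokenizer: splits on whitespace.

    A real BPE encoder would return subword token ids instead.
    """
    return text.split()


def chunk_by_tokens(
    text: str, max_tokens: int, tokenize: Tokenizer = toy_tokenize
) -> List[List[str]]:
    """Split text into consecutive chunks of at most max_tokens tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    tokens = tokenize(text)
    return [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]


class RunningTokenCount:
    """Keeps a running token total as pieces of text are appended.

    Assumption: each appended piece is tokenized on its own. With a real
    BPE encoder (and even with this toy one) tokens can span the boundary
    between two pieces, so the sum may differ from tokenizing the
    concatenated text in one pass.
    """

    def __init__(self, tokenize: Tokenizer = toy_tokenize) -> None:
        self._tokenize = tokenize
        self.total = 0

    def append(self, text: str) -> int:
        """Add the token count of text and return the new running total."""
        self.total += len(self._tokenize(text))
        return self.total


if __name__ == "__main__":
    print(chunk_by_tokens("one two three four five", max_tokens=2))
    counter = RunningTokenCount()
    counter.append("hello world")
    print(counter.append("more text here"))
```

The boundary caveat in `RunningTokenCount` is the reason naive per-piece counting is only an approximation, and it is the kind of problem a dedicated incremental tokenizer is meant to handle.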
No commits in the last 6 months. Available on PyPI.
Use this if you are building applications with Large Language Models and need a highly performant and accurate way to count, split, and manage text based on token boundaries, especially for long or incrementally growing texts.
Not ideal if your application doesn't involve tokenization for LLMs, or if you only need very basic, infrequent text processing where extreme speed and token boundary precision aren't critical.
Stars: 37
Forks: 5
Language: Python
License: MIT
Category:
Last pushed: Mar 19, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/embeddings/gweidart/rs-bpe"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.