gweidart/rs-bpe

A ridiculously fast Python BPE (Byte Pair Encoder) implementation written in Rust

Quality score: 44 / 100 (Emerging)

When working with Large Language Models, it's often critical to count and manage text precisely by tokens to fit within API limits or to process text efficiently. This tool accurately splits long texts into chunks of a specific token count, incrementally tracks token usage as text is appended, and quickly reports the token count for any section. It's designed for data scientists, machine learning engineers, and developers building LLM applications who need robust, fast text processing.
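To illustrate the chunking workflow described above — this is a minimal conceptual sketch, not rs-bpe's actual API (which isn't shown in this listing), with a whitespace split standing in for real BPE tokenization:

```python
# Sketch of token-boundary chunking. A whitespace split stands in for
# BPE tokenization; rs-bpe would produce BPE token ids instead.

def tokenize(text: str) -> list[str]:
    # Stand-in tokenizer (assumption): real BPE yields integer token ids.
    return text.split()

def chunk_by_tokens(text: str, max_tokens: int) -> list[str]:
    """Split text into chunks containing at most max_tokens tokens each."""
    tokens = tokenize(text)
    return [
        " ".join(tokens[i : i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]

chunks = chunk_by_tokens("one two three four five", max_tokens=2)
# → ["one two", "three four", "five"]
```

The same loop structure applies with a real tokenizer: encode once, slice the token sequence at fixed boundaries, then decode each slice back to text.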

No commits in the last 6 months. Available on PyPI.

Use this if you are building applications with Large Language Models and need a highly performant and accurate way to count, split, and manage text based on token boundaries, especially for long or incrementally growing texts.

Not ideal if your application doesn't involve tokenization for LLMs, or if you only need very basic, infrequent text processing where extreme speed and token boundary precision aren't critical.

Large-Language-Models NLP-Engineering Text-Chunking Tokenization AI-Application-Development
Stale (6 months) · No dependents
Maintenance 0 / 25
Adoption 7 / 25
Maturity 25 / 25
Community 12 / 25


Stars: 37
Forks: 5
Language: Python
License: MIT
Category: bpe-tokenizers
Last pushed: Mar 19, 2025
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/embeddings/gweidart/rs-bpe"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.