soaxelbrooke/python-bpe
Byte Pair Encoding for Python!
This tool helps prepare large amounts of text for analysis by breaking it down into smaller, meaningful sub-word units. You provide a body of text, and it learns how to split words into common segments, outputting a vocabulary and a way to encode new text based on these segments. It's useful for data scientists or researchers working with natural language processing who need to efficiently process text for machine learning models.
232 stars. No commits in the last 6 months. Available on PyPI.
Use this if you need to pre-process large text datasets by tokenizing them into sub-word units for natural language processing tasks.
Not ideal if you need a production-ready solution for text tokenization, as other specialized libraries are recommended for that purpose.
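The "learn segments, then encode new text" idea described above can be sketched in a few lines of plain Python. This is an illustrative toy version of byte pair encoding, not python-bpe's own implementation or API; the function names (`learn_merges`, `merge_pair`, `encode_word`), the `</w>` end-of-word marker, and the sample corpus are all assumptions made for the example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_merges(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of whitespace-separated strings."""
    # Each word starts as a tuple of characters plus an end-of-word marker.
    words = Counter()
    for line in corpus:
        for word in line.split():
            words[tuple(word) + ("</w>",)] += 1

    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Pick the most frequent pair; ties are broken by the pair itself so runs are repeatable.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite the working vocabulary with the new merged symbol.
        merged_words = Counter()
        for symbols, freq in words.items():
            merged_words[merge_pair(symbols, best)] += freq
        words = merged_words
    return merges


def encode_word(word, merges):
    """Split a (possibly unseen) word into sub-word units by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical toy corpus, chosen only to make a few pairs clearly frequent.
corpus = ["low low low low low lower lower",
          "newest newest newest newest newest newest",
          "widest widest widest"]
merges = learn_merges(corpus, 10)
print(encode_word("lowest", merges))  # an unseen word, split into learned segments
```

The learned merge list plays the role of the vocabulary: frequent character sequences become single units, while rare or unseen words fall back to smaller pieces, so nothing is ever out of vocabulary.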
Stars: 232
Forks: 39
Language: Python
License: MIT
Category:
Last pushed: Sep 16, 2022
Commits (30d): 0
Dependencies: 6
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/soaxelbrooke/python-bpe"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
Related tools
google/sentencepiece
Unsupervised text tokenizer for Neural Network-based text generation.
daac-tools/vibrato
🎤 vibrato: Viterbi-based accelerated tokenizer
OpenNMT/Tokenizer
Fast and customizable text tokenization library with BPE and SentencePiece support
Systemcluster/kitoken
Fast and versatile tokenizer for language models, compatible with SentencePiece, Tokenizers,...
daac-tools/vaporetto
🛥 Vaporetto: Very accelerated pointwise prediction based tokenizer