chengchingwen/BytePairEncoding.jl
Julia implementation of Byte Pair Encoding for NLP
This package helps developers working with large language models break text into subword units, or 'tokens'. It takes raw text as input and outputs a list of tokens that can be fed into a model for training or analysis. Anyone building or fine-tuning Natural Language Processing models, especially those based on OpenAI's GPT series, may find it useful.
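The core idea of Byte Pair Encoding can be sketched in a few lines: repeatedly count adjacent symbol pairs across a corpus and merge the most frequent pair into a new token. The sketch below is a generic Python illustration of that algorithm, not this package's API (its Julia interface is not shown here); the corpus and merge count are made-up examples.

```python
from collections import Counter

def learn_bpe(corpus, num_merges):
    """Learn BPE merge rules from a list of words (generic illustration)."""
    # Represent each word as a tuple of symbols; start from single characters.
    vocab = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Merge the most frequent pair into a single new symbol.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = best[0] + best[1]
        new_vocab = Counter()
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

# Toy corpus: "lo" and then "low" are the most frequent adjacent pairs.
merges = learn_bpe(["low", "low", "lower", "lowest"], num_merges=2)
```

Trained merge rules like these are then applied in order to segment new text into tokens.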
No commits in the last 6 months.
Use this if you develop or integrate Natural Language Processing models in Julia and need to preprocess text with Byte Pair Encoding for tasks like text generation or understanding.
Not ideal if you don't work with text data or if your project doesn't use Julia for NLP.
Stars
27
Forks
4
Language
Julia
License
MIT
Category
Last pushed
Jun 15, 2024
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/chengchingwen/BytePairEncoding.jl"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
google/sentencepiece
Unsupervised text tokenizer for Neural Network-based text generation.
daac-tools/vibrato
🎤 vibrato: Viterbi-based accelerated tokenizer
OpenNMT/Tokenizer
Fast and customizable text tokenization library with BPE and SentencePiece support
Systemcluster/kitoken
Fast and versatile tokenizer for language models, compatible with SentencePiece, Tokenizers,...
soaxelbrooke/python-bpe
Byte Pair Encoding for Python!