BobMcDear/minbpe-hs
Byte-level byte pair encoding (BPE) in Haskell
This project implements byte-level Byte Pair Encoding (BPE) in Haskell for text tokenization and compression. It takes a raw text corpus as input and outputs a set of merge rules and a vocabulary for tokenization. Haskell developers can integrate it into their own applications, for example natural language processing pipelines, where tokenization and text compression are required steps.
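To make the input and output concrete, here is a minimal sketch of byte-level BPE training: the corpus is UTF-8 encoded, each byte becomes an initial token (0 to 255), and the most frequent adjacent pair is repeatedly merged into a new token id. This is not the minbpe-hs API. The names toTokens, pairCounts, mergePair, bestPair, and train are hypothetical, and the tie-breaking rule (smallest pair wins) is an assumption made for this illustration.

```haskell
import qualified Data.ByteString as BS
import qualified Data.Map.Strict as Map
import qualified Data.Text as T
import           Data.Text.Encoding (encodeUtf8)

-- Hypothetical types for this sketch; not taken from minbpe-hs.
type Token = Int
type Merge = ((Token, Token), Token)  -- (pair to merge, new token id)

-- UTF-8 encode the corpus and treat each byte as an initial token (0..255).
toTokens :: String -> [Token]
toTokens = map fromIntegral . BS.unpack . encodeUtf8 . T.pack

-- Count how often each adjacent pair of tokens occurs.
pairCounts :: [Token] -> Map.Map (Token, Token) Int
pairCounts ts = Map.fromListWith (+) [ (p, 1) | p <- zip ts (drop 1 ts) ]

-- Replace non-overlapping occurrences of the pair, scanning left to right.
mergePair :: (Token, Token) -> Token -> [Token] -> [Token]
mergePair (a, b) new = go
  where
    go (x : y : rest) | x == a && y == b = new : go rest
    go (x : rest) = x : go rest
    go [] = []

-- Most frequent pair. Assumption: ties go to the smallest pair, because the
-- fold visits keys in ascending order and only replaces on a strictly
-- larger count.
bestPair :: Map.Map (Token, Token) Int -> Maybe (Token, Token)
bestPair counts = fst <$> Map.foldlWithKey' step Nothing counts
  where
    step Nothing k v = Just (k, v)
    step best@(Just (_, bestCount)) k v
      | v > bestCount = Just (k, v)
      | otherwise     = best

-- Learn up to numMerges merge rules. Returns the rules in the order they
-- were learned and a vocabulary mapping each token id to its bytes.
train :: Int -> [Token] -> ([Merge], Map.Map Token BS.ByteString)
train numMerges tokens0 = go 0 tokens0 [] baseVocab
  where
    baseVocab = Map.fromList [ (i, BS.singleton (fromIntegral i)) | i <- [0 .. 255] ]
    go n tokens merges vocab
      | n >= numMerges = done
      | otherwise =
          case bestPair (pairCounts tokens) of
            Nothing -> done  -- fewer than two tokens left, nothing to merge
            Just pair@(a, b) ->
              let new    = 256 + n
                  vocab' = Map.insert new (vocab Map.! a <> vocab Map.! b) vocab
              in go (n + 1) (mergePair pair new tokens) ((pair, new) : merges) vocab'
      where
        done = (reverse merges, vocab)
```

For example, train 3 (toTokens "aaabdaaabac") first merges the pair (97, 97), the bytes for "aa", into token 256. The sketch recounts pairs over the whole token list after every merge, which keeps it short but makes it quadratic in the worst case; a production implementation would likely update counts incrementally.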
No commits in the last 6 months.
Use this if you are a Haskell developer looking for a functional and performant implementation of byte-level Byte Pair Encoding for text tokenization and compression.
Not ideal if your input text contains non-ASCII characters and you need exact compatibility with Python's regex-based BPE tokenizers.
Stars: 17
Forks: 1
Language: Haskell
License: MIT
Category:
Last pushed: May 27, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/BobMcDear/minbpe-hs"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
eliben/go-sentencepiece: Go implementation of the SentencePiece tokenizer
sefineh-ai/Amharic-Tokenizer: Syllable-aware BPE tokenizer for the Amharic language (አማርኛ) – fast, accurate, trainable.
mdabir1203/BPE_Tokenizer_Visualizer: A Visualizer to check how BPE Tokenizer in an LLM Works
franciszekparma/GBPET: GPT-style language model with Byte Pair Encoding tokenizer, built from scratch in PyTorch.
sajjadh47/bpe-encoder-php: BPE (Byte-Pair Encoding) Encoder Decoder for OpenAI's GPT-2 / GPT-3 Implemented In Pure PHP,...