deepanprabhu/fastbpe
Java library implementing Byte-Pair Encoding Tokenization
This library helps natural language processing (NLP) developers efficiently prepare large volumes of text for machine learning models. It takes raw text as input and splits it into subword units using Byte-Pair Encoding (BPE), which lets a model represent rare words as sequences of more common subwords while keeping the vocabulary at a fixed, manageable size. It is aimed at NLP engineers and researchers building machine translation, text summarization, or other language models.
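To make the subword idea above concrete, here is a minimal sketch of BPE in plain Java. It is not the fastbpe API: the class `BpeSketch` and its methods are hypothetical names chosen for illustration, and the code favors readability over the speed a library built for gigabyte-scale input would need. It repeatedly merges the most frequent adjacent symbol pair in a whitespace-split corpus, then replays the learned merges to segment a new word.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: NOT the fastbpe API. All names here are hypothetical.
class BpeSketch {

    // Split a word into single-character symbols.
    static List<String> toSymbols(String word) {
        List<String> symbols = new ArrayList<>();
        for (char c : word.toCharArray()) symbols.add(String.valueOf(c));
        return symbols;
    }

    // Learn up to numMerges merge rules from a whitespace-separated corpus.
    static List<List<String>> learnMerges(String corpus, int numMerges) {
        List<List<String>> words = new ArrayList<>();
        for (String w : corpus.trim().split("\\s+")) words.add(toSymbols(w));

        List<List<String>> merges = new ArrayList<>();
        for (int i = 0; i < numMerges; i++) {
            // Count adjacent pairs; LinkedHashMap makes ties resolve to the first pair seen.
            Map<List<String>, Integer> counts = new LinkedHashMap<>();
            for (List<String> w : words) {
                for (int j = 0; j + 1 < w.size(); j++) {
                    counts.merge(List.of(w.get(j), w.get(j + 1)), 1, Integer::sum);
                }
            }
            List<String> best = null;
            for (Map.Entry<List<String>, Integer> e : counts.entrySet()) {
                if (best == null || e.getValue() > counts.get(best)) best = e.getKey();
            }
            // Stop early when no pair occurs at least twice.
            if (best == null || counts.get(best) < 2) break;
            merges.add(best);
            for (List<String> w : words) applyMerge(w, best);
        }
        return merges;
    }

    // Replace each left-to-right occurrence of the pair with its concatenation, in place.
    static void applyMerge(List<String> symbols, List<String> pair) {
        for (int j = 0; j + 1 < symbols.size(); j++) {
            if (symbols.get(j).equals(pair.get(0)) && symbols.get(j + 1).equals(pair.get(1))) {
                symbols.set(j, pair.get(0) + pair.get(1));
                symbols.remove(j + 1);
            }
        }
    }

    // Segment a word by replaying the learned merges in the order they were learned.
    static List<String> encode(String word, List<List<String>> merges) {
        List<String> symbols = toSymbols(word);
        for (List<String> pair : merges) applyMerge(symbols, pair);
        return symbols;
    }

    public static void main(String[] args) {
        List<List<String>> merges = learnMerges("low low low lower lowest", 2);
        System.out.println(encode("lowly", merges)); // prints [low, l, y]
    }
}
```

With two merges learned from this toy corpus, the unseen word "lowly" is split into the frequent subword "low" plus single characters, which is how BPE handles rare words without needing a vocabulary entry for each one.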
No commits in the last 6 months.
Use this if you are an NLP developer working with Java and need a fast, robust way to tokenize very large text datasets (1GB or more) using Byte-Pair Encoding.
Not ideal if you do not work in Java, or if your text processing does not involve Byte-Pair Encoding or large-scale tokenization.
Stars: 9
Forks: —
Language: Java
License: MIT
Category:
Last pushed: May 17, 2023
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/deepanprabhu/fastbpe"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
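The same request can be made from Java with the standard `java.net.http` client (Java 11 or later). This is a sketch under assumptions: the class name `QualityApiExample` and the `buildRequest` helper are hypothetical, the category/owner/repo path layout is inferred only from the single URL shown above, and because the response schema is not described here the body is printed as raw text rather than parsed.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Illustrative sketch; class and method names are hypothetical.
class QualityApiExample {
    static final String BASE = "https://pt-edge.onrender.com/api/v1/quality";

    // Build a GET request that mirrors the curl URL.
    // Assumption: the path is category/owner/repo, inferred from that one URL.
    static HttpRequest buildRequest(String category, String owner, String repo) {
        URI uri = URI.create(BASE + "/" + category + "/" + owner + "/" + repo);
        return HttpRequest.newBuilder(uri).GET().build();
    }

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response = client.send(
                buildRequest("nlp", "deepanprabhu", "fastbpe"),
                HttpResponse.BodyHandlers.ofString());
        // The response format is not documented on this page, so print it unparsed.
        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}
```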
Higher-rated alternatives
georg-jung/FastBertTokenizer
Fast and memory-efficient library for WordPiece tokenization as it is used by BERT.
ml-rust/splintr
A high-performance tokenizer (BPE + SentencePiece) built with Rust with Python bindings, focused...
sanderland/script_tok
Code for the paper "BPE stays on SCRIPT"
ash-01xor/bpe.c
Simple Byte-Pair Encoding mechanism used for the tokenization process, written purely in C.
U4RASD/r-bpe
R-BPE: Improving BPE-Tokenizers with Token Reuse