deepanprabhu/fastbpe

Java library implementing Byte-Pair Encoding Tokenization


Experimental

This library helps natural language processing (NLP) developers efficiently prepare large volumes of text data for machine learning models. It takes raw text as input and converts it into subword units using Byte-Pair Encoding (BPE), which lets rare words be represented as sequences of known subwords and keeps the vocabulary compact. It is aimed at NLP engineers and researchers building machine translation, text summarization, or other language models.
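To make the idea of "subword units" concrete, here is a minimal sketch of textbook BPE in Java. It is not fastbpe's API or implementation: the class name `BpeSketch` and the methods `toSymbols`, `applyMerge`, `learnMerges`, and `encode` are hypothetical names chosen for illustration. The sketch assumes Java 9 or later (for `List.of`), and it makes no attempt at the speed needed for gigabyte-scale input.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a textbook BPE sketch, not the fastbpe library's code.
public class BpeSketch {

    /** Splits a word into single-character symbols (UTF-16 code units, for simplicity). */
    public static List<String> toSymbols(String word) {
        List<String> symbols = new ArrayList<>();
        for (char c : word.toCharArray()) {
            symbols.add(String.valueOf(c));
        }
        return symbols;
    }

    /** Replaces every adjacent occurrence of (left, right) with the merged symbol. */
    public static List<String> applyMerge(List<String> symbols, String left, String right) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < symbols.size()) {
            boolean match = i + 1 < symbols.size()
                    && symbols.get(i).equals(left)
                    && symbols.get(i + 1).equals(right);
            if (match) {
                out.add(left + right);
                i += 2;
            } else {
                out.add(symbols.get(i));
                i += 1;
            }
        }
        return out;
    }

    /** Learns up to numMerges merge rules from a word-to-frequency table. */
    public static List<List<String>> learnMerges(Map<String, Integer> wordFreqs, int numMerges) {
        List<List<String>> words = new ArrayList<>();
        List<Integer> freqs = new ArrayList<>();
        for (Map.Entry<String, Integer> e : wordFreqs.entrySet()) {
            words.add(toSymbols(e.getKey()));
            freqs.add(e.getValue());
        }
        List<List<String>> merges = new ArrayList<>();
        for (int step = 0; step < numMerges; step++) {
            // Count adjacent symbol pairs, weighted by word frequency.
            Map<List<String>, Integer> pairCounts = new LinkedHashMap<>();
            for (int w = 0; w < words.size(); w++) {
                List<String> s = words.get(w);
                for (int i = 0; i + 1 < s.size(); i++) {
                    pairCounts.merge(List.of(s.get(i), s.get(i + 1)), freqs.get(w), Integer::sum);
                }
            }
            // Pick the most frequent pair; ties go to the pair counted first.
            List<String> best = null;
            int bestCount = 0;
            for (Map.Entry<List<String>, Integer> e : pairCounts.entrySet()) {
                if (e.getValue() > bestCount) {
                    best = e.getKey();
                    bestCount = e.getValue();
                }
            }
            if (best == null) {
                break; // no pairs left to merge
            }
            merges.add(best);
            for (int w = 0; w < words.size(); w++) {
                words.set(w, applyMerge(words.get(w), best.get(0), best.get(1)));
            }
        }
        return merges;
    }

    /** Encodes a word by replaying the learned merges in order. */
    public static List<String> encode(String word, List<List<String>> merges) {
        List<String> symbols = toSymbols(word);
        for (List<String> m : merges) {
            symbols = applyMerge(symbols, m.get(0), m.get(1));
        }
        return symbols;
    }
}
```

With the toy frequency table low=5, lower=2, newest=6, widest=3 and four merges, this sketch learns e+s, es+t, l+o, and lo+w, so the unseen word "lowest" is encoded as the two subwords "low" and "est". That is the rare-word behaviour described above: a word absent from the training table still decomposes into known pieces.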

No commits in the last 6 months.

Use this if you are an NLP developer working with Java and need a fast, robust way to tokenize very large text datasets (1GB or more) using Byte-Pair Encoding.

Not ideal if you do not work in Java, or if your text processing does not involve Byte-Pair Encoding or large-scale tokenization.

natural-language-processing text-tokenization machine-translation text-preprocessing nlp-engineering

Stale 6m, No Package, No Dependents

Maintenance 0 / 25

Adoption 5 / 25

Maturity 16 / 25

Community 0 / 25

Language: Java

License: MIT

Higher-rated alternatives

georg-jung/FastBertTokenizer

Fast and memory-efficient library for WordPiece tokenization as it is used by BERT.

ml-rust/splintr

A high-performance tokenizer (BPE + SentencePiece) built with Rust with Python bindings, focused...

sanderland/script_tok

Code for the paper "BPE stays on SCRIPT"

ash-01xor/bpe.c

Simple Byte-Pair Encoding mechanism used for tokenization, written purely in C.

U4RASD/r-bpe

R-BPE: Improving BPE-Tokenizers with Token Reuse
