deepanprabhu/fastbpe

Java library implementing Byte-Pair Encoding Tokenization

21 / 100 (Experimental)

This library helps natural language processing (NLP) developers prepare large volumes of text for machine learning models. It takes raw text as input and splits it into subword units using Byte-Pair Encoding (BPE), which improves the handling of rare words and keeps the vocabulary smaller than a word-level one. It is aimed at NLP engineers and researchers building machine translation, text summarization, or other language models.
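To make the subword idea concrete, here is a minimal sketch of Byte-Pair Encoding in Java. It is not fastbpe's API: the class name `BpeSketch` and the methods `learnMerges`, `applyMerge`, and `encode` are hypothetical names chosen for illustration, and the code favors clarity over the speed a 1GB corpus would need. It repeatedly merges the most frequent adjacent symbol pair (ties go to the pair seen first), then encodes a word by replaying the learned merges in order.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: hypothetical names, not taken from the fastbpe library.
public class BpeSketch {

    // Splits a word into single-character symbols (UTF-16 code units, a simplification).
    public static List<String> toChars(String word) {
        List<String> symbols = new ArrayList<>();
        for (char c : word.toCharArray()) {
            symbols.add(String.valueOf(c));
        }
        return symbols;
    }

    // Replaces each adjacent occurrence of (left, right) with the concatenated symbol.
    public static List<String> applyMerge(List<String> symbols, String left, String right) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < symbols.size()) {
            boolean match = i + 1 < symbols.size()
                    && symbols.get(i).equals(left)
                    && symbols.get(i + 1).equals(right);
            if (match) {
                out.add(left + right);
                i += 2;
            } else {
                out.add(symbols.get(i));
                i++;
            }
        }
        return out;
    }

    // Learns up to numMerges merge rules; each rule is a two-element list [left, right].
    public static List<List<String>> learnMerges(String corpus, int numMerges) {
        // Word (as a symbol list) -> frequency, in order of first appearance.
        Map<List<String>, Integer> vocab = new LinkedHashMap<>();
        for (String word : corpus.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                vocab.merge(toChars(word), 1, Integer::sum);
            }
        }
        List<List<String>> merges = new ArrayList<>();
        for (int step = 0; step < numMerges; step++) {
            // Count adjacent symbol pairs, weighted by word frequency.
            Map<List<String>, Integer> pairCounts = new LinkedHashMap<>();
            for (Map.Entry<List<String>, Integer> e : vocab.entrySet()) {
                List<String> s = e.getKey();
                for (int i = 0; i + 1 < s.size(); i++) {
                    pairCounts.merge(List.of(s.get(i), s.get(i + 1)), e.getValue(), Integer::sum);
                }
            }
            // Pick the most frequent pair; the strict comparison keeps the first one on ties.
            List<String> best = null;
            int bestCount = 0;
            for (Map.Entry<List<String>, Integer> e : pairCounts.entrySet()) {
                if (e.getValue() > bestCount) {
                    best = e.getKey();
                    bestCount = e.getValue();
                }
            }
            if (best == null) {
                break; // every word is a single symbol, nothing left to merge
            }
            merges.add(best);
            Map<List<String>, Integer> next = new LinkedHashMap<>();
            for (Map.Entry<List<String>, Integer> e : vocab.entrySet()) {
                next.put(applyMerge(e.getKey(), best.get(0), best.get(1)), e.getValue());
            }
            vocab = next;
        }
        return merges;
    }

    // Encodes one word by replaying the learned merges in the order they were learned.
    public static List<String> encode(String word, List<List<String>> merges) {
        List<String> symbols = toChars(word);
        for (List<String> m : merges) {
            symbols = applyMerge(symbols, m.get(0), m.get(1));
        }
        return symbols;
    }

    public static void main(String[] args) {
        List<List<String>> merges = learnMerges("low low lower lowest", 2);
        System.out.println(merges);                   // [[l, o], [lo, w]]
        System.out.println(encode("lowest", merges)); // [low, e, s, t]
    }
}
```

With two merges learned from this toy corpus, the unseen-as-a-whole suffixes stay as single characters while the shared stem `low` becomes one symbol, which is how BPE covers rare words with a bounded vocabulary.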

No commits in the last 6 months.

Use this if you are an NLP developer working with Java and need a fast, robust way to tokenize very large text datasets (1GB or more) using Byte-Pair Encoding.

Not ideal if you do not work in Java, or if your text processing does not involve Byte-Pair Encoding or large-scale tokenization.

natural-language-processing text-tokenization machine-translation text-preprocessing nlp-engineering
Stale 6m, No Package, No Dependents
Maintenance 0 / 25
Adoption 5 / 25
Maturity 16 / 25
Community 0 / 25


Stars: 9
Forks:
Language: Java
License: MIT
Category: bpe-tokenizers
Last pushed: May 17, 2023
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/nlp/deepanprabhu/fastbpe"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
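For Java users, the same request can be sketched with the JDK's built-in `java.net.http.HttpClient` (Java 11+). The class name `QualityApiExample` and the helper `buildRequest` are hypothetical. The response format is not documented here, so the sketch assumes only that the body can be read as text and prints it without parsing.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Illustrative only: a plain GET against the endpoint shown in the curl command.
public class QualityApiExample {

    // URL copied verbatim from the curl command above.
    public static final String ENDPOINT =
            "https://pt-edge.onrender.com/api/v1/quality/nlp/deepanprabhu/fastbpe";

    // Builds an unauthenticated GET request, the equivalent of `curl <url>` with no flags.
    public static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(30))
                .GET()
                .build();
    }

    public static void main(String[] args) {
        HttpClient client = HttpClient.newHttpClient();
        try {
            HttpResponse<String> response =
                    client.send(buildRequest(ENDPOINT), HttpResponse.BodyHandlers.ofString());
            System.out.println("HTTP " + response.statusCode());
            // Assumption: the body is text; its structure is not specified on this page.
            System.out.println(response.body());
        } catch (IOException e) {
            System.err.println("Request failed: " + e.getMessage());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Because the keyless tier is limited to 100 requests/day, a caller polling many repositories would likely want to cache responses rather than refetch on every run.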