chengchingwen/BytePairEncoding.jl
Julia implementation of Byte Pair Encoding for NLP
This package helps developers working with large language models break text into subword units, or 'tokens'. It takes raw text as input and outputs a list of tokens that can be fed into a model for training or analysis. Anyone building or fine-tuning Natural Language Processing models, especially those based on OpenAI's GPT series, may find it useful.
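The core idea of Byte Pair Encoding can be sketched in a few lines: repeatedly count adjacent symbol pairs across a corpus and merge the most frequent pair into a new token. The sketch below is a generic Python illustration of that algorithm, not this package's API (its Julia interface is not shown here); the corpus and merge count are made-up examples.

```python
from collections import Counter

def learn_bpe(corpus, num_merges):
    """Learn BPE merge rules from a list of words (generic illustration)."""
    # Represent each word as a tuple of symbols; start from single characters.
    vocab = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Merge the most frequent pair into a single new symbol.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = best[0] + best[1]
        new_vocab = Counter()
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

# Toy corpus: "lo" and then "low" are the most frequent adjacent pairs.
merges = learn_bpe(["low", "low", "lower", "lowest"], num_merges=2)
```

Trained merge rules like these are then applied in order to segment new text into tokens.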
No commits in the last 6 months.
Use this if you develop or integrate Natural Language Processing models in Julia and need to preprocess text with Byte Pair Encoding for tasks like text generation or understanding.
Not ideal if you don't work with text data or if your project doesn't use Julia for NLP.
Stars
27
Forks
4
Language
Julia
License
MIT
Category
Last pushed
Jun 15, 2024
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/chengchingwen/BytePairEncoding.jl"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
google/sentencepiece
Unsupervised text tokenizer for Neural Network-based text generation.
daac-tools/vibrato
🎤 vibrato: Viterbi-based accelerated tokenizer
OpenNMT/Tokenizer
Fast and customizable text tokenization library with BPE and SentencePiece support
Systemcluster/kitoken
Fast and versatile tokenizer for language models, compatible with SentencePiece, Tokenizers,...
soaxelbrooke/python-bpe
Byte Pair Encoding for Python!