ImadSaddik/DarijaTokenizers

Free-to-use tokenizers trained on the Darija language.

/ 100

Emerging

This project offers ready-to-use text tokenizers trained specifically on the Darija language. They take raw Darija text as input and convert it into sequences of numerical token IDs, the form of input that large language models (LLMs) are built and trained on. Developers working on natural language processing (NLP) applications for Darija speakers will find these tokenizers valuable.
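To make the text-to-token-ID step concrete, here is a minimal sketch of a toy byte-level BPE tokenizer in plain Python. It is not this project's implementation or API: the function names (`train_bpe`, `encode`, `decode`), the choice of byte-level BPE, and the sample sentence are all assumptions made for illustration.

```python
from collections import Counter


def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges from `text`, starting from raw UTF-8 bytes."""
    ids = list(text.encode("utf-8"))  # base vocabulary: byte values 0..255
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:  # nothing repeats any more, so stop early
            break
        ids = merge(ids, pair, 256 + len(merges))
        merges.append(pair)
    return merges


def encode(text, merges):
    """Convert text into token IDs by applying the merges in the order they were learned."""
    ids = list(text.encode("utf-8"))
    for rank, pair in enumerate(merges):
        ids = merge(ids, pair, 256 + rank)
    return ids


def decode(ids, merges):
    """Convert token IDs back into text."""
    vocab = {i: bytes([i]) for i in range(256)}
    for rank, (a, b) in enumerate(merges):
        vocab[256 + rank] = vocab[a] + vocab[b]
    return b"".join(vocab[i] for i in ids).decode("utf-8")


# Hypothetical sample sentence in Arabic script, used only for illustration.
sample = "السلام عليكم، لاباس عليك؟ كلشي بخير"
merges = train_bpe(sample, num_merges=20)
token_ids = encode(sample, merges)
print(token_ids)
print(decode(token_ids, merges) == sample)
```

A tokenizer trained on a real Darija corpus would learn merges that match frequent Darija character sequences, which is why text in another script or language tends to be split into many more tokens.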

No commits in the last 6 months.

Use this if you are a developer building an LLM or any NLP application for the Darija language and need an efficient way to convert text into numerical representations.

Not ideal if your text is written mainly in Latin letters or in languages other than Darija, as the tokenizers may perform poorly on such input.

Natural Language Processing · LLM Development · Darija Language · Text Preprocessing · AI Development

No License · Stale 6m · No Package · No Dependents

Maintenance 0 / 25

Adoption 5 / 25

Maturity 8 / 25

Community 18 / 25

Stars

Forks

Language

Python

License

—

Higher-rated alternatives

google/sentencepiece

Unsupervised text tokenizer for Neural Network-based text generation.

daac-tools/vibrato

🎤 vibrato: Viterbi-based accelerated tokenizer

OpenNMT/Tokenizer

Fast and customizable text tokenization library with BPE and SentencePiece support

Systemcluster/kitoken

Fast and versatile tokenizer for language models, compatible with SentencePiece, Tokenizers,...

soaxelbrooke/python-bpe

Byte Pair Encoding for Python!
