ImadSaddik/DarijaTokenizers
Free-to-use tokenizers trained on the Darija language.
This project offers ready-to-use text tokenizers trained specifically for Darija. Each tokenizer takes raw Darija text as input and converts it into a sequence of numerical token IDs, the input format needed to build and train large language models (LLMs). It is aimed at developers working on natural language processing (NLP) applications for Darija speakers.
No commits in the last 6 months.
Use this if you are a developer building an LLM or any NLP application for the Darija language and need an efficient way to convert text into numerical representations.
Not ideal if your text is written mostly in Latin letters or in a language other than Darija; tokenization quality may be suboptimal on such input.
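To make the text-to-token-ID conversion described above concrete, here is a minimal sketch of a toy word-level tokenizer in Python. It is not the project's actual algorithm or API; the function names (`build_vocab`, `encode`) and the sample sentences are hypothetical and chosen only for illustration. It also hints at why input the tokenizer was not trained on tends to fare poorly: every unseen word collapses to a single unknown-token ID.

```python
# Toy word-level tokenizer: an illustration only, NOT the DarijaTokenizers API.
# All names below are hypothetical.

UNKNOWN_TOKEN = "<unk>"


def build_vocab(corpus):
    """Assign an integer ID to every distinct whitespace-separated word."""
    vocab = {UNKNOWN_TOKEN: 0}
    for sentence in corpus:
        for word in sentence.split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab


def encode(text, vocab):
    """Convert text to a list of token IDs; unseen words map to the <unk> ID."""
    unknown_id = vocab[UNKNOWN_TOKEN]
    return [vocab.get(word, unknown_id) for word in text.split()]


# Tiny sample corpus in Arabic script (assumed example sentences).
training_corpus = ["السلام عليكم", "كيف داير"]
vocab = build_vocab(training_corpus)

in_domain_ids = encode("السلام داير", vocab)    # words seen during "training"
out_of_domain_ids = encode("hello world", vocab)  # Latin-script words never seen
```

Real tokenizers usually work on subword units rather than whole words, so unseen input degrades more gradually than in this sketch, but the underlying idea of mapping text to IDs from a learned vocabulary is the same.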
Stars: 11
Forks: 15
Language: Python
License: —
Category:
Last pushed: Mar 26, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/ImadSaddik/DarijaTokenizers"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
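The same request can be made from Python using only the standard library. The sketch below assumes the endpoint returns a JSON body, which the listing does not state; the helper names `build_url` and `fetch_quality` are hypothetical.

```python
import json
import urllib.request

# Base path taken from the curl command in the listing.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality/nlp"


def build_url(owner, repo):
    """Build the quality-data URL for a repository given its owner and name."""
    return f"{BASE_URL}/{owner}/{repo}"


def fetch_quality(owner, repo, timeout=10.0):
    """Fetch quality data for a repository.

    Assumption: the response body is JSON. If it is not, json.loads raises
    a ValueError and the raw bytes should be inspected instead.
    """
    with urllib.request.urlopen(build_url(owner, repo), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example call (performs a network request, so it is left commented out):
# data = fetch_quality("ImadSaddik", "DarijaTokenizers")
# print(data)
```

Because the keyless tier is limited to 100 requests/day, it is worth caching the result rather than calling `fetch_quality` repeatedly.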
Higher-rated alternatives
google/sentencepiece
Unsupervised text tokenizer for Neural Network-based text generation.
daac-tools/vibrato
🎤 vibrato: Viterbi-based accelerated tokenizer
OpenNMT/Tokenizer
Fast and customizable text tokenization library with BPE and SentencePiece support
Systemcluster/kitoken
Fast and versatile tokenizer for language models, compatible with SentencePiece, Tokenizers,...
soaxelbrooke/python-bpe
Byte Pair Encoding for Python!