CanvaChen/chinese-llama-tokenizer

Goal: build a compact, linguistically sound Llama tokenizer supporting Chinese, English, and Japanese.

Quality score: 27 / 100 (Experimental)

This tool helps large language model (LLM) developers build more linguistically appropriate models that can process text in Chinese, English, and Japanese. It takes raw text input and outputs a sequence of 'tokens' that represent the text in a way that aligns better with human language structures, especially for Chinese. This is for developers creating new LLMs or fine-tuning existing Llama3-compatible models for multilingual applications.

No commits in the last 6 months.

Use this if you are developing a new large language model and need a compact, linguistically sound tokenizer that efficiently handles Chinese, English, and Japanese text while maintaining compatibility with Llama3's dialogue structure.

Not ideal if you are looking for a tokenizer primarily optimized for English text encoding efficiency, as Llama3's native tokenizer performs better in that specific area.
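To see why a linguistically informed vocabulary matters for Chinese, compare it against plain byte fallback, where every CJK character costs three UTF-8 bytes and therefore three tokens. The sketch below is illustrative only (it is not this project's code): the vocabulary entries and the greedy longest-match lookup are hypothetical stand-ins for what a trained tokenizer learns.

```python
# Minimal sketch (not the project's actual code) of tokenization granularity:
# byte fallback spends 3 tokens per CJK character, while a word-level vocab
# entry covers a whole word with 1 token.

def byte_fallback_tokens(text: str) -> list[int]:
    """Tokenize as raw UTF-8 bytes, one token per byte."""
    return list(text.encode("utf-8"))

def word_level_tokens(text: str, vocab: dict[str, int]) -> list[int]:
    """Greedy longest-match lookup against a (hypothetical) word vocab,
    falling back to bytes for anything out of vocabulary."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest match first
            if text[i:j] in vocab:
                tokens.append(vocab[text[i:j]])
                i = j
                break
        else:
            # byte fallback: offset byte values past the word-vocab IDs
            tokens.extend(256 + b for b in text[i].encode("utf-8"))
            i += 1
    return tokens

vocab = {"分词器": 0, "中文": 1}  # hypothetical entries ("tokenizer", "Chinese")
text = "中文分词器"
print(len(byte_fallback_tokens(text)))       # 15 tokens (5 chars x 3 bytes)
print(len(word_level_tokens(text, vocab)))   # 2 tokens
```

A smaller token count per sentence means longer effective context and cheaper training for Chinese text, which is the efficiency this tokenizer targets.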

large-language-model-development natural-language-processing multilingual-ai text-tokenization ai-model-training
Status: Stale (6 months) · No package published · No dependents
Maintenance 0 / 25
Adoption 6 / 25
Maturity 16 / 25
Community 5 / 25


Stars: 20
Forks: 1
Language: Python
License: Apache-2.0
Last pushed: Jun 02, 2024
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/transformers/CanvaChen/chinese-llama-tokenizer"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.