CanvaChen/chinese-llama-tokenizer
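Goal: build a small, elegant llama tokenizer that is more linguistically sound and supports three languages: Chinese, English, and Japanese.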
This tool helps large language model (LLM) developers build more linguistically appropriate models that can process text in Chinese, English, and Japanese. It takes raw text input and outputs a sequence of 'tokens' that represent the text in a way that aligns better with human language structures, especially for Chinese. This is for developers creating new LLMs or fine-tuning existing Llama3-compatible models for multilingual applications.
No commits in the last 6 months.
Use this if you are developing a new large language model and need a compact, linguistically sound tokenizer that efficiently handles Chinese, English, and Japanese text while maintaining compatibility with Llama3's dialogue structure.
Not ideal if you are looking for a tokenizer primarily optimized for English text encoding efficiency, as Llama3's native tokenizer performs better in that specific area.
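One rough way to check the encoding-efficiency trade-off described above is to measure how many characters each token covers, per language. The sketch below is illustrative only. `chars_per_token` and `CharTokenizer` are hypothetical names, not part of this repository, and the only assumption about the tokenizer is that it exposes an `encode(text)` method returning a sequence of tokens (as Hugging Face tokenizers do).

```python
from typing import Iterable, Protocol, Sequence


class Encoder(Protocol):
    """Anything with an encode(text) method that returns a sequence of tokens."""

    def encode(self, text: str) -> Sequence: ...


def chars_per_token(tokenizer: Encoder, samples: Iterable[str]) -> float:
    """Average number of characters covered by one token (higher means more compact)."""
    total_chars = 0
    total_tokens = 0
    for text in samples:
        total_chars += len(text)
        total_tokens += len(tokenizer.encode(text))
    if total_tokens == 0:
        raise ValueError("no tokens produced; check the samples")
    return total_chars / total_tokens


class CharTokenizer:
    """Stand-in for the demo only: emits one token per character."""

    def encode(self, text: str) -> list:
        return list(text)


if __name__ == "__main__":
    # Replace CharTokenizer() with a real tokenizer object to get meaningful numbers.
    samples = {
        "zh": ["今天天气很好"],
        "en": ["The weather is nice today"],
        "ja": ["今日は天気がいいです"],
    }
    tokenizer = CharTokenizer()
    for lang, texts in samples.items():
        print(lang, round(chars_per_token(tokenizer, texts), 2))
```

When comparing two real tokenizers, use the same sample texts for both. Note that some tokenizers add special tokens during `encode` by default, which slightly lowers the ratio on short inputs.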
Stars: 20
Forks: 1
Language: Python
License: Apache-2.0
Category:
Last pushed: Jun 02, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/CanvaChen/chinese-llama-tokenizer"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
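The same request can be made from Python with the standard library alone. This is a minimal sketch: the URL pattern is taken from the curl command above, while the assumption that the endpoint returns JSON is not confirmed by this page. `quality_url` and `fetch_quality` are hypothetical helper names.

```python
import json
import urllib.request
from urllib.parse import quote

BASE_URL = "https://pt-edge.onrender.com/api/v1/quality/transformers"


def quality_url(owner: str, repo: str) -> str:
    """Build the endpoint URL for an owner/repo pair, percent-encoding each part."""
    return f"{BASE_URL}/{quote(owner, safe='')}/{quote(repo, safe='')}"


def fetch_quality(owner: str, repo: str, timeout: float = 10.0):
    """Fetch the quality record. Assumes (unverified) that the body is JSON."""
    request = urllib.request.Request(
        quality_url(owner, repo),
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Performs a real network call; the keyless tier is limited to 100 requests/day.
    print(fetch_quality("CanvaChen", "chinese-llama-tokenizer"))
```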
Higher-rated alternatives
shibing624/MedicalGPT
MedicalGPT: Training Your Own Medical GPT Model with ChatGPT Training Pipeline....
lyogavin/airllm
AirLLM 70B inference with single 4GB GPU
GradientHQ/parallax
Parallax is a distributed model serving framework that lets you build your own AI cluster anywhere
CrazyBoyM/llama3-Chinese-chat
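Repository of Chinese post-trained Llama3 and Llama3.1 models: fine-tuned and modded weights, plus training, inference, evaluation, and deployment tutorial videos and documentation.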
CLUEbenchmark/CLUE
Chinese Language Understanding Evaluation Benchmark: datasets, baselines, pre-trained...