CanvaChen/chinese-llama-tokenizer

Goal: build a compact, linguistically sound Llama tokenizer supporting Chinese, English, and Japanese.

Quality score: 27 / 100 (Experimental)

This tool helps large language model (LLM) developers build more linguistically appropriate models that can process text in Chinese, English, and Japanese. It takes raw text input and outputs a sequence of 'tokens' that represent the text in a way that aligns better with human language structures, especially for Chinese. This is for developers creating new LLMs or fine-tuning existing Llama3-compatible models for multilingual applications.

No commits in the last 6 months.

Use this if you are developing a new large language model and need a compact, linguistically sound tokenizer that efficiently handles Chinese, English, and Japanese text while maintaining compatibility with Llama3's dialogue structure.

Not ideal if you are looking for a tokenizer primarily optimized for English text encoding efficiency, as Llama3's native tokenizer performs better in that specific area.
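To see why a linguistically informed vocabulary matters for Chinese, compare it against plain byte fallback, where every CJK character costs three UTF-8 bytes and therefore three tokens. The sketch below is illustrative only (it is not this project's code): the vocabulary entries and the greedy longest-match lookup are hypothetical stand-ins for what a trained tokenizer learns.

```python
# Minimal sketch (not the project's actual code) of tokenization granularity:
# byte fallback spends 3 tokens per CJK character, while a word-level vocab
# entry covers a whole word with 1 token.

def byte_fallback_tokens(text: str) -> list[int]:
    """Tokenize as raw UTF-8 bytes, one token per byte."""
    return list(text.encode("utf-8"))

def word_level_tokens(text: str, vocab: dict[str, int]) -> list[int]:
    """Greedy longest-match lookup against a (hypothetical) word vocab,
    falling back to bytes for anything out of vocabulary."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest match first
            if text[i:j] in vocab:
                tokens.append(vocab[text[i:j]])
                i = j
                break
        else:
            # byte fallback: offset byte values past the word-vocab IDs
            tokens.extend(256 + b for b in text[i].encode("utf-8"))
            i += 1
    return tokens

vocab = {"分词器": 0, "中文": 1}  # hypothetical entries ("tokenizer", "Chinese")
text = "中文分词器"
print(len(byte_fallback_tokens(text)))       # 15 tokens (5 chars x 3 bytes)
print(len(word_level_tokens(text, vocab)))   # 2 tokens
```

A smaller token count per sentence means longer effective context and cheaper training for Chinese text, which is the efficiency this tokenizer targets.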

large-language-model-development natural-language-processing multilingual-ai text-tokenization ai-model-training
Status: Stale (6 months) · No package published · No dependents
Maintenance 0 / 25
Adoption 6 / 25
Maturity 16 / 25
Community 5 / 25


Stars: 20
Forks: 1
Language: Python
License: Apache-2.0
Last pushed: Jun 02, 2024
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/transformers/CanvaChen/chinese-llama-tokenizer"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.