chaoswork/sft_datasets

A curated collection of open-source SFT datasets, updated on an ongoing basis.

33 / 100 (Emerging)

This project compiles a variety of open-source datasets designed for training large language models. It provides structured collections of text-based data, ranging from general instructions and multi-turn conversations to specialized tasks like mathematical reasoning, code generation, and financial question-answering. The primary users are researchers, students, and practitioners working on developing or fine-tuning AI models, especially those with a focus on natural language processing in Chinese or English.

571 stars. No commits in the last 6 months.

Use this if you are developing or fine-tuning a large language model and need diverse, pre-collected datasets for instruction-following, task completion, or dialogue generation.

Not ideal if you are looking for ready-to-use AI models, a simple API for text generation, or datasets primarily focused on non-textual data.

AI model training, natural language processing, machine learning datasets, language model fine-tuning, instruction tuning
No License, Stale 6m, No Package, No Dependents
Maintenance 0 / 25
Adoption 10 / 25
Maturity 8 / 25
Community 15 / 25
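
The page does not state how the overall score is derived. The four category scores listed here do add up to the 33 / 100 overall score, so a plain sum is one plausible reading. A minimal sketch under that assumption (the dictionary name is illustrative, not part of the site):

```python
# Category scores copied from the listing; each is out of 25.
category_scores = {
    "Maintenance": 0,
    "Adoption": 10,
    "Maturity": 8,
    "Community": 15,
}

# Assumption: the overall score is the unweighted sum of the categories.
total = sum(category_scores.values())
maximum = 25 * len(category_scores)

print(f"{total} / {maximum}")  # prints "33 / 100"
```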


Stars: 571
Forks: 42
Language: not listed
License: none
Last pushed: Jun 02, 2023
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/nlp/chaoswork/sft_datasets"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
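
The same request can be made from a script. The sketch below uses only the Python standard library and rests on two assumptions that the page does not confirm: that the URL path follows a category/owner/repo pattern (inferred from the single curl example), and that the response body is JSON. The function names `quality_url` and `fetch_quality` are hypothetical helpers, not part of any published client.

```python
import json
import urllib.request

# Base path taken from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(category: str, owner: str, repo: str) -> str:
    """Build the endpoint URL.

    Assumption: the path is <category>/<owner>/<repo>, as the one
    documented example suggests.
    """
    return f"{BASE_URL}/{category}/{owner}/{repo}"


def fetch_quality(category: str, owner: str, repo: str, timeout: float = 10.0):
    """Send an unauthenticated GET request and decode the body.

    Assumption: the endpoint returns JSON; the response schema is not
    documented on this page, so the result is returned unmodified.
    """
    url = quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example call (performs a network request, counts against the daily limit):
# data = fetch_quality("nlp", "chaoswork", "sft_datasets")
```

With the keyless limit of 100 requests/day, it is worth caching results locally rather than refetching on every run.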