chaoswork/sft_datasets

A curated collection of open-source SFT datasets, updated on an ongoing basis.

/ 100

Emerging

This project compiles a variety of open-source datasets designed for training large language models. It provides structured collections of text-based data, ranging from general instructions and multi-turn conversations to specialized tasks like mathematical reasoning, code generation, and financial question-answering. The primary users are researchers, students, and practitioners working on developing or fine-tuning AI models, especially those with a focus on natural language processing in Chinese or English.

571 stars. No commits in the last 6 months.

Use this if you are developing or fine-tuning a large language model and need diverse, pre-collected datasets for instruction-following, task completion, or dialogue generation.

Not ideal if you are looking for ready-to-use AI models, a simple API for text generation, or datasets primarily focused on non-textual data.

AI model training, natural language processing, machine learning datasets, language model fine-tuning, instruction tuning

No License, Stale 6m, No Package, No Dependents

Maintenance 0 / 25

Adoption 10 / 25

Maturity 8 / 25

Community 15 / 25



Higher-rated alternatives

EQTPartners/PTEC

Code repository corresponding to the paper "Prompt Tuned Embedding Classification for...

ImadSaddik/BoDmaghDataset

BoDmagh dataset is a Supervised Fine-Tuning (SFT) dataset for the Darija language

angeluriot/French_instruct

A dataset of instructions and answers in natural language for machine learning.

andrewzamai/SLIMER

Show Less, Instruct More: Enriching Prompts with Definitions and Guidelines for Zero-Shot NER
