chaoswork/sft_datasets
A curated collection of open-source SFT datasets, updated on an ongoing basis
This project compiles a variety of open-source datasets designed for training large language models. It provides structured collections of text-based data, ranging from general instructions and multi-turn conversations to specialized tasks like mathematical reasoning, code generation, and financial question-answering. The primary users are researchers, students, and practitioners working on developing or fine-tuning AI models, especially those with a focus on natural language processing in Chinese or English.
571 stars. No commits in the last 6 months.
Use this if you are developing or fine-tuning a large language model and need diverse, pre-collected datasets for instruction-following, task completion, or dialogue generation.
Not ideal if you are looking for ready-to-use AI models, a simple API for text generation, or datasets primarily focused on non-textual data.
Stars: 571
Forks: 42
Language: not specified
License: not specified
Category: not listed
Last pushed: Jun 02, 2023
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/chaoswork/sft_datasets"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
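The same request can be made from a script. The following is a minimal Python sketch, not an official client: the helper names `quality_url` and `fetch_quality` are hypothetical, and the path layout (`/quality/<domain>/<owner>/<repo>`) is only inferred from the single curl example above. The response format is not documented on this page, so the sketch returns the raw body as text rather than assuming a schema.

```python
import urllib.request

# Base path copied from the curl example; everything after it is an
# assumption inferred from that one URL.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(repo: str, domain: str = "nlp") -> str:
    """Build the endpoint URL for an "owner/name" repository (hypothetical helper)."""
    return f"{BASE_URL}/{domain}/{repo.strip('/')}"


def fetch_quality(repo: str, domain: str = "nlp", timeout: float = 10.0) -> str:
    """Fetch the listing data and return the raw response body as text.

    No parsing is attempted because the response schema is not stated here.
    """
    with urllib.request.urlopen(quality_url(repo, domain), timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset)


# Example (performs a network request and counts against the daily limit):
# print(fetch_quality("chaoswork/sft_datasets"))
```

Because unauthenticated access is capped at 100 requests/day, a script that polls many repositories would likely want to cache responses locally.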
Higher-rated alternatives
EQTPartners/PTEC
Code repository corresponding to the paper "Prompt Tuned Embedding Classification for...
ImadSaddik/BoDmaghDataset
BoDmagh dataset is a Supervised Fine-Tuning (SFT) dataset for the Darija language
angeluriot/French_instruct
A dataset of instructions and answers in natural language for machine learning.
andrewzamai/SLIMER
Show Less, Instruct More: Enriching Prompts with Definitions and Guidelines for Zero-Shot NER