magpie-align/magpie
[ICLR 2025] Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing. Your efficient and high-quality synthetic data generation pipeline!
Magpie helps AI researchers and developers create high-quality instruction-following datasets for large language models (LLMs). It takes an existing, aligned LLM as input and generates synthetic user queries and appropriate LLM responses. This eliminates the need for manual prompt engineering or seed questions, making the process of generating alignment data efficient.
834 stars. No commits in the last 6 months.
Use this if you are developing or fine-tuning LLMs and need a large, high-quality dataset of user prompts and ideal model responses for instruction alignment.
Not ideal if you are looking for a general-purpose LLM to chat with, or if you need to generate specific content based on custom prompts.
Stars
834
Forks
67
Language
Python
License
MIT
Category
Last pushed
Mar 17, 2025
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/magpie-align/magpie"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
mlabonne/llm-datasets
Curated list of datasets and tools for post-training.
malteos/llm-datasets
A collection of datasets for language model pretraining including scripts for downloading,...
jd-coderepos/llms4subjects
The official SemEval 2025 Task 5 - LLMs4Subjects - Shared Task Dataset repository
willxxy/ECG-Bench
A Unified Framework for Benchmarking Generative Electrocardiogram-Language Models (ELMs)
geobrain-ai/geogalactica
Code and datasets for paper "GeoGalactica: A Scientific Large Language Model in Geoscience"