magpie-align/magpie

[ICLR 2025] Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing. Your efficient and high-quality synthetic data generation pipeline!

/ 100

Emerging

Magpie helps AI researchers and developers create high-quality instruction-following datasets for large language models (LLMs). It takes an existing, aligned LLM as input and generates synthetic user queries and appropriate LLM responses. This eliminates the need for manual prompt engineering or seed questions, making the process of generating alignment data efficient.

834 stars. No commits in the last 6 months.

Use this if you are developing or fine-tuning LLMs and need a large, high-quality dataset of user prompts and ideal model responses for instruction alignment.

Not ideal if you are looking for a general-purpose LLM to chat with, or if you need to generate specific content based on custom prompts.

LLM development AI model training synthetic data generation instruction tuning AI alignment

Stale 6m No Package No Dependents

Maintenance 0 / 25

Adoption 10 / 25

Maturity 16 / 25

Community 17 / 25

How are scores calculated?

Stars

834

Forks

Language

Python

License

MIT

Higher-rated alternatives

mlabonne/llm-datasets

Curated list of datasets and tools for post-training.

malteos/llm-datasets

A collection of datasets for language model pretraining including scripts for downloading,...

jd-coderepos/llms4subjects

The official SemEval 2025 Task 5 - LLMs4Subjects - Shared Task Dataset repository

willxxy/ECG-Bench

A Unified Framework for Benchmarking Generative Electrocardiogram-Language Models (ELMs)

geobrain-ai/geogalactica

Code and datasets for paper "GeoGalactica: A Scientific Large Language Model in Geoscience"

Explore Transformer Models

All categories Trending Transformer directory Insights