meta-llama/synthetic-data-kit
A tool for generating high-quality synthetic datasets
The Synthetic Data Kit helps AI practitioners create high-quality, structured datasets for fine-tuning Large Language Models (LLMs). It takes unstructured inputs like PDFs, text files, or even YouTube transcripts and transforms them into formatted QA pairs or reasoning traces. This tool is designed for AI trainers, data scientists, or researchers who need specific, curated data to improve their LLM's performance for particular tasks.
1,524 stars. Available on PyPI.
Use this if you need to quickly generate tailored, high-quality synthetic data, such as question-answer pairs or reasoning chains, from various document types to fine-tune your Llama models or other LLMs.
Not ideal if you're looking for a general-purpose data labeling tool for human annotators, or if your primary need is for pre-training large language models from scratch.
Stars: 1,524
Forks: 215
Language: Python
License: MIT
Category:
Last pushed: Oct 28, 2025
Commits (30d): 0
Dependencies: 15
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/ml-frameworks/meta-llama/synthetic-data-kit"
Open to everyone: 100 requests/day with no key required. A free key raises the limit to 1,000 requests/day.
Related frameworks
Diyago/Tabular-data-generation
GANs are well known for their success in realistic image generation. However, they can be applied in...
Data-Centric-AI-Community/ydata-synthetic
Synthetic data generators for tabular and time-series data
tdspora/syngen
Open-source version of the TDspora synthetic data generation algorithm.
vanderschaarlab/synthcity
A library for generating and evaluating synthetic tabular data for privacy, fairness and data...
always-further/deepfabric
Generate High-Quality Synthetics, Train, Measure, and Evaluate in a Single Pipeline