meta-llama/synthetic-data-kit
A tool for generating high-quality synthetic datasets
The Synthetic Data Kit helps AI practitioners create high-quality, structured datasets for fine-tuning Large Language Models (LLMs). It takes unstructured inputs like PDFs, text files, or even YouTube transcripts and transforms them into formatted QA pairs or reasoning traces. This tool is designed for AI trainers, data scientists, or researchers who need specific, curated data to improve their LLM's performance for particular tasks.
1,524 stars. Available on PyPI.
Use this if you need to quickly generate tailored, high-quality synthetic data, such as question-answer pairs or reasoning chains, from various document types to fine-tune your Llama models or other LLMs.
Not ideal if you're looking for a general-purpose data labeling tool for human annotators, or if your primary need is for pre-training large language models from scratch.
Stars: 1,524
Forks: 215
Language: Python
License: MIT
Category:
Last pushed: Oct 28, 2025
Commits (30d): 0
Dependencies: 15
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/ml-frameworks/meta-llama/synthetic-data-kit"
Open to everyone: 100 requests/day with no key required. A free key raises the limit to 1,000 requests/day.
Related frameworks
Diyago/Tabular-data-generation
GANs are well known for their success in realistic image generation. However, they can be applied in...
Data-Centric-AI-Community/ydata-synthetic
Synthetic data generators for tabular and time-series data
tdspora/syngen
Open-source version of the TDspora synthetic data generation algorithm.
vanderschaarlab/synthcity
A library for generating and evaluating synthetic tabular data for privacy, fairness and data...
always-further/deepfabric
Generate High-Quality Synthetics, Train, Measure, and Evaluate in a Single Pipeline