KodCode-AI/kodcode
✨ A synthetic dataset generation framework that produces diverse coding questions and verifiable solutions, all in one framework
This framework helps AI researchers and developers create robust datasets for training large language models (LLMs) on coding tasks. It takes inputs such as coding prompts, human-written questions, or code snippets, and generates a diverse set of coding questions together with verifiable solutions and tests. The output is a synthetic dataset suitable for fine-tuning LLMs to improve their code generation and problem-solving abilities.
312 stars. No commits in the last 6 months.
Use this if you need to generate a large, varied, and automatically verifiable dataset of coding questions and solutions to train or evaluate code-focused AI models.
Not ideal if you are looking for a pre-trained model for code generation, or if your primary need is for datasets in domains other than coding.
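To make "verifiable" concrete, here is a minimal sketch of how a generated sample might be checked: the solution is executed, the tests are run against it, and only passing samples are kept. This is not KodCode's actual code; the `Sample` class and the `passes_tests` and `filter_verified` helpers are hypothetical names invented for illustration, and the sketch assumes tests are plain `assert` statements.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """Hypothetical container for one generated item (not a KodCode type)."""
    question: str
    solution: str  # Python source that defines the solution function
    tests: str     # Python source made of plain assert statements


def passes_tests(sample: Sample) -> bool:
    """Return True if the solution runs and every assertion in the tests holds."""
    namespace: dict = {}
    try:
        # Illustration only: exec runs code in-process. Untrusted generated
        # code should be run in a sandboxed subprocess instead.
        exec(sample.solution, namespace)  # define the solution
        exec(sample.tests, namespace)     # run the assertions against it
    except Exception:
        return False
    return True


def filter_verified(samples: list[Sample]) -> list[Sample]:
    """Keep only the samples whose solution passes its own tests."""
    return [s for s in samples if passes_tests(s)]


good = Sample(
    question="Add two integers.",
    solution="def add(a, b):\n    return a + b\n",
    tests="assert add(2, 3) == 5\n",
)
bad = Sample(
    question="Add two integers.",
    solution="def add(a, b):\n    return a - b\n",
    tests="assert add(2, 3) == 5\n",
)
verified = filter_verified([good, bad])
```

Here `verified` contains only `good`, because the second solution fails its assertion and is dropped.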
Stars: 312
Forks: 18
Language: Python
License: Apache-2.0
Category:
Last pushed: Sep 06, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/generative-ai/KodCode-AI/kodcode"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
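The same request can be made from Python using only the standard library. The sketch below rests on two assumptions that the page does not confirm: that the three path segments after `/quality/` are a category, a repository owner, and a repository name, and that the endpoint returns JSON. `build_url` and `fetch_quality` are hypothetical helper names.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_url(category: str, owner: str, repo: str) -> str:
    """Build the request URL; the category/owner/repo layout is an assumption."""
    parts = (category, owner, repo)
    return BASE_URL + "/" + "/".join(quote(p, safe="") for p in parts)


def fetch_quality(category: str, owner: str, repo: str, timeout: float = 10.0):
    """Fetch and parse the response, assuming the endpoint returns JSON."""
    with urlopen(build_url(category, owner, repo), timeout=timeout) as response:
        return json.load(response)


# Usage (performs a network request, counted against the daily limit):
# data = fetch_quality("generative-ai", "KodCode-AI", "kodcode")
```

Keeping URL construction separate from the network call makes it possible to check the URL without spending any of the daily request quota.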
Higher-rated alternatives
sdv-dev/SDV
Synthetic data generation for tabular data
sdv-dev/SDGym
Benchmarking synthetic data generation methods.
NVIDIA-NeMo/DataDesigner
🎨 NeMo Data Designer: A general library for generating high-quality synthetic data from scratch...
AlexanderVNikitin/tsgm
Generation and evaluation of synthetic time series datasets (also, augmentations,...
mostly-ai/mostlyai
Synthetic Data SDK ✨