yigitkonur/cli-finetune-dataset
Weighted, category-balanced dataset builder for LLM fine-tuning
When fine-tuning a large language model (LLM), you often have many conversation examples grouped into categories but need a single, balanced dataset for training. This tool takes a directory of categorized conversation files and combines them into one shuffled dataset in which each category contributes a specified, weighted proportion. It is designed for machine learning engineers and researchers preparing custom datasets for LLM fine-tuning.
Use this if you need to create a finely balanced dataset for LLM fine-tuning from multiple JSONL files, ensuring specific categories are represented at controlled proportions.
Not ideal if your input data isn't in OpenAI chat-format JSONL files or you don't need to balance categories by weight.
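The weighted balancing described above can be illustrated with a minimal Python sketch. This is not the tool's actual code or CLI. The function names `load_categories` and `build_balanced_dataset`, the one-file-per-category layout, the rounding of quotas, and the choice to oversample small categories with replacement are all assumptions made for illustration.

```python
import json
import random
from pathlib import Path


def load_categories(input_dir):
    """Read every *.jsonl file in input_dir (hypothetical layout: one file per category).

    The file stem is used as the category name; blank lines are skipped.
    """
    categories = {}
    for path in sorted(Path(input_dir).glob("*.jsonl")):
        with path.open(encoding="utf-8") as fh:
            categories[path.stem] = [json.loads(line) for line in fh if line.strip()]
    return categories


def build_balanced_dataset(categories, weights, total, seed=0):
    """Return roughly `total` examples, with each category's share set by its weight.

    Assumption: quotas are rounded, so the final size can differ slightly from `total`.
    """
    rng = random.Random(seed)  # seeded so the same inputs give the same dataset
    weight_sum = sum(weights[name] for name in categories)
    dataset = []
    for name in sorted(categories):
        examples = categories[name]
        if not examples:
            raise ValueError(f"category {name!r} has no examples")
        quota = round(total * weights[name] / weight_sum)
        if quota <= len(examples):
            # Enough data: sample without replacement.
            picked = rng.sample(examples, quota)
        else:
            # Too little data: oversample with replacement (an assumption of this sketch).
            picked = [rng.choice(examples) for _ in range(quota)]
        dataset.extend(picked)
    rng.shuffle(dataset)  # interleave the categories
    return dataset
```

For example, with weights `{"coding": 3, "chat": 1}` and `total=8`, this sketch draws 6 coding examples and 2 chat examples before shuffling. Each record is passed through unchanged, so chat-format JSONL input stays chat-format on output.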
Stars: 16
Forks: 2
Language: Python
License: not listed
Category: not listed
Last pushed: Feb 21, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/ml-frameworks/yigitkonur/cli-finetune-dataset"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
Higher-rated alternatives
limix-ldm-ai/LimiX
LimiX: Unleashing Structured-Data Modeling Capability for Generalist Intelligence...
tatsu-lab/stanford_alpaca
Code and documentation to train Stanford's Alpaca models, and generate the data.
google-research/plur
PLUR (Programming-Language Understanding and Repair) is a collection of source code datasets...
YalaLab/pillar-finetune
Finetuning framework for Pillar medical imaging models.
thuml/LogME
Code release for "LogME: Practical Assessment of Pre-trained Models for Transfer Learning" (ICML...