yigitkonur/cli-finetune-dataset

weighted category-balanced dataset builder for LLM fine-tuning


Emerging

When fine-tuning a large language model (LLM), you often have conversation examples grouped into categories but need a single, balanced dataset for training. This tool takes a directory of categorized conversation files and combines them into one shuffled dataset in which each category contributes a specified, weighted proportion. It is designed for machine learning engineers and researchers preparing custom datasets for LLM fine-tuning.
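The weighted mixing described above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual implementation: the function name `build_balanced_dataset`, its parameters, and the choice to oversample with replacement when a category is too small are all assumptions made for the example.

```python
import random


def build_balanced_dataset(categories, weights, total, seed=0):
    """Sample about `total` examples so each category's share follows `weights`.

    categories: dict mapping category name -> list of examples
    weights:    dict mapping category name -> non-negative weight
    (Hypothetical helper; names and behavior are assumptions.)
    """
    rng = random.Random(seed)
    weight_sum = sum(weights[name] for name in categories)
    if weight_sum <= 0:
        raise ValueError("weights must sum to a positive number")

    mixed = []
    for name, examples in categories.items():
        # Target count for this category; rounding means the final size
        # can differ from `total` by a few examples.
        target = round(total * weights[name] / weight_sum)
        if target > len(examples):
            # Assumption: oversample with replacement if the category is too small.
            picked = rng.choices(examples, k=target)
        else:
            picked = rng.sample(examples, target)
        mixed.extend((name, example) for example in picked)

    # Shuffle so categories are interleaved rather than grouped.
    rng.shuffle(mixed)
    return mixed
```

With weights of 3 and 1 and `total=40`, for instance, the first category would contribute 30 examples and the second 10, interleaved in a seed-dependent order.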

Use this if you need to build a balanced LLM fine-tuning dataset from multiple JSONL files, with each category represented at a controlled proportion.

Not ideal if your input data isn't in OpenAI chat-format JSONL files or you don't need to balance categories by weight.
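To check whether a file fits that input format, a rough validator might look like the sketch below. It assumes the usual OpenAI chat layout, where each JSONL line is an object with a `messages` list of role-tagged entries; the helper name `is_chat_record` is hypothetical, and the tool's own validation rules may be stricter or looser.

```python
import json


def is_chat_record(line):
    """Return True if `line` is a JSON object with a non-empty `messages`
    list whose entries are dicts carrying a string `role`.
    (Hypothetical helper; the tool's real checks are not documented here.)
    """
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return False
    if not isinstance(record, dict):
        return False
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return False
    return all(
        isinstance(m, dict) and isinstance(m.get("role"), str)
        for m in messages
    )
```

Running this over each line of a candidate file gives a quick signal of whether it needs conversion first.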

LLM fine-tuning, dataset preparation, natural language processing, machine learning engineering

No License, No Package, No Dependents

Maintenance 10 / 25

Adoption 6 / 25

Maturity 8 / 25

Community 9 / 25

Language

Python
