OFA-Sys/DiverseEvol
Self-Evolved Diverse Data Sampling for Efficient Instruction Tuning
This project helps machine learning engineers train large language models (LLMs) efficiently by selecting the most impactful training data. You provide a large instruction dataset and an LLM; it outputs a smaller, highly diverse subset of that data and an instruction-tuned LLM that performs as well as, or better than, models trained on the full dataset. It is aimed at professionals building and deploying custom LLMs who need to optimize training time and resources.
No commits in the last 6 months.
Use this if you are developing custom large language models and want to significantly reduce the data volume and computational cost of instruction tuning without sacrificing performance.
Not ideal if you are a casual user of off-the-shelf LLMs or do not have the technical expertise to manage model training environments and configurations.
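To make the "diverse subset" idea concrete, here is an illustrative sketch of one common diversity-based selection technique, k-center greedy over instruction embeddings. This is a generic example of the approach, not necessarily the exact algorithm this repository implements; the embedding source and `k` are assumptions.

```python
# Illustrative k-center greedy selection: repeatedly pick the instruction
# whose embedding is farthest from everything already selected, so the
# subset spreads out over the embedding space.
# NOTE: a generic sketch of diversity sampling, not this repo's exact method.
import numpy as np

def k_center_greedy(embeddings: np.ndarray, k: int, seed: int = 0) -> list[int]:
    """Return indices of k points that cover the embedding space."""
    rng = np.random.default_rng(seed)
    n = embeddings.shape[0]
    selected = [int(rng.integers(n))]  # arbitrary starting center
    # distance of every point to its nearest selected center
    dists = np.linalg.norm(embeddings - embeddings[selected[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dists))  # farthest point from current centers
        selected.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(embeddings - embeddings[nxt], axis=1))
    return selected

# Toy usage: pick 3 diverse rows out of 100 random "instruction embeddings".
emb = np.random.default_rng(1).normal(size=(100, 16))
subset = k_center_greedy(emb, k=3)
```

In practice the embeddings would come from the model being tuned, and selection would repeat across training rounds so the subset evolves with the model.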
Stars: 86
Forks: 4
Language: Python
License: —
Category: —
Last pushed: Dec 14, 2023
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/OFA-Sys/DiverseEvol"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
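The same endpoint can be called from Python. A minimal sketch, assuming the endpoint returns JSON (the response schema is not documented here, and `quality_url` / `fetch_quality` are hypothetical helper names):

```python
# Sketch of calling the quality-data endpoint from the curl example above.
# Assumes a JSON response; the schema is not documented on this page.
import json
from urllib.request import urlopen

API_BASE = "https://pt-edge.onrender.com/api/v1/quality"

def quality_url(ecosystem: str, owner: str, repo: str) -> str:
    # Hypothetical helper: builds the path shown in the curl example.
    return f"{API_BASE}/{ecosystem}/{owner}/{repo}"

def fetch_quality(ecosystem: str, owner: str, repo: str) -> dict:
    # Network call; subject to the 100 requests/day keyless rate limit.
    with urlopen(quality_url(ecosystem, owner, repo)) as resp:
        return json.load(resp)

url = quality_url("transformers", "OFA-Sys", "DiverseEvol")
```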
Higher-rated alternatives
DaoD/INTERS
This is the repository for our paper "INTERS: Unlocking the Power of Large Language Models in...
declare-lab/instruct-eval
This repository contains code to quantitatively evaluate instruction-tuned models such as Alpaca...
Haiyang-W/TokenFormer
[ICLR2025 Spotlight] Official Implementation of TokenFormer: Rethinking Transformer Scaling...
hkust-nlp/deita
Deita: Data-Efficient Instruction Tuning for Alignment [ICLR2024]
kehanlu/DeSTA2
Code and model for ICASSP 2025 Paper "Developing Instruction-Following Speech Language Model...