OFA-Sys/DiverseEvol
Self-Evolved Diverse Data Sampling for Efficient Instruction Tuning
This project helps machine learning engineers train large language models (LLMs) efficiently by selecting the most impactful training data. You provide a large instruction dataset and an LLM; it outputs a smaller, highly diverse subset of that data and an instruction-tuned LLM that performs as well as, or better than, models trained on the full dataset. It is aimed at professionals building and deploying custom LLMs who need to optimize training time and resources.
No commits in the last 6 months.
Use this if you are developing custom large language models and want to significantly reduce the data volume and computational cost of instruction tuning without sacrificing performance.
Not ideal if you are a casual user of off-the-shelf LLMs or do not have the technical expertise to manage model training environments and configurations.
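To make the "diverse subset" idea concrete, here is an illustrative sketch of one common diversity-based selection technique, k-center greedy over instruction embeddings. This is a generic example of the approach, not necessarily the exact algorithm this repository implements; the embedding source and `k` are assumptions.

```python
# Illustrative k-center greedy selection: repeatedly pick the instruction
# whose embedding is farthest from everything already selected, so the
# subset spreads out over the embedding space.
# NOTE: a generic sketch of diversity sampling, not this repo's exact method.
import numpy as np

def k_center_greedy(embeddings: np.ndarray, k: int, seed: int = 0) -> list[int]:
    """Return indices of k points that cover the embedding space."""
    rng = np.random.default_rng(seed)
    n = embeddings.shape[0]
    selected = [int(rng.integers(n))]  # arbitrary starting center
    # distance of every point to its nearest selected center
    dists = np.linalg.norm(embeddings - embeddings[selected[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dists))  # farthest point from current centers
        selected.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(embeddings - embeddings[nxt], axis=1))
    return selected

# Toy usage: pick 3 diverse rows out of 100 random "instruction embeddings".
emb = np.random.default_rng(1).normal(size=(100, 16))
subset = k_center_greedy(emb, k=3)
```

In practice the embeddings would come from the model being tuned, and selection would repeat across training rounds so the subset evolves with the model.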
Stars: 86
Forks: 4
Language: Python
License: —
Category: —
Last pushed: Dec 14, 2023
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/OFA-Sys/DiverseEvol"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
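The same endpoint can be called from Python. A minimal sketch, assuming the endpoint returns JSON (the response schema is not documented here, and `quality_url` / `fetch_quality` are hypothetical helper names):

```python
# Sketch of calling the quality-data endpoint from the curl example above.
# Assumes a JSON response; the schema is not documented on this page.
import json
from urllib.request import urlopen

API_BASE = "https://pt-edge.onrender.com/api/v1/quality"

def quality_url(ecosystem: str, owner: str, repo: str) -> str:
    # Hypothetical helper: builds the path shown in the curl example.
    return f"{API_BASE}/{ecosystem}/{owner}/{repo}"

def fetch_quality(ecosystem: str, owner: str, repo: str) -> dict:
    # Network call; subject to the 100 requests/day keyless rate limit.
    with urlopen(quality_url(ecosystem, owner, repo)) as resp:
        return json.load(resp)

url = quality_url("transformers", "OFA-Sys", "DiverseEvol")
```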
Higher-rated alternatives
DaoD/INTERS
This is the repository for our paper "INTERS: Unlocking the Power of Large Language Models in...
declare-lab/instruct-eval
This repository contains code to quantitatively evaluate instruction-tuned models such as Alpaca...
Haiyang-W/TokenFormer
[ICLR2025 Spotlight] Official Implementation of TokenFormer: Rethinking Transformer Scaling...
hkust-nlp/deita
Deita: Data-Efficient Instruction Tuning for Alignment [ICLR2024]
kehanlu/DeSTA2
Code and model for ICASSP 2025 Paper "Developing Instruction-Following Speech Language Model...