kaistAI/FLASK
[ICLR 2024 Spotlight] FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets
FLASK evaluates how well a large language model (LLM) performs by breaking its capabilities down into fine-grained alignment skills. You provide the model's raw text outputs, and FLASK scores them (using OpenAI's GPT-4 as the underlying evaluator) to show which skills, such as reasoning or summarization, the model excels at, broken down by domain and difficulty level. It is aimed at AI researchers, product managers, or anyone who needs a detailed picture of an LLM's strengths and weaknesses.
217 stars. No commits in the last 6 months.
Use this if you need to thoroughly assess and compare the performance of different large language models beyond simple accuracy scores, focusing on specific cognitive skills and domains.
Not ideal if you're looking for a quick, high-level evaluation or don't have access to OpenAI's GPT-4 API for the underlying scoring.
Stars: 217
Forks: 19
Language: Python
License: —
Category: llm-tools
Last pushed: Dec 24, 2023
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/kaistAI/FLASK"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
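For programmatic use, here is a minimal Python sketch of the same request. It assumes only what the curl command above shows: a GET against that URL returning JSON. The response schema isn't documented here, so the sketch pretty-prints whatever comes back; authenticating with a key for the higher rate limit is omitted because the expected header or parameter name isn't specified above.

import json
import requests

# Same endpoint as the curl command above (no key needed up to 100 requests/day).
url = "https://pt-edge.onrender.com/api/v1/quality/llm-tools/kaistAI/FLASK"

resp = requests.get(url, timeout=10)
resp.raise_for_status()  # surface HTTP errors (e.g. rate limiting) instead of printing garbage

# Assumption: the body is JSON; print it indented rather than guessing at its fields.
print(json.dumps(resp.json(), indent=2))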
Higher-rated alternatives
EvolvingLMMs-Lab/lmms-eval: One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks
vibrantlabsai/ragas: Supercharge Your LLM Application Evaluations 🚀
open-compass/VLMEvalKit: Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks
EuroEval/EuroEval: The robust European language model benchmark.
Giskard-AI/giskard-oss: 🐢 Open-Source Evaluation & Testing library for LLM Agents