kaistAI/FLASK
[ICLR 2024 Spotlight] FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets
FLASK evaluates how well a large language model (LLM) performs by breaking its capabilities down into fine-grained alignment skills. You provide the model's raw text outputs, and FLASK scores them (using OpenAI's GPT-4 as the underlying evaluator) to show which skills, such as reasoning or summarization, the model excels at, broken down by domain and difficulty level. It is aimed at AI researchers, product managers, or anyone who needs a detailed picture of an LLM's strengths and weaknesses.
217 stars. No commits in the last 6 months.
Use this if you need to thoroughly assess and compare the performance of different large language models beyond simple accuracy scores, focusing on specific cognitive skills and domains.
Not ideal if you're looking for a quick, high-level evaluation or don't have access to OpenAI's GPT-4 API for the underlying scoring.
Stars: 217
Forks: 19
Language: Python
License: —
Category: llm-tools
Last pushed: Dec 24, 2023
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/kaistAI/FLASK"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
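For programmatic use, here is a minimal Python sketch of the same request. It assumes only what the curl command above shows: a GET against that URL returning JSON. The response schema isn't documented here, so the sketch pretty-prints whatever comes back; authenticating with a key for the higher rate limit is omitted because the expected header or parameter name isn't specified above.

import json
import requests

# Same endpoint as the curl command above (no key needed up to 100 requests/day).
url = "https://pt-edge.onrender.com/api/v1/quality/llm-tools/kaistAI/FLASK"

resp = requests.get(url, timeout=10)
resp.raise_for_status()  # surface HTTP errors (e.g. rate limiting) instead of printing garbage

# Assumption: the body is JSON; print it indented rather than guessing at its fields.
print(json.dumps(resp.json(), indent=2))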
Higher-rated alternatives
EvolvingLMMs-Lab/lmms-eval: One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks
vibrantlabsai/ragas: Supercharge Your LLM Application Evaluations 🚀
open-compass/VLMEvalKit: Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks
EuroEval/EuroEval: The robust European language model benchmark.
Giskard-AI/giskard-oss: 🐢 Open-Source Evaluation & Testing library for LLM Agents