kaistAI/FLASK

[ICLR 2024 Spotlight] FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets

Quality score: 31/100 (Emerging)

This project helps evaluate how well a large language model (LLM) performs on various tasks by breaking down its capabilities into specific skills. You provide the LLM's raw text outputs, and it analyzes them to tell you which skills (like reasoning or summarization) the model excels at, across different topics and difficulty levels. This is for AI researchers, product managers, or anyone needing a detailed understanding of an LLM's strengths and weaknesses.

217 stars. No commits in the last 6 months.

Use this if you need to thoroughly assess and compare the performance of different large language models beyond simple accuracy scores, focusing on specific cognitive skills and domains.

Not ideal if you're looking for a quick, high-level evaluation or don't have access to OpenAI's GPT-4 API for the underlying scoring.

LLM-evaluation AI-model-assessment natural-language-processing model-benchmarking AI-performance-analysis
Badges: No License · Stale (6m) · No Package · No Dependents
Maintenance 0 / 25
Adoption 10 / 25
Maturity 8 / 25
Community 13 / 25


Stars: 217
Forks: 19
Language: Python
License: None
Last pushed: Dec 24, 2023
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/kaistAI/FLASK"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
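A minimal sketch of consuming this endpoint from Python. The JSON field names used below (`total`, `scores`) are assumptions about the response shape, not documented by the API, so adjust them to match what the endpoint actually returns:

```python
import json
from urllib.request import urlopen

API_URL = "https://pt-edge.onrender.com/api/v1/quality/llm-tools/kaistAI/FLASK"

def summarize(payload: dict) -> str:
    """Render a one-line score breakdown from a quality payload.

    The keys 'total' and 'scores' are hypothetical; they mirror the
    numbers shown on this page, not a documented schema.
    """
    scores = payload.get("scores", {})
    parts = [f"{name}: {value}/25" for name, value in scores.items()]
    return f"total {payload.get('total', '?')}/100 | " + " | ".join(parts)

# Hypothetical payload mirroring the figures on this page:
sample = {
    "total": 31,
    "scores": {"maintenance": 0, "adoption": 10, "maturity": 8, "community": 13},
}
print(summarize(sample))

# Live call (100 requests/day without a key):
# with urlopen(API_URL) as resp:
#     print(summarize(json.load(resp)))
```

The live request is left commented out so the snippet runs offline; swap in the real response once you have confirmed its field names.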