YJiangcm/FollowBench
[ACL 2024] FollowBench: A Multi-level Fine-grained Constraints Following Benchmark for Large Language Models
This project evaluates how well large language models (LLMs) follow complex instructions. Given an LLM and a set of instructions with progressively added constraints (content, style, format, etc.), it reports a detailed breakdown of how precisely the model satisfied each constraint and the instruction as a whole, in easy-to-read metrics. Developers and researchers building or integrating LLMs can use it to rigorously test a model's reliability (an illustrative sketch of this kind of layered constraint checking appears below).
119 stars. No commits in the last 6 months.
Use this if you need to systematically and precisely measure how well your large language model adheres to detailed, multi-level instructions and constraints.
Not ideal if you are a casual user of LLMs and simply want to generate creative text without needing to rigorously evaluate constraint adherence.
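The sketch below is purely illustrative and is not FollowBench's actual evaluation code: the constraint names, checks, and the "hardest level satisfied" metric are invented here only to show the general shape of multi-level constraint checking, where each level adds one more constraint that the response must also satisfy.

import re

def word_count_at_most(text: str, limit: int) -> bool:
    """Length constraint: the response has at most `limit` words."""
    return len(text.split()) <= limit

def contains_keyword(text: str, keyword: str) -> bool:
    """Content constraint: the response mentions `keyword`."""
    return keyword.lower() in text.lower()

def is_bulleted_list(text: str) -> bool:
    """Format constraint: every non-empty line starts with a bullet marker."""
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    return bool(lines) and all(re.match(r"^\s*[-*]\s+", ln) for ln in lines)

# Constraints ordered by level: level k requires the first k checks to hold.
CONSTRAINTS = [
    ("format: bulleted list", is_bulleted_list),
    ("content: mentions 'Python'", lambda t: contains_keyword(t, "Python")),
    ("length: at most 40 words", lambda t: word_count_at_most(t, 40)),
]

def score_response(response: str) -> dict:
    """Return per-constraint results and the hardest level fully satisfied."""
    per_constraint = {name: check(response) for name, check in CONSTRAINTS}
    hardest_level = 0
    for name, _ in CONSTRAINTS:
        if not per_constraint[name]:
            break
        hardest_level += 1
    return {"per_constraint": per_constraint, "hardest_level_satisfied": hardest_level}

if __name__ == "__main__":
    sample = "- FollowBench evaluates LLMs\n- It is written in Python\n- Constraints are layered"
    print(score_response(sample))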
Stars: 119
Forks: 19
Language: Python
License: Apache-2.0
Category:
Last pushed: Jun 12, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/YJiangcm/FollowBench"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
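The same request in Python, as a minimal sketch: only the endpoint is taken from the curl command above; the response field names used below ("stars", "forks", etc.) are assumptions about the schema, not documented behaviour.

import requests

# Endpoint copied from the curl example; everything else is an assumption.
URL = "https://pt-edge.onrender.com/api/v1/quality/transformers/YJiangcm/FollowBench"

resp = requests.get(URL, timeout=10)
resp.raise_for_status()
data = resp.json()

# Field names are guesses at the schema; fall back to 'n/a' if absent.
for key in ("stars", "forks", "language", "license", "last_pushed"):
    print(f"{key}: {data.get(key, 'n/a')}")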
Higher-rated alternatives
stanfordnlp/axbench
Stanford NLP Python library for benchmarking the utility of LLM interpretability methods
aidatatools/ollama-benchmark
LLM Benchmark for Throughput via Ollama (Local LLMs)
LarHope/ollama-benchmark
Ollama-based benchmark reporting detailed I/O tokens per second, written in Python with a DeepSeek R1 example.
qcri/LLMeBench
Benchmarking Large Language Models
THUDM/LongBench
LongBench v2 and LongBench (ACL 25'&24')