THUDM/AlignBench
A multi-dimensional Chinese alignment benchmark for large language models (ACL 2024)
AlignBench evaluates how well Chinese large language models align with human intent and instructions. You feed it a model's responses to a standardized set of Chinese user questions, and it produces a detailed, multi-dimensional score and analysis of that model's performance. It is aimed at researchers, developers, and product managers who need to assess and compare the alignment quality of Chinese large language models for real-world applications.
Use this if you need a comprehensive and reliable way to benchmark the 'human-likeness' and instruction-following ability of Chinese large language models.
Not ideal if you are looking to evaluate non-Chinese language models or are interested in metrics other than human alignment and instruction following.
Stars: 421
Forks: 29
Language: Python
License: —
Category: —
Last pushed: Oct 25, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/THUDM/AlignBench"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
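If you prefer to call the endpoint from code rather than curl, a minimal Python sketch using only the standard library is below. The URL comes from the snippet above; everything else (the `build_url` helper, the idea that the response body is JSON, and its field names) is an assumption, since the response schema is not documented here.

```python
import json
import urllib.request

# Base endpoint taken from the curl example above.
BASE = "https://pt-edge.onrender.com/api/v1/quality/llm-tools"


def build_url(owner: str, repo: str) -> str:
    """Build the quality-API URL for a given GitHub owner/repo pair.

    Hypothetical helper: the API is shown above for THUDM/AlignBench;
    this assumes other repos follow the same /<owner>/<repo> pattern.
    """
    return f"{BASE}/{owner}/{repo}"


def fetch_quality(owner: str, repo: str) -> dict:
    """Fetch the repo's quality data.

    No key needed for up to 100 requests/day (per the note above).
    Assumes the endpoint returns a JSON object.
    """
    with urllib.request.urlopen(build_url(owner, repo)) as resp:
        return json.load(resp)


# Usage (performs a network request, so not run here):
#   data = fetch_quality("THUDM", "AlignBench")
#   print(json.dumps(data, indent=2))
```

With a free API key (1,000 requests/day), you would presumably attach it to the request; the exact header or query parameter is not documented on this page, so it is omitted here.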
Higher-rated alternatives
sierra-research/tau2-bench
τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment
xlang-ai/OSWorld
[NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
bigcode-project/bigcodebench
[ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI
THUDM/AgentBench
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
scicode-bench/SciCode
A benchmark that challenges language models to code solutions for scientific problems