THUDM/AlignBench
A multi-dimensional Chinese alignment benchmark for large language models (ACL 2024)
AlignBench evaluates how well Chinese large language models align with human intent and instructions. You feed it a model's responses to a standardized set of Chinese user questions, and it produces a detailed, multi-dimensional score and analysis of that model's performance. It is aimed at researchers, developers, and product managers who need to assess and compare the alignment quality of Chinese large language models for real-world applications.
Use this if you need a comprehensive and reliable way to benchmark the 'human-likeness' and instruction-following ability of Chinese large language models.
Not ideal if you are looking to evaluate non-Chinese language models or are interested in metrics other than human alignment and instruction following.
Stars: 421
Forks: 29
Language: Python
License: —
Category: —
Last pushed: Oct 25, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/THUDM/AlignBench"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
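If you prefer to call the endpoint from code rather than curl, a minimal Python sketch using only the standard library is below. The URL comes from the snippet above; everything else (the `build_url` helper, the idea that the response body is JSON, and its field names) is an assumption, since the response schema is not documented here.

```python
import json
import urllib.request

# Base endpoint taken from the curl example above.
BASE = "https://pt-edge.onrender.com/api/v1/quality/llm-tools"


def build_url(owner: str, repo: str) -> str:
    """Build the quality-API URL for a given GitHub owner/repo pair.

    Hypothetical helper: the API is shown above for THUDM/AlignBench;
    this assumes other repos follow the same /<owner>/<repo> pattern.
    """
    return f"{BASE}/{owner}/{repo}"


def fetch_quality(owner: str, repo: str) -> dict:
    """Fetch the repo's quality data.

    No key needed for up to 100 requests/day (per the note above).
    Assumes the endpoint returns a JSON object.
    """
    with urllib.request.urlopen(build_url(owner, repo)) as resp:
        return json.load(resp)


# Usage (performs a network request, so not run here):
#   data = fetch_quality("THUDM", "AlignBench")
#   print(json.dumps(data, indent=2))
```

With a free API key (1,000 requests/day), you would presumably attach it to the request; the exact header or query parameter is not documented on this page, so it is omitted here.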
Higher-rated alternatives
sierra-research/tau2-bench
τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment
xlang-ai/OSWorld
[NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
bigcode-project/bigcodebench
[ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI
THUDM/AgentBench
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
scicode-bench/SciCode
A benchmark that challenges language models to code solutions for scientific problems