IAAR-Shanghai/GuessArena
[ACL 2025] GuessArena: Guess Who I Am? A Self-Adaptive Framework for Evaluating LLMs in Domain-Specific Knowledge and Reasoning
This tool helps researchers and product managers evaluate how well large language models (LLMs) understand a specific industry, such as finance, healthcare, or education, and how effectively they reason within that domain. You supply unstructured documents from your chosen field, and the tool produces detailed reports on an LLM's domain knowledge and reasoning ability. It is intended for anyone assessing or selecting LLMs for specialized business applications.
Use this if you need to measure an LLM's grasp of domain-specific information and its ability to reason within complex, real-world industry contexts.
Not ideal if you're looking for general-purpose LLM evaluations that don't require deep dives into specialized knowledge or complex reasoning within a particular industry.
Stars: 9
Forks: 1
Language: Python
License: Apache-2.0
Category:
Last pushed: Nov 15, 2025
Commits (last 30 days): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/IAAR-Shanghai/GuessArena"
Open to everyone: 100 requests/day with no key required. A free key raises the limit to 1,000 requests/day.
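For programmatic use, here is a minimal Python sketch of the same request. It assumes only what the curl example shows: a plain GET against the public endpoint. That the response is JSON, and what fields it contains, are assumptions, so the sketch just prints whatever comes back; how an API key would be attached is not documented here, so authentication is omitted.

import requests

# Public endpoint from the curl example above
# (no key needed, up to 100 requests/day).
URL = "https://pt-edge.onrender.com/api/v1/quality/llm-tools/IAAR-Shanghai/GuessArena"

resp = requests.get(URL, timeout=10)
resp.raise_for_status()  # surface rate limiting or server errors

# Assumption: the endpoint returns JSON; the schema is not documented here.
data = resp.json()
print(data)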
Higher-rated alternatives
EvolvingLMMs-Lab/lmms-eval: One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks
vibrantlabsai/ragas: Supercharge Your LLM Application Evaluations 🚀
open-compass/VLMEvalKit: Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks
EuroEval/EuroEval: The robust European language model benchmark.
Giskard-AI/giskard-oss: 🐢 Open-Source Evaluation & Testing library for LLM Agents