IAAR-Shanghai/xFinder
[ICLR 2025] xFinder: Large Language Models as Automated Evaluators for Reliable Evaluation
xFinder helps developers and designers of benchmarks for Large Language Models (LLMs) accurately extract key answers from model responses. Given an LLM's output and the benchmark question, it returns the precise extracted answer, replacing less reliable methods such as regular expressions (see the sketch below). The result is more trustworthy and meaningful comparisons across LLMs.
180 stars. Available on PyPI.
Use this if you need to reliably extract specific answers from LLM outputs when evaluating performance against a benchmark.
Not ideal if your primary goal is generating text with an LLM or fine-tuning one for creative tasks rather than evaluating its answer accuracy.
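To see why regex extraction is brittle, here is a minimal Python sketch of the baseline xFinder replaces. The pattern and sample responses are hypothetical illustrations, not taken from the project.

import re

# A typical hand-written extraction pattern: look for "answer is X".
ANSWER_RE = re.compile(r"[Aa]nswer is:?\s*\(?([A-D])\)?")

responses = [
    "The answer is (B).",                    # extracted: B
    "B. The capital of France is Paris.",    # fails: answer stated first
    "So the answer is either B or D... B.",  # fails: hedged phrasing
]

for r in responses:
    m = ANSWER_RE.search(r)
    print(m.group(1) if m else "<extraction failed>")

Only the first response matches; the other two are correct answers that the pattern cannot recover. Per the title above, xFinder swaps this kind of pattern matching for an LLM-based extractor, so phrasing variations like these no longer break evaluation.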
Stars: 180
Forks: 7
Language: Python
License: —
Category: —
Last pushed: Nov 14, 2025
Commits (30d): 0
Dependencies: 7
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/IAAR-Shanghai/xFinder"
Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000 requests/day.
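The same lookup works from Python with requests. The endpoint URL is the one shown above; the response schema isn't documented here, so this just pretty-prints whatever JSON comes back.

import json
import requests

# Public endpoint from the curl example above; no key needed up to 100 requests/day.
url = "https://pt-edge.onrender.com/api/v1/quality/llm-tools/IAAR-Shanghai/xFinder"
resp = requests.get(url, timeout=10)
resp.raise_for_status()
print(json.dumps(resp.json(), indent=2))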
Higher-rated alternatives
EvolvingLMMs-Lab/lmms-eval
One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks
vibrantlabsai/ragas
Supercharge Your LLM Application Evaluations 🚀
open-compass/VLMEvalKit
Open-source evaluation toolkit of large multi-modality models (LMMs), support 220+ LMMs, 80+ benchmarks
EuroEval/EuroEval
The robust European language model benchmark.
Giskard-AI/giskard-oss
🐢 Open-Source Evaluation & Testing library for LLM Agents