IAAR-Shanghai/xFinder
[ICLR 2025] xFinder: Large Language Models as Automated Evaluators for Reliable Evaluation
xFinder helps developers and designers of benchmarks for Large Language Models (LLMs) accurately extract key answers from model responses. Given an LLM's output and the benchmark question, it returns the precise extracted answer, replacing less reliable methods such as regular expressions (see the sketch below). The result is more trustworthy and meaningful comparisons across LLMs.
180 stars. Available on PyPI.
Use this if you need to reliably extract specific answers from LLM outputs when evaluating performance against a benchmark.
Not ideal if your primary goal is generating text with an LLM or fine-tuning one for creative tasks rather than evaluating its answer accuracy.
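To see why regex extraction is brittle, here is a minimal Python sketch of the baseline xFinder replaces. The pattern and sample responses are hypothetical illustrations, not taken from the project.

import re

# A typical hand-written extraction pattern: look for "answer is X".
ANSWER_RE = re.compile(r"[Aa]nswer is:?\s*\(?([A-D])\)?")

responses = [
    "The answer is (B).",                    # extracted: B
    "B. The capital of France is Paris.",    # fails: answer stated first
    "So the answer is either B or D... B.",  # fails: hedged phrasing
]

for r in responses:
    m = ANSWER_RE.search(r)
    print(m.group(1) if m else "<extraction failed>")

Only the first response matches; the other two are correct answers that the pattern cannot recover. Per the title above, xFinder swaps this kind of pattern matching for an LLM-based extractor, so phrasing variations like these no longer break evaluation.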
Stars: 180
Forks: 7
Language: Python
License: —
Category: —
Last pushed: Nov 14, 2025
Commits (30d): 0
Dependencies: 7
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/IAAR-Shanghai/xFinder"
Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000 requests/day.
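The same lookup works from Python with requests. The endpoint URL is the one shown above; the response schema isn't documented here, so this just pretty-prints whatever JSON comes back.

import json
import requests

# Public endpoint from the curl example above; no key needed up to 100 requests/day.
url = "https://pt-edge.onrender.com/api/v1/quality/llm-tools/IAAR-Shanghai/xFinder"
resp = requests.get(url, timeout=10)
resp.raise_for_status()
print(json.dumps(resp.json(), indent=2))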
Higher-rated alternatives
EvolvingLMMs-Lab/lmms-eval
One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks
vibrantlabsai/ragas
Supercharge Your LLM Application Evaluations 🚀
open-compass/VLMEvalKit
Open-source evaluation toolkit of large multi-modality models (LMMs), support 220+ LMMs, 80+ benchmarks
EuroEval/EuroEval
The robust European language model benchmark.
Giskard-AI/giskard-oss
🐢 Open-Source Evaluation & Testing library for LLM Agents