IAAR-Shanghai/xFinder

[ICLR 2025] xFinder: Large Language Models as Automated Evaluators for Reliable Evaluation

Quality score: 49 / 100 (Emerging)

xFinder helps developers and designers of benchmarks for Large Language Models (LLMs) accurately extract key answers from model responses. It takes an LLM's output and the benchmark question as input and returns the precisely extracted answer, replacing less reliable methods such as regular-expression matching. This makes comparisons of different LLMs' performance more trustworthy and meaningful. A minimal illustration of the failure mode it addresses is shown below.
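A minimal sketch of the contrast: a regular expression only catches answers phrased one way, whereas xFinder is meant to extract the key answer from free-form responses. The `extract_key_answer` import and call shape shown in the trailing comments are hypothetical, not the package's documented API.

```python
import re

# Brittle baseline: a regular expression that tries to pull the chosen option
# out of a free-form model response. It misses answers phrased differently
# ("I'd go with option B here"), which is the failure mode xFinder targets.
def regex_extract_option(response: str) -> str | None:
    match = re.search(r"answer is\s*\(?([A-D])\)?", response, re.IGNORECASE)
    return match.group(1).upper() if match else None

print(regex_extract_option("The answer is (B)."))          # "B"
print(regex_extract_option("I'd go with option B here."))  # None -- regex misses it

# Hypothetical xFinder-style call (illustrative only; the real package's API
# may differ): pass the benchmark question and the raw model response, get
# back the extracted key answer.
# from xfinder import extract_key_answer  # hypothetical import
# key_answer = extract_key_answer(
#     question="Which planet is largest? A. Earth B. Jupiter C. Mars D. Venus",
#     llm_output="I'd go with option B here.",
# )
```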

180 stars. Available on PyPI.

Use this if you need to reliably and accurately extract specific answers from Large Language Model outputs when evaluating their performance against a benchmark.

Not ideal if your primary goal is to generate text with an LLM or fine-tune an LLM for specific creative tasks rather than evaluate its factual answer accuracy.

Tags: LLM evaluation, benchmark development, AI model assessment, natural language processing, model comparison
Maintenance 6 / 25
Adoption 10 / 25
Maturity 25 / 25
Community 8 / 25

Stars: 180
Forks: 7
Language: Python
License: (not listed)
Last pushed: Nov 14, 2025
Commits (30d): 0
Dependencies: 7

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/IAAR-Shanghai/xFinder"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
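The same endpoint can also be queried from Python. A minimal sketch using the requests library; the response is assumed to be JSON, and its field names are not documented here.

```python
import requests

# Fetch the quality data for this repository from the public endpoint above.
# No API key is needed for the free tier (100 requests/day).
url = "https://pt-edge.onrender.com/api/v1/quality/llm-tools/IAAR-Shanghai/xFinder"
response = requests.get(url, timeout=10)
response.raise_for_status()

data = response.json()  # assumes a JSON response; exact schema is not documented here
print(data)
```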