zli12321/qa_metrics
An easy-to-use Python package for running quick, basic QA evaluations. It includes standardized QA evaluation metrics and semantic evaluation metrics: black-box and open-source large language model prompting and evaluation, exact match, F1 score, PEDANT semantic match, and transformer match. The package also supports prompting the OpenAI and Anthropic APIs.
It evaluates how well a question-answering system or large language model generates answers: you provide the questions, the reference answers, and the system's generated answers, and it outputs scores indicating the quality and accuracy of the responses. It is aimed at anyone who needs to assess the performance of question-answering AI models, such as an AI product manager, researcher, or quality assurance specialist.
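A minimal sketch of scoring a candidate answer with the package's exact match, F1, and PEDANT matchers, following the interfaces shown in the project's README (treat the module paths and signatures as assumptions if your installed version differs):

# pip install qa-metrics
from qa_metrics.em import em_match        # exact match
from qa_metrics.f1 import f1_match        # token-level F1 with a threshold
from qa_metrics.pedant import PEDANT      # PEDANT semantic matcher

question = "Who wrote 'Pride and Prejudice'?"
reference_answers = ["Jane Austen"]
candidate_answer = "The novel was written by Jane Austen."

# Exact match: True only if the candidate matches a reference string.
print(em_match(reference_answers, candidate_answer))

# F1 match: token-overlap F1 against the references, thresholded at 0.5.
print(f1_match(reference_answers, candidate_answer, threshold=0.5))

# PEDANT: semantic match that also conditions on the question.
pedant = PEDANT()
print(pedant.evaluate(reference_answers, candidate_answer, question))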
No commits in the last 6 months. Available on PyPI.
Use this if you need to quickly and comprehensively assess the quality of answers produced by various question-answering systems, from short factual answers to longer explanations.
Not ideal if you are looking for a tool to generate questions or answers rather than evaluate them, or if you don't have reference answers to compare against.
Stars: 61
Forks: 6
Language: Python
License: MIT
Category:
Last pushed: Jul 18, 2025
Commits (30d): 0
Dependencies: 6
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/zli12321/qa_metrics"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
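For programmatic access from Python, a minimal sketch using the requests library (the endpoint is taken from the curl example above; the response schema is not documented here, so the sketch just prints the returned JSON):

import requests

# Public endpoint from the curl example; 100 requests/day without a key.
url = "https://pt-edge.onrender.com/api/v1/quality/llm-tools/zli12321/qa_metrics"

response = requests.get(url, timeout=30)
response.raise_for_status()

# Inspect whatever JSON the endpoint returns.
print(response.json())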
Higher-rated alternatives
EvolvingLMMs-Lab/lmms-eval: One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks
vibrantlabsai/ragas: Supercharge Your LLM Application Evaluations 🚀
open-compass/VLMEvalKit: Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks
EuroEval/EuroEval: The robust European language model benchmark.
Giskard-AI/giskard-oss: 🐢 Open-Source Evaluation & Testing library for LLM Agents