zli12321/qa_metrics

An easy-to-use Python package for quick, basic QA evaluations. It includes standardized QA evaluation metrics and semantic evaluation metrics: black-box and open-source large language model prompting and evaluation, exact match, F1 score, PEDANT semantic match, and transformer match. The package also supports prompting the OpenAI and Anthropic APIs.
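
As a rough sketch of how the lexical metrics are typically called (import paths, function names, and arguments follow the project's README and may differ between versions, so verify against the package documentation):

from qa_metrics.em import em_match
from qa_metrics.f1 import f1_match, f1_score_with_precision_recall

reference_answers = ["The Frog Prince", "The Princess and the Frog"]
candidate_answer = "The movie 'The Princess and the Frog' is loosely based on the Brothers Grimm's 'Iron Henry'"

# Exact match: True if the candidate matches any reference answer after normalization.
print("Exact Match:", em_match(reference_answers, candidate_answer))

# Token-level F1 with precision and recall, plus a thresholded match decision.
print("F1 stats:", f1_score_with_precision_recall(reference_answers[0], candidate_answer))
print("F1 Match:", f1_match(reference_answers, candidate_answer, threshold=0.5))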

Score: 46 / 100 (Emerging)

This package helps evaluate how well a question-answering system or large language model answers questions. You provide the questions, the correct answers, and the system's generated answers, and it outputs scores indicating the quality and accuracy of those responses. It is intended for anyone who needs to assess the performance of question-answering AI models, such as an AI product manager, researcher, or quality assurance specialist.
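
A minimal sketch of that workflow using the PEDANT semantic matcher (class and method names are taken from the project's README and may differ between versions):

from qa_metrics.pedant import PEDANT

question = "Which movie is loosely based on the Brothers Grimm's 'Iron Henry'?"
reference_answers = ["The Frog Prince", "The Princess and the Frog"]
candidate_answer = "The movie 'The Princess and the Frog' is loosely based on the Brothers Grimm's 'Iron Henry'"

pedant = PEDANT()
# Per-reference semantic match scores for the candidate answer.
print(pedant.get_scores(reference_answers, candidate_answer, question))
# Boolean judgment: is the candidate a correct answer to the question?
print(pedant.evaluate(reference_answers, candidate_answer, question))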

No commits in the last 6 months. Available on PyPI.

Use this if you need to quickly and comprehensively assess the quality of answers produced by various question-answering systems, from short facts to longer explanations.

Not ideal if you are looking for a tool to generate questions or answers rather than evaluate them, or if you don't have existing correct answers to compare against.

AI model evaluation · natural language processing · conversational AI · information retrieval · text generation
Stale 6m
Maintenance 2 / 25
Adoption 8 / 25
Maturity 25 / 25
Community 11 / 25

Stars: 61
Forks: 6
Language: Python
License: MIT
Last pushed: Jul 18, 2025
Commits (30d): 0
Dependencies: 6

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/zli12321/qa_metrics"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
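
For scripted access, a minimal Python sketch using the requests library could look like this (the response is assumed to be JSON; its exact schema and any authentication header for keyed access are not documented here):

import requests

# Endpoint shown in the curl example above.
url = "https://pt-edge.onrender.com/api/v1/quality/llm-tools/zli12321/qa_metrics"

response = requests.get(url, timeout=10)
response.raise_for_status()

data = response.json()  # assumed JSON payload with the quality and repo stats shown on this page
print(data)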