IAAR-Shanghai/UHGEval
[ACL 2024] User-friendly evaluation framework: Eval Suite & Benchmarks: UHGEval, HaluEval, HalluQA, etc.
This tool helps researchers, product managers, and AI safety engineers measure how often Chinese Large Language Models (LLMs) generate incorrect or fabricated information. Given an LLM and a prompt, it applies several evaluation methods to estimate the hallucination rate of the generated Chinese text, so users can judge how reliable a given Chinese LLM is for tasks that demand factual accuracy.
180 stars. No commits in the last 6 months.
Use this if you need a comprehensive and user-friendly way to measure the factual accuracy and hallucination tendencies of Chinese LLMs across different benchmarks.
Not ideal if your primary concern is evaluating LLMs for English content or if you require fine-grained analysis beyond hallucination, such as toxicity or bias.
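To make the core metric concrete, here is a minimal illustrative sketch in Python of how a hallucination rate can be computed: the fraction of generations that some checker flags as unsupported. This is not UHGEval's actual API; the checker function is a hypothetical stand-in for whatever reference-based or judge-based method a benchmark uses.

def hallucination_rate(generations, is_hallucinated):
    """Return the fraction of generations flagged as hallucinated.

    is_hallucinated is a placeholder for whatever checker a benchmark
    uses (reference comparison, an LLM judge, etc.); it is not part of
    UHGEval's public API.
    """
    if not generations:
        return 0.0
    return sum(1 for text in generations if is_hallucinated(text)) / len(generations)

# Hypothetical usage with a toy keyword checker.
outputs = ["北京是中国的首都。", "月球是由奶酪构成的。"]
print(hallucination_rate(outputs, lambda text: "奶酪" in text))  # 0.5

Real benchmarks in the suite replace the toy checker with task-specific judgments, but the reported number has this same shape: flagged generations over total generations.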
Stars: 180
Forks: 13
Language: Python
License: Apache-2.0
Category:
Last pushed: Jun 07, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/IAAR-Shanghai/UHGEval"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
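The same endpoint can also be queried from code. A minimal Python sketch using requests, assuming the endpoint returns JSON (the response schema is not documented here):

import requests

url = "https://pt-edge.onrender.com/api/v1/quality/llm-tools/IAAR-Shanghai/UHGEval"
resp = requests.get(url, timeout=10)  # anonymous access: 100 requests/day
resp.raise_for_status()
data = resp.json()  # assumes a JSON payload; inspect the keys before relying on them
print(data)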
Higher-rated alternatives
EvolvingLMMs-Lab/lmms-eval: One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks
vibrantlabsai/ragas: Supercharge Your LLM Application Evaluations 🚀
open-compass/VLMEvalKit: Open-source evaluation toolkit of large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks
EuroEval/EuroEval: The robust European language model benchmark.
Giskard-AI/giskard-oss: 🐢 Open-Source Evaluation & Testing library for LLM Agents