claws-lab/XLingEval
Code and Resources for the paper, "Better to Ask in English: Cross-Lingual Evaluation of Large Language Models for Healthcare Queries"
This project helps healthcare providers, researchers, and anyone else evaluating large language models (LLMs) understand how well these models answer medical questions across languages. It takes healthcare questions in English, Spanish, Chinese, and Hindi, together with reference answers, and measures the correctness, consistency, and verifiability of LLM responses. Use it to objectively compare LLM performance on health-related queries across a global, multilingual user base; a conceptual sketch of the consistency idea follows the usage notes below.
No commits in the last 6 months.
Use this if you need to rigorously test how accurately and reliably LLMs provide medical information in non-English languages.
Not ideal if you are looking for a tool to develop or fine-tune an LLM, rather than evaluate an existing one.
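To make the consistency criterion concrete, here is a purely illustrative sketch, not code from this repository: ask_llm is a hypothetical stub, and raw string overlap stands in for whatever similarity measure the evaluation actually uses.

# Illustrative only; NOT code from XLingEval. `ask_llm` is a placeholder.
from difflib import SequenceMatcher
from itertools import combinations

def ask_llm(question: str, lang: str) -> str:
    # Hypothetical stand-in for a real LLM API call; a real evaluation
    # would query a model with the question translated into `lang`.
    return f"answer to {question!r} in {lang}"

def pairwise_consistency(answers: list[str]) -> float:
    # Mean pairwise string similarity across the answers. A real setup
    # would use semantic similarity rather than raw string overlap.
    pairs = list(combinations(answers, 2))
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)

question = "Is ibuprofen safe to take during pregnancy?"
langs = ["en", "es", "zh", "hi"]  # the four languages the paper covers
answers = [ask_llm(question, lang) for lang in langs]
print(f"consistency: {pairwise_consistency(answers):.2f}")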
Stars: 19
Forks: 3
Language: Python
License: Apache-2.0
Category: NLP
Last pushed: Apr 01, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/claws-lab/XLingEval"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
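For scripted access, a minimal Python equivalent of the curl call above, assuming the endpoint returns JSON (the response schema is not documented in this listing):

import json
import requests

URL = "https://pt-edge.onrender.com/api/v1/quality/nlp/claws-lab/XLingEval"

resp = requests.get(URL, timeout=10)
resp.raise_for_status()  # raises on 4xx/5xx, e.g. if the daily quota is exhausted
print(json.dumps(resp.json(), indent=2))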
Higher-rated alternatives
FudanSELab/ClassEval
Benchmark ClassEval for class-level code generation.
microsoft/NeMoEval
A Benchmark Tool for Natural Language-based Network Management
apartresearch/specificityplus
👩‍💻 Code for the ACL paper "Detecting Edit Failures in LLMs: An Improved Specificity Benchmark"
HICAI-ZJU/SciKnowEval
SciKnowEval: Evaluating Multi-level Scientific Knowledge of Large Language Models
nicolay-r/RuSentRel-Leaderboard
This is an official Leaderboard for the RuSentRel-1.1 dataset originally described in paper...