ai-twinkle/Eval
Twinkle Eval: an efficient and accurate AI evaluation tool
This tool helps AI practitioners and researchers objectively compare and analyze the performance of different Large Language Models (LLMs). You provide your LLM API details and datasets (CSV, JSON, etc., containing questions and answers), and it generates comprehensive reports on model accuracy, stability, and inference speed. It's designed for anyone who needs to rigorously evaluate LLMs on benchmarks such as MMLU or TMMLU+.
No commits in the last 6 months.
Use this if you need an efficient and accurate way to benchmark various Large Language Models (LLMs) against specific datasets to understand their performance and stability.
Not ideal if you only need to run a single, quick test on an LLM without deep analysis or comparative benchmarking.
Stars: 89
Forks: 16
Language: Python
License: MIT
Category: LLM tools
Last pushed: Aug 14, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/ai-twinkle/Eval"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
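For scripted access, the same endpoint can be queried from Python instead of curl. The sketch below is a minimal example: the endpoint URL comes from the curl command above, while the JSON response shape and the "X-API-Key" header name for the optional key are assumptions, not details documented on this page.

import requests

ENDPOINT = "https://pt-edge.onrender.com/api/v1/quality/llm-tools/ai-twinkle/Eval"

def fetch_tool_data(api_key: str | None = None) -> dict:
    # Keyless access is limited to 100 requests/day; a free key raises this to 1,000/day.
    # The "X-API-Key" header name is an assumption; confirm the real header in the API docs.
    headers = {"X-API-Key": api_key} if api_key else {}
    resp = requests.get(ENDPOINT, headers=headers, timeout=10)
    resp.raise_for_status()
    return resp.json()  # assumes the endpoint returns a JSON body

if __name__ == "__main__":
    print(fetch_tool_data())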
Higher-rated alternatives
EvolvingLMMs-Lab/lmms-eval
One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks
vibrantlabsai/ragas
Supercharge Your LLM Application Evaluations 🚀
open-compass/VLMEvalKit
Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks
EuroEval/EuroEval
The robust European language model benchmark.
Giskard-AI/giskard-oss
🐢 Open-Source Evaluation & Testing library for LLM Agents