ai-twinkle/Eval
Twinkle Eval: an efficient and accurate AI evaluation tool
This tool helps AI practitioners and researchers objectively compare and analyze the performance of different Large Language Models (LLMs). You provide your LLM API details and datasets (CSV, JSON, etc., containing questions and answers), and it generates comprehensive reports on model accuracy, stability, and inference speed. It's designed for anyone who needs to rigorously evaluate LLMs on benchmarks such as MMLU or TMMLU+.
No commits in the last 6 months.
Use this if you need an efficient and accurate way to benchmark various Large Language Models (LLMs) against specific datasets to understand their performance and stability.
Not ideal if you only need to run a single, quick test on an LLM without deep analysis or comparative benchmarking.
Stars: 89
Forks: 16
Language: Python
License: MIT
Category: LLM tools
Last pushed: Aug 14, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/ai-twinkle/Eval"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
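For scripted access, the same endpoint can be queried from Python instead of curl. The sketch below is a minimal example: the endpoint URL comes from the curl command above, while the JSON response shape and the "X-API-Key" header name for the optional key are assumptions, not details documented on this page.

import requests

ENDPOINT = "https://pt-edge.onrender.com/api/v1/quality/llm-tools/ai-twinkle/Eval"

def fetch_tool_data(api_key: str | None = None) -> dict:
    # Keyless access is limited to 100 requests/day; a free key raises this to 1,000/day.
    # The "X-API-Key" header name is an assumption; confirm the real header in the API docs.
    headers = {"X-API-Key": api_key} if api_key else {}
    resp = requests.get(ENDPOINT, headers=headers, timeout=10)
    resp.raise_for_status()
    return resp.json()  # assumes the endpoint returns a JSON body

if __name__ == "__main__":
    print(fetch_tool_data())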
Higher-rated alternatives
EvolvingLMMs-Lab/lmms-eval
One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks
vibrantlabsai/ragas
Supercharge Your LLM Application Evaluations 🚀
open-compass/VLMEvalKit
Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks
EuroEval/EuroEval
The robust European language model benchmark.
Giskard-AI/giskard-oss
🐢 Open-Source Evaluation & Testing library for LLM Agents