InternScience/SciEvalKit

A unified evaluation toolkit and leaderboard for rigorously assessing the scientific intelligence of large language and vision–language models across the full research workflow.

Score: 46 / 100 (Emerging)

This toolkit helps AI researchers and developers measure how well large language and vision-language models perform on complex scientific tasks, rather than just general conversation. It takes a model and a set of scientific challenges (such as image interpretation, symbolic reasoning, or code generation) and outputs a detailed score that breaks the model's scientific ability down by research workflow stage. Scientists, engineers, and AI developers building or using such models will find it useful for rigorous evaluation.

Use this if you need to rigorously evaluate the scientific intelligence of large language or vision-language models across the entire research workflow, rather than relying on general-purpose benchmarks.

Not ideal if you are looking for a simple, quick way to test a model's basic conversational or broad-domain reasoning abilities.
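The description above amounts to a benchmark loop: run a model over a set of tasks, grade each answer, and aggregate scores per workflow stage. The sketch below illustrates that general shape only; the names (model, TASKS, grade) are hypothetical stand-ins and not the actual SciEvalKit API.

# Hypothetical sketch of a per-stage evaluation loop; NOT the SciEvalKit API.
from collections import defaultdict

# Toy task set: each task has a workflow stage, a prompt, and a reference answer.
TASKS = [
    {"stage": "symbolic reasoning", "prompt": "Differentiate x**2", "reference": "2*x"},
    {"stage": "code generation", "prompt": "Factorial of 5 as a number", "reference": "120"},
]

def model(prompt: str) -> str:
    """Placeholder model call; in practice this would query an LLM or VLM."""
    return "2*x" if "Differentiate" in prompt else "120"

def grade(prediction: str, reference: str) -> float:
    """Naive exact-match grader; real evaluations use task-specific metrics."""
    return 1.0 if prediction.strip() == reference else 0.0

# Aggregate scores per research-workflow stage.
per_stage = defaultdict(list)
for task in TASKS:
    per_stage[task["stage"]].append(grade(model(task["prompt"]), task["reference"]))

for stage, scores in per_stage.items():
    print(f"{stage}: {sum(scores) / len(scores):.2f}")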

AI-model-evaluation scientific-AI research-workflow-automation multimodal-AI scientific-computing
No package published · No dependents

Maintenance: 10 / 25
Adoption: 9 / 25
Maturity: 13 / 25
Community: 14 / 25

How are scores calculated?
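The breakdown above suggests the headline score is simply the sum of the four pillar scores: 10 + 9 + 13 + 14 = 46 out of a possible 4 × 25 = 100, matching the 46 / 100 shown at the top.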

Stars: 74
Forks: 10
Language: Python
License: Apache-2.0
Last pushed: Feb 27, 2026
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/InternScience/SciEvalKit"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
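The same endpoint can also be queried from Python. The sketch below assumes only what the curl example shows (a GET request returning JSON over the unauthenticated tier); the exact shape of the response payload is not documented here, so it is simply printed.

import requests

URL = "https://pt-edge.onrender.com/api/v1/quality/llm-tools/InternScience/SciEvalKit"

resp = requests.get(URL, timeout=30)
resp.raise_for_status()  # surfaces rate-limit or other HTTP errors
data = resp.json()       # response fields are not documented here; inspect the payload
print(data)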