FudanDISC/ReForm-Eval
A benchmark for evaluating the capabilities of large vision-language models (LVLMs)
This project helps AI researchers and developers evaluate how well large vision-language models (LVLMs) understand and reason about images and text. It converts existing multimodal benchmark datasets into a standardized format (multiple-choice or text-generation problems) and produces a detailed quantitative analysis of an LVLM's performance across a wide range of visual and reasoning tasks, helping developers identify strengths and weaknesses.
No commits in the last 6 months.
Use this if you are developing large vision-language models and need a comprehensive, quantitative, and standardized way to benchmark their capabilities across various visual and language understanding tasks.
Not ideal if you are an end-user looking for an application built on an LVLM rather than a developer who needs to evaluate the models themselves.
Stars
46
Forks
3
Language
Python
License
Apache-2.0
Category
Last pushed
Nov 17, 2023
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/FudanDISC/ReForm-Eval"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
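For scripted access, the same endpoint can be queried from Python instead of curl. The following is a minimal sketch using the requests library; it assumes the endpoint returns a JSON body (the exact response fields are not documented here), and it makes the unauthenticated call shown above, which falls under the 100 requests/day free tier.

import requests

# Quality-data endpoint for this repository (no API key needed
# within the free tier noted above).
URL = "https://pt-edge.onrender.com/api/v1/quality/transformers/FudanDISC/ReForm-Eval"

resp = requests.get(URL, timeout=10)
resp.raise_for_status()  # fail loudly on HTTP errors or rate limiting

data = resp.json()  # assumption: the API returns JSON
print(data)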
Higher-rated alternatives
HowieHwong/TrustLLM
[ICML 2024] TrustLLM: Trustworthiness in Large Language Models
Intelligent-CAT-Lab/PLTranslationEmpirical
Artifact repository for the paper "Lost in Translation: A Study of Bugs Introduced by Large...
rishub-tamirisa/tamper-resistance
[ICLR 2025] Official Repository for "Tamper-Resistant Safeguards for Open-Weight LLMs"
tsinghua-fib-lab/ANeurIPS2024_SPV-MIA
[NeurIPS'24] "Membership Inference Attacks against Fine-tuned Large Language Models via...
codessian/epistemic-confidence-layer
Model-agnostic trust protocol for calibrated, auditable AI