lechmazur/sycophancy
LLM benchmark and leaderboard for narrator-bias sycophancy, opposite-narrator contradictions, and judgment consistency.
This benchmark evaluates how consistently large language models (LLMs) judge the same scenario when it is presented from opposing first-person perspectives. Each dispute case is shown from a neutral view and from multiple first-person views, some with emotional framing. The output is a leaderboard ranking models by their tendency to agree with both sides of a dispute (or to reject both), helping practitioners gauge a model's narrator bias and judgment stability. It is aimed at AI researchers, product managers, and evaluators who work with LLMs.
Use this if you need to assess the fairness and consistency of an LLM's judgments, especially when the model is exposed to biased or emotionally charged narratives.
Not ideal if you are looking for a general-purpose LLM performance benchmark or if your primary concern is factual accuracy rather than narrator-bias sycophancy.
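To make the scoring idea concrete, here is a minimal Python sketch of an opposite-narrator contradiction metric. It assumes each dispute yields one verdict per narration ("A", "B", or "neither"); the data structure and function names are hypothetical illustrations, not code from the repository.

from dataclasses import dataclass

@dataclass
class DisputeVerdicts:
    # Hypothetical structure: the model's verdict when narrator A tells
    # the story, and its verdict when narrator B tells the same story.
    # Each verdict is "A", "B", or "neither".
    as_told_by_a: str
    as_told_by_b: str

def contradiction_rate(cases: list[DisputeVerdicts]) -> float:
    # Fraction of disputes where the model sided with whichever party
    # happened to be narrating, i.e., agreed with both sides.
    # (A "reject both" rate could be computed analogously from "neither".)
    sycophantic = sum(
        1 for c in cases
        if c.as_told_by_a == "A" and c.as_told_by_b == "B"
    )
    return sycophantic / len(cases) if cases else 0.0

# Example: in 2 of 3 disputes the verdict flips with the narrator.
cases = [
    DisputeVerdicts("A", "B"),  # agreed with both narrators
    DisputeVerdicts("A", "A"),  # consistent verdict
    DisputeVerdicts("A", "B"),  # agreed with both narrators
]
print(contradiction_rate(cases))  # 0.666...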
Stars: 13
Forks: —
Language: —
License: —
Category: —
Last pushed: Mar 09, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/lechmazur/sycophancy"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
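For programmatic access, here is a minimal Python sketch of the same request, assuming the endpoint returns JSON (the response's field names are not documented on this page):

import requests  # third-party: pip install requests

# Fetch this repository's quality data from the public endpoint above.
# No key is needed within the 100-requests/day limit.
url = "https://pt-edge.onrender.com/api/v1/quality/llm-tools/lechmazur/sycophancy"
resp = requests.get(url, timeout=10)
resp.raise_for_status()
print(resp.json())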
Higher-rated alternatives
EvolvingLMMs-Lab/lmms-eval
One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks
vibrantlabsai/ragas
Supercharge Your LLM Application Evaluations 🚀
open-compass/VLMEvalKit
Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks
EuroEval/EuroEval
The robust European language model benchmark.
Giskard-AI/giskard-oss
🐢 Open-Source Evaluation & Testing library for LLM Agents