LLM Bias Evaluation Transformer Models

23 LLM bias evaluation models are tracked. One scores above 50 (Established tier). The highest-rated is google-deepmind/long-form-factuality at 55/100 with 672 stars.

Get all 23 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=transformers&subcategory=llm-bias-evaluation&limit=20"

Open to everyone: 100 requests/day, no key needed. A free key raises the limit to 1,000/day.
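The same request can be made from Python. The sketch below uses only the standard library and reuses the query parameters from the curl command above. The helper names `build_url` and `fetch_listing` are hypothetical, and the shape of the JSON response is not documented on this page, so the code decodes the body without assuming any field names. The `limit=20` default is copied from the curl command; whether the endpoint accepts a larger value to return all 23 projects is an assumption you would need to check.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the curl command on this page.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="transformers", subcategory="llm-bias-evaluation", limit=20):
    """Build the listing URL with the same query parameters as the curl command."""
    query = urlencode({"domain": domain, "subcategory": subcategory, "limit": limit})
    return f"{BASE_URL}?{query}"


def fetch_listing(url, timeout=10):
    """Fetch the URL and decode the body as JSON.

    The response structure is not assumed here; the caller should inspect
    the returned object before relying on any keys.
    """
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_listing(build_url())` performs the network request and counts against the daily request limit, so inspect the returned object once and cache it locally if you plan to reuse it.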

#	Model	Score	Tier	Stars	Language
1	google-deepmind/long-form-factuality Benchmarking long-form factuality in large language models. Original code...	55	Established	672	Python
2	gnai-creator/aletheion-llm-v2 Decoder-only LLM with integrated epistemic tomography. Knows what it doesn't know.	38	Emerging	2	Python
3	sandylaker/ib-edl Calibrating LLMs with Information-Theoretic Evidential Deep Learning (ICLR 2025)	37	Emerging	17	Python
4	nightdessert/Retrieval_Head open-source code for paper: Retrieval Head Mechanistically Explains...	33	Emerging	236	Python
5	MLD3/steerability An open-source evaluation framework for measuring LLM steerability.	33	Emerging	4	Jupyter Notebook
6	kazemihabib/Mitigating-Reasoning-LLM-Social-Bias A novel approach to mitigating social bias in Large Language Models through...	32	Emerging	3	Python
7	EternityYW/BiasEval-LLM-MentalHealth Unveiling and Mitigating Bias in Mental Health Analysis with Large Language Models	32	Emerging	12	Jupyter Notebook
8	aigc-apps/PertEval [NeurIPS '24 Spotlight] PertEval: Unveiling Real Knowledge Capacity of LLMs...	31	Emerging	14	Jupyter Notebook
9	bowen-upenn/llm_token_bias [EMNLP 2024] A Peek into Token Bias: Large Language Models Are Not Yet...	30	Emerging	26	Python
10	chandar-lab/CAIRO We explain why fairness metrics don't correlate and propose CAIRO to make...	30	Emerging	2	Python
11	xingbpshen/medical-calibration-fairness-mllm [MICCAI 2025] The official implementation of the paper "Exposing and...	25	Experimental	5	Python
12	x-zheng16/CALM [AAAI 25] CALM: Curiosity-Driven Auditing for LLMs	25	Experimental	5	Python
13	fannie1208/FactTest [ICML2025] "FactTest: Factuality Testing in Large Language Models with...	23	Experimental	9	Python
14	fabthebest/EIC_Framework_Calibration LLM decision-calibration engine based on Shannon Entropy and semantic...	21	Experimental	—	Jupyter Notebook
15	jwmke/BiasCompass Using LLMs to detect bias in news articles.	20	Experimental	5	Jupyter Notebook
16	joaoaleite/PASTEL PASTEL (Prompted weAk Supervision wiTh crEdibility signaLs) is a weakly...	19	Experimental	3	Jupyter Notebook
17	datos-Fundar/sesgos_LLM How do LLMs “get it wrong”?	18	Experimental	2	Jupyter Notebook
18	brucelyu17/SC-TC-Bench [FAccT '25] Characterizing Bias: Benchmarking LLMs in Simplified versus...	17	Experimental	4	Python
19	mtichikawa/llm-bias-detection Research project detecting and quantifying demographic bias in language models	14	Experimental	—	Jupyter Notebook
20	Wazzabeee/Bias-Mitigation-In-LLM Research POC on the mitigation of bias in large language models (FLAN-T5 and...	12	Experimental	7	Jupyter Notebook
21	Indiiigo/LLM_rep_review Systematic Review of the Demographic Representativeness of LLMs	11	Experimental	1	Jupyter Notebook
22	cognitivefactory/llm-bias-analysis Benchmark tool aimed at evaluating biases of large language models	11	Experimental	—	Jupyter Notebook
23	anoopkdcs/affective_bias_in_plm Affective Bias in Large Pre-trained Language Models	11	Experimental	—	Jupyter Notebook