LLM Interpretability and Explainability Transformer Models

This list tracks 54 LLM interpretability and explainability projects. The highest-rated is MadryLab/context-cite, at 49/100 with 325 stars.

Get all 54 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=transformers&subcategory=llm-interpretability-explainability&limit=20"

Open to everyone: 100 requests/day, no key needed. A free key raises the limit to 1,000 requests/day.
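The curl request above can also be issued from Python. The following is a minimal, unofficial sketch using only the standard library. The helper names `build_url` and `fetch_projects` are hypothetical, and the response schema is not documented on this page, so the decoded JSON is returned as-is instead of being unpacked into assumed fields. The `limit=20` value is copied from the curl command; whether a larger value returns all 54 projects in a single call is not confirmed here.

```python
import json
import urllib.parse
import urllib.request

# Endpoint taken verbatim from the curl command.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain: str, subcategory: str, limit: int) -> str:
    """Build the query URL with the same three parameters as the curl command."""
    query = urllib.parse.urlencode(
        {"domain": domain, "subcategory": subcategory, "limit": limit}
    )
    return f"{BASE_URL}?{query}"


def fetch_projects(url: str, timeout: float = 10.0):
    """Send the GET request and decode the JSON body.

    Assumption: the endpoint answers with a JSON document. Its structure is
    not described in the source, so nothing is extracted from it here.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


url = build_url("transformers", "llm-interpretability-explainability", 20)
print(url)

# Uncomment to call the API; each call counts against the daily request quota.
# data = fetch_projects(url)
```

Building the URL separately from fetching it keeps the quota-consuming network call explicit, which is useful with a limit of 100 keyless requests per day.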

| # | Model | Description | Score | Tier |
|---|---|---|---|---|
| 1 | MadryLab/context-cite | Attribute (or cite) statements generated by LLMs back to in-context information. | 49 | Emerging |
| 2 | microsoft/augmented-interpretable-models | Interpretable and efficient predictors using pre-trained language models.... | 48 | Emerging |
| 3 | Trustworthy-ML-Lab/CB-LLMs | [ICLR 25] A novel framework for building intrinsically interpretable LLMs... | 44 | Emerging |
| 4 | poloclub/LLM-Attributor | LLM Attributor: Attribute LLM's Generated Text to Training Data | 41 | Emerging |
| 5 | THUDM/LongCite | LongCite: Enabling LLMs to Generate Fine-grained Citations in Long-context QA | 39 | Emerging |
| 6 | UKPLab/5pils | Code associated with the EMNLP 2024 Main paper: "Image, tell me your story!"... | 39 | Emerging |
| 7 | hao-ai-lab/Consistency_LLM | [ICML 2024] CLLMs: Consistency Large Language Models | 38 | Emerging |
| 8 | yueyu1030/AttrPrompt | [NeurIPS 2023] This is the code for the paper `Large Language Model as... | 38 | Emerging |
| 9 | nlpkeg/Know-MRI | This is an official code for the [ACL 2025 Demo] paper: Know-MRI: A... | 37 | Emerging |
| 10 | leap-laboratories/PIZZA | An attribution library for LLMs | 37 | Emerging |
| 11 | phvv-me/frame-representation-hypothesis | Official Repository for Frame Representation Hypothesis paper | 36 | Emerging |
| 12 | msakarvadia/memorization | Localizing Memorized Sequences in Language Models | 36 | Emerging |
| 13 | ntt-dkiku/route-explainer | The official implementation of "RouteExplainer: An Explanation Framework for... | 35 | Emerging |
| 14 | AI4LIFE-GROUP/LLM_Explainer | Code for paper: Are Large Language Models Post Hoc Explainers? | 35 | Emerging |
| 15 | itsqyh/Awesome-LMMs-Mechanistic-Interpretability | A curated collection of resources focused on the Mechanistic... | 34 | Emerging |
| 16 | microsoft/MMLU-CF | A Contamination-free Multi-task Language Understanding Benchmark [Official, ACL 2025] | 33 | Emerging |
| 17 | parameterlab/apricot | Source code of "Calibrating Large Language Models Using Their Generations... | 33 | Emerging |
| 18 | songxiaoshuai/progco | Official Implementation of "ProgCo: Program Helps Self-Correction of Large... | 33 | Emerging |
| 19 | yinzhangyue/SelfAware | Do Large Language Models Know What They Don’t Know? | 32 | Emerging |
| 20 | jwergieluk/revllm | RevLLM -- Reverse Engineering Tools for Large Language Models | 31 | Emerging |
| 21 | Trustworthy-ML-Lab/VLG-CBM | [NeurIPS 24] A new training and evaluation framework for learning... | 31 | Emerging |
| 22 | llm-misinformation/llm-misinformation | The dataset and code for the ICLR 2024 paper "Can LLM-Generated... | 30 | Emerging |
| 23 | salesforce/factualNLG | Code for the arXiv paper: "LLMs as Factual Reasoners: Insights from Existing... | 30 | Emerging |
| 24 | Zhang-Yihao/Adversarial-Representation-Engineering | Official implementation repository for the paper Towards General Conceptual... | 30 | Emerging |
| 25 | plusnli/medical-knowledge-judgment | Codes and data for paper "Fact or Guesswork? Evaluating Large Language... | 29 | Experimental |
| 26 | UKPLab/arxiv2025-misleading-visualizations | Code and datasets accompanying the arXiv preprint: "Protecting multimodal... | 29 | Experimental |
| 27 | gsarti/pecore | Materials for "Quantifying the Plausibility of Context Reliance in Neural... | 27 | Experimental |
| 28 | yyy01/PAC | The official implementation of the paper "Data Contamination Calibration for... | 27 | Experimental |
| 29 | LFhase/CausalCOAT | [NeurIPS 2024] Discovery of the Hidden World with Large Language Models | 27 | Experimental |
| 30 | Trustworthy-ML-Lab/Describe-and-Dissect | [TMLR 25] An automated method for explaining complex neuron behaviors in... | 26 | Experimental |
| 31 | bgreenwell/statlingua | Explain Statistical Output with Large Language Models | 25 | Experimental |
| 32 | Strong-AI-Lab/Explanation-Generation | We introduce "ILearner-LLM" a framework that uses iterative enhancement with... | 24 | Experimental |
| 33 | Human-Centric-Machine-Learning/counterfactual-llms | Code for "Counterfactual Token Generation in Large Language Models", Arxiv 2024. | 24 | Experimental |
| 34 | HamedBabaei/CoLLM | CoLLM: Consistency of Large Language Models in Knowledge Engineering | 23 | Experimental |
| 35 | tbohne/saliency_kd | Saliency map-guided knowledge discovery for subclass identification with... | 23 | Experimental |
| 36 | Trustworthy-ML-Lab/Efficient-LLM-automated-interpretability | [NeurIPS'23 ATTRIB] An efficient framework to generate neuron explanations for LLMs | 22 | Experimental |
| 37 | braingpt-lovelab/backwards | Source code for | 21 | Experimental |
| 38 | Faisalse/LLM-reproducibility-audit | https://faisalse.github.io/LLM-reproducibility-audit/ | 21 | Experimental |
| 39 | Aniezka/xfact-fever | Official repository of FEVER@ACL 2025 paper "When Scale Meets Diversity:... | 21 | Experimental |
| 40 | kasia-kobalczyk/guess_llm | Implementation of the probing models presented in the ICLR 2026 paper... | 21 | Experimental |
| 41 | AikyamLab/llm-memorization | Understanding the memorization property of Large Language Models using Model... | 21 | Experimental |
| 42 | psunlpgroup/VerbosityLLM | This repository maintains dataset, predictions, and code for paper:... | 20 | Experimental |
| 43 | k-randl/self-explaining_llms | Official implementation of the papers "Evaluating the Reliability of... | 19 | Experimental |
| 44 | dennismstfc/building-the-soedermizer | Building the Söd★mizer: AI-Driven Machine Translation for Gender-Sensitive... | 19 | Experimental |
| 45 | zhaochen0110/LMLM | Code and data for "Improving Temporal Generalization of Pre-trained Language... | 19 | Experimental |
| 46 | lindaCai1997/data-attribution | Scalable Gradient-Based Attribution of LLM Behaviors | 17 | Experimental |
| 47 | RManLuo/llm-facteval | Source code of paper "Systematic Assessment of Factual Knowledge in Large... | 14 | Experimental |
| 48 | stvsever/aHFR_TokenSHAP | This repository implements an adaptive, hierarchy-aware Shapley method for... | 13 | Experimental |
| 49 | gianniskalyvas/llm-posthoc-explainability | A study on post-hoc explainability in LLMs using counterfactual... | 13 | Experimental |
| 50 | ShiningLab/CON2LM | This repository is for the paper Word Surprisal Correlates with Sentential... | 13 | Experimental |
| 51 | froge159/belief-project-sef | Activation-Space Interventions for Causal Control of Belief Representations... | 13 | Experimental |
| 52 | BridgeAI-Lab/LLM-as-Meta-Reviewer | [NAACL'25] Dataset and Evaluation Code for Paper LLMs as Meta-Reviewers’... | 12 | Experimental |
| 53 | abhilash-neog/FactCheckingBioLLMs | Evaluating the reasoning ability of LLMs specifically within the biomedical... | 12 | Experimental |
| 54 | ExcellentDarkTea/LLM-Causal-Discovery | Using LLMs to support expert elicitation in causal discovery, combining... | 10 | Experimental |