Peiyang-Song/LLM-A-Not-B-Errors

Official repository for the paper "In-Context Learning May Not Elicit Trustworthy Reasoning: A-Not-B Errors in Pretrained Language Models" (Findings of EMNLP 2024).

Quality score: 27 / 100 (Experimental)

This project helps evaluate how well large language models (LLMs) perform on specific reasoning tasks, especially when given examples to learn from. It takes structured data representing various reasoning problems as input and outputs an analysis of whether LLMs make 'A-not-B' errors, which indicate faulty reasoning. This is primarily useful for AI researchers, cognitive scientists, and anyone critically evaluating the logical capabilities of LLMs.
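For a concrete sense of the A-not-B paradigm, here is a minimal, hypothetical sketch (not the repository's actual code or data format) of how such a test prompt could be built: several in-context examples whose correct answer always sits at position (A), followed by a query whose correct answer is at position (B). A model that still answers "(A)" is perseverating on the rewarded position rather than reasoning about content.

# Hypothetical sketch of an A-not-B test prompt; the repository's real
# data format and evaluation harness may differ.

few_shot = [
    # In every demonstration the correct choice sits at position (A).
    ("Which is a mammal?", ["dog", "salmon"], "A"),
    ("Which is a prime number?", ["7", "8"], "A"),
    ("Which city is in France?", ["Paris", "Tokyo"], "A"),
]

# In the final query the correct answer moves to position (B).
query = ("Which is a fruit?", ["hammer", "apple"])  # correct answer: B

def format_item(question, options, answer=None):
    lines = [question]
    for label, opt in zip("AB", options):
        lines.append(f"({label}) {opt}")
    if answer is not None:
        lines.append(f"Answer: ({answer})")
    return "\n".join(lines)

prompt = "\n\n".join(
    [format_item(q, opts, ans) for q, opts, ans in few_shot]
    + [format_item(*query), "Answer:"]
)

print(prompt)
# A model that replies "(A)" here, despite (B) being correct, exhibits
# an A-not-B error: it repeats the position rewarded in the
# demonstrations instead of evaluating the new question.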

Use this if you are researching the limitations of large language models' reasoning abilities, especially their susceptibility to specific logical fallacies during in-context learning.

Not ideal if you are looking for a tool to build or fine-tune LLMs for general applications, or if you need to perform natural language processing tasks outside of reasoning evaluation.

AI-evaluation LLM-reasoning cognitive-science in-context-learning computational-linguistics
No package published · No dependents
Maintenance: 6 / 25
Adoption: 5 / 25
Maturity: 16 / 25
Community: 0 / 25


Stars: 13
Forks:
Language: Python
License: MIT
Last pushed: Jan 10, 2026
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/transformers/Peiyang-Song/LLM-A-Not-B-Errors"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
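The same data can also be fetched programmatically. Below is a minimal Python sketch using the endpoint from the curl example above; it assumes the response body is JSON (how a free API key is passed, e.g. which header name, would need to be confirmed against the service's documentation):

import requests

# Endpoint taken verbatim from the curl example above.
URL = "https://pt-edge.onrender.com/api/v1/quality/transformers/Peiyang-Song/LLM-A-Not-B-Errors"

resp = requests.get(URL, timeout=10)
resp.raise_for_status()       # fail loudly on HTTP errors
data = resp.json()            # assumes a JSON response body
print(data)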