Peiyang-Song/LLM-A-Not-B-Errors
Official repository for the paper "In-Context Learning May Not Elicit Trustworthy Reasoning: A-Not-B Errors in Pretrained Language Models", published in Findings of EMNLP 2024.
This project evaluates how well large language models (LLMs) perform on specific reasoning tasks, especially when given in-context examples to learn from. Given structured data representing various reasoning problems, it analyzes whether LLMs make 'A-not-B' errors, which indicate faulty reasoning. It is primarily useful for AI researchers, cognitive scientists, and anyone critically evaluating the logical capabilities of LLMs.
Use this if you are researching the limitations of large language models' reasoning abilities, especially their susceptibility to specific logical fallacies during in-context learning.
Not ideal if you are looking for a tool to build or fine-tune LLMs for general applications, or if you need to perform natural language processing tasks outside of reasoning evaluation.
Stars: 13
Forks: —
Language: Python
License: MIT
Category: —
Last pushed: Jan 10, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/Peiyang-Song/LLM-A-Not-B-Errors"
Open to everyone: 100 requests/day with no key required; a free key raises the limit to 1,000 requests/day.
Higher-rated alternatives
yyDing1/ScaleQuest
[ACL 2025] We introduce ScaleQuest, a scalable, novel and cost-effective data synthesis method...
yilin-geng/llm-instruction-conflicts
This repository contains the data and the code for the paper "Control Illusion: The Failure of...
valeria-izvoreanu/LLM-Hallucination-Detection-SemEval2024
Semi-supervised pipeline to detect LLM hallucinations. Uses Mistral-7B for zero-shot...
noanonkes/Hallucination-Detection-in-LLMs
Detecting Hallucinations in Large Language Model Generations using Graph Structures