Peiyang-Song/LLM-A-Not-B-Errors
Official repository for the paper "In-Context Learning May Not Elicit Trustworthy Reasoning: A-Not-B Errors in Pretrained Language Models", published in Findings of EMNLP 2024.
This project evaluates how well large language models (LLMs) perform on specific reasoning tasks, especially when given in-context examples to learn from. Given structured data representing various reasoning problems, it analyzes whether LLMs make 'A-not-B' errors, which indicate faulty reasoning. It is primarily useful for AI researchers, cognitive scientists, and anyone critically evaluating the logical capabilities of LLMs.
Use this if you are researching the limitations of large language models' reasoning abilities, especially their susceptibility to specific logical fallacies during in-context learning.
Not ideal if you are looking for a tool to build or fine-tune LLMs for general applications, or if you need to perform natural language processing tasks outside of reasoning evaluation.
Stars: 13
Forks: —
Language: Python
License: MIT
Category: —
Last pushed: Jan 10, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/Peiyang-Song/LLM-A-Not-B-Errors"
Open to everyone: 100 requests/day with no key required; a free key raises the limit to 1,000 requests/day.
Higher-rated alternatives
yyDing1/ScaleQuest
[ACL 2025] We introduce ScaleQuest, a scalable, novel and cost-effective data synthesis method...
yilin-geng/llm-instruction-conflicts
This repository contains the data and the code for the paper "Control Illusion: The Failure of...
valeria-izvoreanu/LLM-Hallucination-Detection-SemEval2024
Semi-supervised pipeline to detect LLM hallucinations. Uses Mistral-7B for zero-shot...
noanonkes/Hallucination-Detection-in-LLMs
Detecting Hallucinations in Large Language Model Generations using Graph Structures