TIGER-AI-Lab/LongICLBench

Code and Data for "Long-context LLMs Struggle with Long In-context Learning" [TMLR2025]

Quality score: 36 / 100 (Emerging)

This project helps evaluate how well large language models (LLMs) perform when classifying text into many different categories, especially when given a lot of examples to learn from. It takes various text classification datasets with many labels (like emotions or intents) and different LLMs as input, then measures how accurately the models classify the text. AI researchers and practitioners who are building or deploying LLMs for complex classification tasks would use this.

112 stars. No commits in the last 6 months.

Use this if you need to benchmark the performance of large language models on challenging text classification tasks with a high number of categories and long context windows.

Not ideal if you are looking to fine-tune an LLM or want a tool for general-purpose natural language processing tasks beyond extreme-label classification.

Tags: LLM evaluation, text classification, natural language processing, AI model benchmarking, machine learning research
Badges: Stale (6 months), No Package, No Dependents
Maintenance 0 / 25
Adoption 9 / 25
Maturity 16 / 25
Community 11 / 25


Stars: 112
Forks: 8
Language: Python
License: MIT
Last pushed: Feb 20, 2025
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/transformers/TIGER-AI-Lab/LongICLBench"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
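The curl command above can be wrapped in a few lines of Python. This is a minimal sketch, assuming the endpoint returns JSON; the `registry` path segment mirrors the URL shown above, and no response schema is documented, so the decoded fields are not guaranteed.

```python
# Hedged sketch: query the quality API programmatically.
# Assumptions: the endpoint returns a JSON body; the URL shape matches
# the curl example above. No field names are guaranteed by the source.
import json
import urllib.request

API_BASE = "https://pt-edge.onrender.com/api/v1/quality"


def build_url(registry: str, owner: str, repo: str) -> str:
    """Construct the quality-API URL for a repository."""
    return f"{API_BASE}/{registry}/{owner}/{repo}"


def fetch_quality(registry: str, owner: str, repo: str) -> dict:
    """Fetch and decode the quality report (assumed to be JSON)."""
    with urllib.request.urlopen(build_url(registry, owner, repo)) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Matches the curl example; subject to the 100 requests/day limit.
    print(fetch_quality("transformers", "TIGER-AI-Lab", "LongICLBench"))
```

With a free API key, you would presumably raise the limit to 1,000 requests/day; how the key is passed (header or query parameter) is not specified on this page.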