TIGER-AI-Lab/LongICLBench
Code and Data for "Long-context LLMs Struggle with Long In-context Learning" [TMLR2025]
This project evaluates how well large language models (LLMs) classify text into many categories, especially when given a large number of in-context examples. It takes text classification datasets with large label spaces (such as emotions or intents) and a choice of LLM as input, then measures how accurately the model classifies the text. It is aimed at AI researchers and practitioners building or deploying LLMs for extreme-label classification tasks.
112 stars. No commits in the last 6 months.
Use this if you need to benchmark the performance of large language models on challenging text classification tasks with a high number of categories and long context windows.
Not ideal if you are looking to fine-tune an LLM or want a tool for general-purpose natural language processing tasks beyond extreme-label classification.
Stars: 112
Forks: 8
Language: Python
License: MIT
Category:
Last pushed: Feb 20, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/TIGER-AI-Lab/LongICLBench"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
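For programmatic use, the curl call above can be wrapped in a few lines of Python. This is a minimal sketch: only the URL path shown in the curl example is taken from this page, while the `X-API-Key` header name and the JSON response format are assumptions to verify against the service's documentation.

```python
import json
import urllib.request

BASE = "https://pt-edge.onrender.com/api/v1/quality"

def build_request(registry, owner, repo, api_key=None):
    """Build a GET request for the quality endpoint shown above.

    The path segments mirror the curl example. The X-API-Key header
    name is an assumption; check the service docs for the real one.
    """
    url = f"{BASE}/{registry}/{owner}/{repo}"
    headers = {"Accept": "application/json"}
    if api_key:
        headers["X-API-Key"] = api_key  # assumed header name
    return urllib.request.Request(url, headers=headers)

req = build_request("transformers", "TIGER-AI-Lab", "LongICLBench")
print(req.full_url)
# -> https://pt-edge.onrender.com/api/v1/quality/transformers/TIGER-AI-Lab/LongICLBench

# To actually fetch (requires network access):
#   with urllib.request.urlopen(req, timeout=10) as resp:
#       data = json.load(resp)
```

Without a key this counts against the 100-requests/day anonymous quota; pass `api_key=...` once you have a free key for the 1,000/day tier.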
Higher-rated alternatives
ExtensityAI/symbolicai - A neurosymbolic perspective on LLMs
TIGER-AI-Lab/MMLU-Pro - The code and data for "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding...
deep-symbolic-mathematics/LLM-SR - [ICLR 2025 Oral] This is the official repo for the paper "LLM-SR" on Scientific Equation...
microsoft/interwhen - A framework for verifiable reasoning with language models.
zhudotexe/fanoutqa - Companion code for FanOutQA: Multi-Hop, Multi-Document Question Answering for Large Language...