TIGER-AI-Lab/LongICLBench

Code and Data for "Long-context LLMs Struggle with Long In-context Learning" [TMLR2025]

Quality score: 36 / 100 (Emerging)

This project helps evaluate how well large language models (LLMs) perform when classifying text into many different categories, especially when given a lot of examples to learn from. It takes various text classification datasets with many labels (like emotions or intents) and different LLMs as input, then measures how accurately the models classify the text. AI researchers and practitioners who are building or deploying LLMs for complex classification tasks would use this.

112 stars. No commits in the last 6 months.

Use this if you need to benchmark the performance of large language models on challenging text classification tasks with a high number of categories and long context windows.

Not ideal if you are looking to fine-tune an LLM or want a tool for general-purpose natural language processing tasks beyond extreme-label classification.

Tags: LLM evaluation, text classification, natural language processing, AI model benchmarking, machine learning research
Badges: Stale (6 months), No Package, No Dependents
Maintenance 0 / 25
Adoption 9 / 25
Maturity 16 / 25
Community 11 / 25


Stars: 112
Forks: 8
Language: Python
License: MIT
Last pushed: Feb 20, 2025
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/transformers/TIGER-AI-Lab/LongICLBench"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
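The curl command above can be wrapped in a few lines of Python. This is a minimal sketch, assuming the endpoint returns JSON; the `registry` path segment mirrors the URL shown above, and no response schema is documented, so the decoded fields are not guaranteed.

```python
# Hedged sketch: query the quality API programmatically.
# Assumptions: the endpoint returns a JSON body; the URL shape matches
# the curl example above. No field names are guaranteed by the source.
import json
import urllib.request

API_BASE = "https://pt-edge.onrender.com/api/v1/quality"


def build_url(registry: str, owner: str, repo: str) -> str:
    """Construct the quality-API URL for a repository."""
    return f"{API_BASE}/{registry}/{owner}/{repo}"


def fetch_quality(registry: str, owner: str, repo: str) -> dict:
    """Fetch and decode the quality report (assumed to be JSON)."""
    with urllib.request.urlopen(build_url(registry, owner, repo)) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Matches the curl example; subject to the 100 requests/day limit.
    print(fetch_quality("transformers", "TIGER-AI-Lab", "LongICLBench"))
```

With a free API key, you would presumably raise the limit to 1,000 requests/day; how the key is passed (header or query parameter) is not specified on this page.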