google/curie
Code release for "CURIE: Evaluating LLMs On Multitask Scientific Long Context Understanding and Reasoning", ICLR 2025
CURIE is a benchmark for evaluating how well Large Language Models (LLMs) understand and reason over long scientific texts. It provides structured inputs, such as full research papers, and expects accurate, context-aware scientific information extraction and reasoning in return. Researchers and AI developers working on scientific applications of LLMs can use it to gauge model performance on complex, real-world scientific tasks.
No commits in the last 6 months.
Use this if you need to objectively measure a Large Language Model's proficiency in comprehending, reasoning over, and extracting information from lengthy scientific documents across diverse domains like materials science, quantum computing, or biodiversity.
Not ideal if you are looking for a tool to train LLMs or apply them directly to scientific problems; this is purely for evaluating existing models on specific, long-context scientific tasks.
Stars: 29
Forks: 3
Language: Jupyter Notebook
License: Apache-2.0
Category:
Last pushed: Apr 21, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/google/curie"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
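For scripted access, the same endpoint can be queried from Python. Below is a minimal standard-library sketch; it assumes the endpoint returns JSON (the response shape is not documented here, so it simply pretty-prints whatever comes back), and it uses only the unauthenticated tier, since the note above does not specify how an API key is passed.

import json
import urllib.request

# Endpoint copied from the curl example above.
URL = "https://pt-edge.onrender.com/api/v1/quality/transformers/google/curie"

# Fetch and decode the response (assumed to be JSON).
with urllib.request.urlopen(URL) as resp:
    data = json.load(resp)

# Pretty-print whatever fields the API returns.
print(json.dumps(data, indent=2))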
Higher-rated alternatives
ExtensityAI/symbolicai - A neurosymbolic perspective on LLMs
TIGER-AI-Lab/MMLU-Pro - The code and data for "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding...
deep-symbolic-mathematics/LLM-SR - [ICLR 2025 Oral] This is the official repo for the paper "LLM-SR" on Scientific Equation...
microsoft/interwhen - A framework for verifiable reasoning with language models.
zhudotexe/fanoutqa - Companion code for FanOutQA: Multi-Hop, Multi-Document Question Answering for Large Language...