google/curie

Code release for "CURIE: Evaluating LLMs On Multitask Scientific Long Context Understanding and Reasoning", ICLR 2025

Quality score: 34 / 100 (Emerging)

This project offers a specialized benchmark for evaluating how well Large Language Models (LLMs) understand and reason over extensive scientific texts. It helps researchers, data scientists, and AI developers assess an LLM's ability to process long research papers and scientific documents: given structured inputs such as full papers, models are expected to produce accurate, context-aware scientific information extraction and reasoning. Scientists and AI developers working on scientific applications of LLMs can use it to gauge model performance on complex, real-world scientific tasks.

No commits in the last 6 months.

Use this if you need to measure objectively how well a Large Language Model comprehends, reasons over, and extracts information from lengthy scientific documents across diverse domains such as materials science, quantum computing, or biodiversity.

Not ideal if you are looking for a tool to train LLMs or apply them directly to scientific problems; this is purely for evaluating existing models on specific, long-context scientific tasks.

Tags: scientific-LLM-evaluation, materials-science, quantum-computing, geospatial-analysis, biodiversity-research
Status: Stale (6 months), no package published, no dependents
Score breakdown:

Maintenance: 2 / 25
Adoption: 7 / 25
Maturity: 16 / 25
Community: 9 / 25

The four subscores sum to the overall 34 / 100.

Stars: 29
Forks: 3
Language: Jupyter Notebook
License: Apache-2.0
Last pushed: Apr 21, 2025
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/transformers/google/curie"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
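
For programmatic access, here is a minimal Python sketch built around the curl endpoint above. It assumes the endpoint returns JSON; the X-API-Key header name used for the optional key is an assumption, not documented behavior, so confirm the actual authentication scheme before relying on it.

import json
import urllib.request

# Endpoint from the curl example above.
URL = "https://pt-edge.onrender.com/api/v1/quality/transformers/google/curie"

def fetch_quality(api_key=None):
    """Fetch the quality record for google/curie as a dict.

    The X-API-Key header name is an assumption; check the API's own
    docs for the real authentication scheme if you use a key.
    """
    req = urllib.request.Request(URL)
    if api_key:
        req.add_header("X-API-Key", api_key)  # assumed header name
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)

if __name__ == "__main__":
    data = fetch_quality()  # anonymous call, within the 100 requests/day limit
    print(json.dumps(data, indent=2))  # inspect the raw payload for field names

Without a key this stays within the anonymous 100 requests/day limit; pass a key to fetch_quality once you have one.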