google/curie
Code release for "CURIE: Evaluating LLMs On Multitask Scientific Long Context Understanding and Reasoning", ICLR 2025
CURIE is a benchmark for evaluating how well Large Language Models (LLMs) understand and reason over long scientific texts. It provides structured inputs, such as full research papers, and expects accurate, context-aware scientific information extraction and reasoning in return. Researchers and AI developers working on scientific applications of LLMs can use it to gauge model performance on complex, real-world scientific tasks.
No commits in the last 6 months.
Use this if you need to objectively measure a Large Language Model's proficiency in comprehending, reasoning over, and extracting information from lengthy scientific documents across diverse domains like materials science, quantum computing, or biodiversity.
Not ideal if you are looking for a tool to train LLMs or apply them directly to scientific problems; this is purely for evaluating existing models on specific, long-context scientific tasks.
Stars: 29
Forks: 3
Language: Jupyter Notebook
License: Apache-2.0
Category:
Last pushed: Apr 21, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/google/curie"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
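For scripted access, the same endpoint can be queried from Python. Below is a minimal standard-library sketch; it assumes the endpoint returns JSON (the response shape is not documented here, so it simply pretty-prints whatever comes back), and it uses only the unauthenticated tier, since the note above does not specify how an API key is passed.

import json
import urllib.request

# Endpoint copied from the curl example above.
URL = "https://pt-edge.onrender.com/api/v1/quality/transformers/google/curie"

# Fetch and decode the response (assumed to be JSON).
with urllib.request.urlopen(URL) as resp:
    data = json.load(resp)

# Pretty-print whatever fields the API returns.
print(json.dumps(data, indent=2))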
Higher-rated alternatives
ExtensityAI/symbolicai - A neurosymbolic perspective on LLMs
TIGER-AI-Lab/MMLU-Pro - The code and data for "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding...
deep-symbolic-mathematics/LLM-SR - [ICLR 2025 Oral] This is the official repo for the paper "LLM-SR" on Scientific Equation...
microsoft/interwhen - A framework for verifiable reasoning with language models.
zhudotexe/fanoutqa - Companion code for FanOutQA: Multi-Hop, Multi-Document Question Answering for Large Language...