RenzeLou/AAAR-1.0

The source code for running LLMs on the AAAR-1.0 benchmark.

/ 100

Experimental

This project helps researchers and AI practitioners evaluate how well large language models (LLMs) can assist with common research tasks. It provides a benchmark to test LLMs on inferring equations from papers, designing experiments from research proposals, identifying weaknesses in paper drafts, and critiquing peer reviews. The output includes performance metrics for various LLMs on these tasks, helping users understand their capabilities.

No commits in the last 6 months.

Use this if you are a researcher or AI practitioner who needs to rigorously assess and compare the performance of different LLMs in assisting with academic research workflows.

Not ideal if you are looking for a tool to directly apply LLMs to your research tasks without needing to evaluate their performance against a benchmark.

AI-research academic-writing peer-review experiment-design LLM-evaluation

Stale 6m No Package No Dependents

Maintenance 0 / 25

Adoption 6 / 25

Maturity 16 / 25

Community 0 / 25

How are scores calculated?

Stars

Forks

—

Language

Python

License

MIT

Higher-rated alternatives

monarch-initiative/ontogpt

LLM-based ontological extraction tools, including SPIRES

weAIDB/awesome-data-llm

Official Repository of "LLM × DATA" Survey Paper

AXYZdong/AMchat

AM (Advanced Mathematics) Chat is a large language model that integrates advanced mathematical...

skywalker023/sodaverse

🥤🧑🏻‍🚀Code and dataset for our EMNLP 2023 paper - "SODA: Million-scale Dialogue Distillation with...

Y-Research-SBU/TimeSeriesScientist

Official Repository for TimeSeriesScientist

Explore LLM Tools

All categories Trending LLM Tool directory Insights