nl4opt/ORQA
[AAAI 2025] ORQA is a QA benchmark designed to assess the reasoning capabilities of LLMs in the specialized technical domain of Operations Research. The benchmark evaluates whether LLMs can emulate the knowledge and reasoning skills of OR experts when presented with complex optimization modeling tasks.
This benchmark evaluates how well large language models (LLMs) understand and apply complex optimization concepts from Operations Research. It takes real-world optimization problem descriptions and related questions as input, then assesses whether an LLM can correctly identify model components and reason about them. Anyone developing or deploying LLMs for technical problem-solving, particularly in supply chain, logistics, or resource allocation, can use it to gauge a model's domain expertise.
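Evaluation on a benchmark like this typically reduces to a question-answering loop: each instance pairs an optimization problem description with a question, the model's answer is parsed, and it is scored against the ground truth. The sketch below shows what such a loop might look like in Python; the file name, the field names (`context`, `question`, `options`, `answer`), and the `query_llm` helper are hypothetical stand-ins, not the repo's actual interface.

```python
import json

def query_llm(prompt: str) -> str:
    """Hypothetical stand-in for a call to the LLM under evaluation."""
    raise NotImplementedError

def evaluate(dataset_path: str) -> float:
    """Score an LLM on a QA dataset with an assumed JSON schema."""
    with open(dataset_path) as f:
        instances = json.load(f)

    correct = 0
    for item in instances:
        # Present the optimization problem description, the question,
        # and the enumerated answer options in a single prompt.
        options = "\n".join(f"{i}. {opt}" for i, opt in enumerate(item["options"]))
        prompt = (
            f"{item['context']}\n\nQuestion: {item['question']}\n"
            f"Options:\n{options}\nAnswer with the option number only."
        )
        prediction = query_llm(prompt).strip()
        if prediction == str(item["answer"]):
            correct += 1
    return correct / len(instances)

if __name__ == "__main__":
    print(f"Accuracy: {evaluate('orqa_test.json'):.3f}")
```

Exact prompting and answer parsing vary by setup; the point is that the benchmark is consumed by an evaluation harness, not by an optimization solver.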
No commits in the last 6 months.
Use this if you need to objectively measure a large language model's ability to reason through and solve problems in the specialized domain of Operations Research.
Not ideal if you are looking for a tool to solve an Operations Research problem directly, as this is a benchmark for evaluating LLMs, not an OR solver.
Stars: 45
Forks: 2
Language: Python
License: —
Category:
Last pushed: Jun 07, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/rag/nl4opt/ORQA"
Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000 requests/day.
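The same data can be fetched from Python instead of curl. A minimal sketch using the `requests` library, assuming the endpoint returns JSON (no response schema is documented here, so the payload is printed as-is):

```python
import requests

URL = "https://pt-edge.onrender.com/api/v1/quality/rag/nl4opt/ORQA"

# Works without a key for up to 100 requests/day; a free key raises
# the limit to 1,000/day (how the key is passed is not shown on this page).
response = requests.get(URL, timeout=10)
response.raise_for_status()

print(response.json())  # assumed JSON payload
```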
Higher-rated alternatives
amazon-science/auto-rag-eval
Code repo for the ICML 2024 paper "Automated Evaluation of Retrieval-Augmented Language Models...
ibm-self-serve-assets/JudgeIt-LLM-as-a-Judge
Automation framework using LLM-as-a-judge to evaluate Agentic AI, RAG, Text2SQL at scale;...
explore-de/rage4j
Evaluate your LLM-based Java apps
mit-ll-ai-technology/llm-sandbox
Large language model evaluation framework for logic and open-ended Q&A with a variety of RAG and...