nl4opt/ORQA
[AAAI 2025] ORQA is a QA benchmark designed to assess the reasoning capabilities of LLMs in the specialized technical domain of Operations Research. The benchmark evaluates whether LLMs can emulate the knowledge and reasoning skills of OR experts when presented with complex optimization modeling tasks.
This benchmark evaluates how well large language models (LLMs) understand and apply complex optimization concepts from Operations Research. It takes real-world optimization problem descriptions and related questions as input, then assesses whether an LLM can correctly identify model components and reason about them. Anyone developing or deploying LLMs for technical problem-solving, particularly in supply chain, logistics, or resource allocation, can use it to gauge a model's domain expertise.
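Evaluation on a benchmark like this typically reduces to a question-answering loop: each instance pairs an optimization problem description with a question, the model's answer is parsed, and it is scored against the ground truth. The sketch below shows what such a loop might look like in Python; the file name, the field names (`context`, `question`, `options`, `answer`), and the `query_llm` helper are hypothetical stand-ins, not the repo's actual interface.

```python
import json

def query_llm(prompt: str) -> str:
    """Hypothetical stand-in for a call to the LLM under evaluation."""
    raise NotImplementedError

def evaluate(dataset_path: str) -> float:
    """Score an LLM on a QA dataset with an assumed JSON schema."""
    with open(dataset_path) as f:
        instances = json.load(f)

    correct = 0
    for item in instances:
        # Present the optimization problem description, the question,
        # and the enumerated answer options in a single prompt.
        options = "\n".join(f"{i}. {opt}" for i, opt in enumerate(item["options"]))
        prompt = (
            f"{item['context']}\n\nQuestion: {item['question']}\n"
            f"Options:\n{options}\nAnswer with the option number only."
        )
        prediction = query_llm(prompt).strip()
        if prediction == str(item["answer"]):
            correct += 1
    return correct / len(instances)

if __name__ == "__main__":
    print(f"Accuracy: {evaluate('orqa_test.json'):.3f}")
```

Exact prompting and answer parsing vary by setup; the point is that the benchmark is consumed by an evaluation harness, not by an optimization solver.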
No commits in the last 6 months.
Use this if you need to objectively measure a large language model's ability to reason through and solve problems in the specialized domain of Operations Research.
Not ideal if you are looking for a tool to solve an Operations Research problem directly, as this is a benchmark for evaluating LLMs, not an OR solver.
Stars: 45
Forks: 2
Language: Python
License: —
Category:
Last pushed: Jun 07, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/rag/nl4opt/ORQA"
Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000 requests/day.
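The same data can be fetched from Python instead of curl. A minimal sketch using the `requests` library, assuming the endpoint returns JSON (no response schema is documented here, so the payload is printed as-is):

```python
import requests

URL = "https://pt-edge.onrender.com/api/v1/quality/rag/nl4opt/ORQA"

# Works without a key for up to 100 requests/day; a free key raises
# the limit to 1,000/day (how the key is passed is not shown on this page).
response = requests.get(URL, timeout=10)
response.raise_for_status()

print(response.json())  # assumed JSON payload
```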
Higher-rated alternatives
amazon-science/auto-rag-eval
Code repo for the ICML 2024 paper "Automated Evaluation of Retrieval-Augmented Language Models...
ibm-self-serve-assets/JudgeIt-LLM-as-a-Judge
Automation framework using LLM-as-a-judge to evaluate Agentic AI, RAG, Text2SQL at scale;...
explore-de/rage4j
Evaluate your LLM-based Java apps
mit-ll-ai-technology/llm-sandbox
Large language model evaluation framework for logic and open-ended Q&A with a variety of RAG and...