night-chen/ToolQA
ToolQA is a new dataset for evaluating the capabilities of LLMs in answering challenging questions with external tools. It offers two difficulty levels (easy and hard) across eight real-life scenarios.
This project provides a specialized dataset called ToolQA for evaluating how well large language models (LLMs) can answer complex questions that require using external tools. It includes diverse questions from various domains like flight data, Yelp reviews, and scientific texts, along with the corresponding external knowledge sources and potential tools. AI researchers and developers working on improving LLMs' ability to interact with real-world data and execute multi-step reasoning would use this to benchmark their models.
286 stars. No commits in the last 6 months.
Use this if you are a developer or researcher testing or building large language models (LLMs) and need a robust, diverse dataset to evaluate their ability to answer complex questions by using external data and tools.
Not ideal if you are an end-user looking for a direct application to solve a problem with an LLM, as this is a dataset and toolkit for LLM development and evaluation, not a ready-to-use product.
Stars: 286
Forks: 14
Language: Jupyter Notebook
License: Apache-2.0
Category:
Last pushed: Aug 19, 2023
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/night-chen/ToolQA"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
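The same request can be made from a script. The following is a minimal Python sketch, not an official client: the helper names `build_quality_url` and `fetch_quality` are hypothetical, the `category/owner/repo` path layout is inferred from the single curl example above, and a JSON response body is assumed rather than documented here.

```python
import json
import urllib.request
from urllib.parse import quote

# Base path taken from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_quality_url(category: str, owner: str, repo: str) -> str:
    """Build the endpoint URL (path layout inferred from the curl example)."""
    # Percent-encode each path segment so unusual names stay valid in a URL.
    parts = [quote(part, safe="") for part in (category, owner, repo)]
    return "/".join([BASE_URL, *parts])


def fetch_quality(category: str, owner: str, repo: str, timeout: float = 10.0):
    """Fetch the record for one repository.

    Assumption: the endpoint returns JSON. If it does not, json.loads
    raises a ValueError and the raw body should be inspected instead.
    """
    url = build_quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Prints the URL only; call fetch_quality(...) to perform the request.
    print(build_quality_url("llm-tools", "night-chen", "ToolQA"))
```

With the keyless limit of 100 requests/day, it is probably worth caching responses locally rather than re-fetching on every run.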
Higher-rated alternatives
monarch-initiative/ontogpt
LLM-based ontological extraction tools, including SPIRES
weAIDB/awesome-data-llm
Official Repository of "LLM × DATA" Survey Paper
AXYZdong/AMchat
AM (Advanced Mathematics) Chat is a large language model that integrates advanced mathematical...
skywalker023/sodaverse
Code and dataset for our EMNLP 2023 paper - "SODA: Million-scale Dialogue Distillation with...
Y-Research-SBU/TimeSeriesScientist
Official Repository for TimeSeriesScientist