night-chen/ToolQA

ToolQA is a new dataset for evaluating the ability of LLMs to answer challenging questions with external tools. It offers two difficulty levels (easy/hard) across eight real-life scenarios.

Score: 36 / 100 (Emerging)

This project provides a specialized dataset called ToolQA for evaluating how well large language models (LLMs) can answer complex questions that require using external tools. It includes diverse questions from various domains like flight data, Yelp reviews, and scientific texts, along with the corresponding external knowledge sources and potential tools. AI researchers and developers working on improving LLMs' ability to interact with real-world data and execute multi-step reasoning would use this to benchmark their models.

286 stars. No commits in the last 6 months.

Use this if you are a developer or researcher testing or building large language models (LLMs) and need a robust, diverse dataset to evaluate their ability to answer complex questions by using external data and tools.

Not ideal if you are an end-user looking for a direct application to solve a problem with an LLM, as this is a dataset and toolkit for LLM development and evaluation, not a ready-to-use product.

Tags: LLM evaluation, tool-augmented AI, natural language processing, AI research, machine learning datasets
Badges: Stale (6 months), No Package, No Dependents
Maintenance: 0 / 25
Adoption: 10 / 25
Maturity: 16 / 25
Community: 10 / 25
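The overall score appears to be the simple sum of the four sub-scores. A minimal sketch of that arithmetic (the additive formula is an assumption inferred from the displayed values, not documented by the site):

```python
# Sub-scores as displayed on the page, each out of 25.
subscores = {
    "Maintenance": 0,
    "Adoption": 10,
    "Maturity": 16,
    "Community": 10,
}

# Assumption: the overall score is the plain sum of the four parts.
total = sum(subscores.values())
print(f"{total} / 100")  # 36 / 100, matching the displayed score
```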


Stars: 286
Forks: 14
Language: Jupyter Notebook
License: Apache-2.0
Last pushed: Aug 19, 2023
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/night-chen/ToolQA"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
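For programmatic access, a minimal Python sketch wrapping the endpoint above using only the standard library. The JSON field names in the response are not documented here, so the example just decodes and prints the payload; `build_url` and `fetch_quality` are illustrative helper names:

```python
import json
import urllib.request

# Base endpoint taken from the curl example above.
API_BASE = "https://pt-edge.onrender.com/api/v1/quality/llm-tools"

def build_url(owner: str, repo: str) -> str:
    """Build the quality-API URL for a given GitHub repository."""
    return f"{API_BASE}/{owner}/{repo}"

def fetch_quality(owner: str, repo: str) -> dict:
    """Fetch the quality record as JSON. The response schema is
    undocumented here, so we only assume it is valid JSON."""
    with urllib.request.urlopen(build_url(owner, repo)) as resp:
        return json.load(resp)

# Example usage (performs a live request, subject to the 100/day limit):
# data = fetch_quality("night-chen", "ToolQA")
# print(json.dumps(data, indent=2))
```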