hhan1018/NesTools
[COLING 2025] NesTools: A Dataset for Evaluating Nested Tool Learning Abilities of Large Language Models
This project helps AI researchers and developers evaluate how well large language models (LLMs) can learn and use multiple tools in complex, nested sequences. Given an LLM's responses and the evaluation settings, it outputs performance metrics for nested tool learning. It is aimed at those working on improving LLM capabilities in advanced reasoning and automation.
No commits in the last 6 months.
Use this if you are developing or benchmarking large language models and need to rigorously test how well they handle complex, multi-step tasks in which tools must be applied sequentially, with one tool's output feeding into the next call.
Not ideal if you are an end-user looking to apply an LLM to a specific business problem, rather than evaluating the LLM's core capabilities.
Stars
18
Forks
3
Language
Python
License
Apache-2.0
Category
Last pushed
Jan 18, 2025
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/hhan1018/NesTools"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
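The curl command above can also be scripted. The sketch below builds the endpoint URL from its visible path components and shows where a fetch would go; the URL layout is inferred from the single example above, and the JSON field names in the comments are assumptions, not documented API output.

```python
import json
import urllib.request

BASE = "https://pt-edge.onrender.com/api/v1/quality"

def quality_url(category: str, owner: str, repo: str) -> str:
    # Path layout inferred from the example endpoint:
    #   /api/v1/quality/<category>/<owner>/<repo>
    return f"{BASE}/{category}/{owner}/{repo}"

url = quality_url("nlp", "hhan1018", "NesTools")
print(url)

# Fetching requires network access; uncomment to try it.
# The field names ("stars", "forks") are illustrative guesses.
# with urllib.request.urlopen(url) as resp:
#     data = json.load(resp)
#     print(data.get("stars"), data.get("forks"))
```

For the higher 1,000 requests/day tier, the free key would presumably be passed as a header or query parameter; check the API's own documentation for the exact mechanism.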
Related tools
TableBench/TableBench
Official repository for paper "TableBench: A Comprehensive and Complex Benchmark for Table...
asaakyan/ngram-creativity
Repository for the paper Death of the Novel(ty): Beyond n-Gram Novelty as a Metric for Textual Creativity
sileod/nlp-verbal-probabilities-reasoning
Probing handling of verbal probabilities in NLP models