hhan1018/NesTools
[COLING 2025] NesTools: A Dataset for Evaluating Nested Tool Learning Abilities of Large Language Models
This project helps AI researchers and developers evaluate how well large language models (LLMs) can learn and use multiple tools in complex, nested sequences. Given an LLM's responses and the evaluation settings, it outputs performance metrics for nested tool learning. It is aimed at those working on improving LLM capabilities in advanced reasoning and automation.
No commits in the last 6 months.
Use this if you are developing or benchmarking large language models and need to rigorously test how well they handle complex, multi-step tasks in which tools must be applied sequentially, with one tool's output feeding into the next call.
Not ideal if you are an end-user looking to apply an LLM to a specific business problem, rather than evaluating the LLM's core capabilities.
Stars
18
Forks
3
Language
Python
License
Apache-2.0
Category
Last pushed
Jan 18, 2025
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/hhan1018/NesTools"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
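The curl command above can also be scripted. The sketch below builds the endpoint URL from its visible path components and shows where a fetch would go; the URL layout is inferred from the single example above, and the JSON field names in the comments are assumptions, not documented API output.

```python
import json
import urllib.request

BASE = "https://pt-edge.onrender.com/api/v1/quality"

def quality_url(category: str, owner: str, repo: str) -> str:
    # Path layout inferred from the example endpoint:
    #   /api/v1/quality/<category>/<owner>/<repo>
    return f"{BASE}/{category}/{owner}/{repo}"

url = quality_url("nlp", "hhan1018", "NesTools")
print(url)

# Fetching requires network access; uncomment to try it.
# The field names ("stars", "forks") are illustrative guesses.
# with urllib.request.urlopen(url) as resp:
#     data = json.load(resp)
#     print(data.get("stars"), data.get("forks"))
```

For the higher 1,000 requests/day tier, the free key would presumably be passed as a header or query parameter; check the API's own documentation for the exact mechanism.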
Related tools
TableBench/TableBench
Official repository for paper "TableBench: A Comprehensive and Complex Benchmark for Table...
asaakyan/ngram-creativity
Repository for the paper Death of the Novel(ty): Beyond n-Gram Novelty as a Metric for Textual Creativity
sileod/nlp-verbal-probabilities-reasoning
Probing handling of verbal probabilities in NLP models