IBM/unitxt
🦄 Unitxt is a Python library for enterprise-grade evaluation of AI performance, offering the world's largest catalog of tools and data for end-to-end AI benchmarking
This tool helps AI and machine learning engineers reliably measure the performance of different AI models across various tasks like text generation, image recognition, or code completion. You provide your AI model and a task, and it outputs detailed performance scores and benchmarks. It is designed for AI practitioners who need to rigorously test and compare their models before deployment.
211 stars. Used by 1 other package. Available on PyPI.
Use this if you need a standardized, comprehensive, and reproducible way to evaluate your AI models against a wide range of existing benchmarks or custom datasets.
Not ideal if you are looking for a simple, single-metric evaluation for a small, one-off model test.
Stars: 211
Forks: 65
Language: Python
License: Apache-2.0
Category:
Last pushed: Feb 16, 2026
Commits (30d): 0
Dependencies: 4
Reverse dependents: 1
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/IBM/unitxt"
Open to everyone: 100 requests/day with no key required. A free key raises the limit to 1,000 requests/day.
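The endpoint above can also be queried from Python with only the standard library. This is a minimal sketch: the URL shape is taken from the curl example, and since the page does not document how an API key is passed, only anonymous access is shown. The `endpoint_url` and `fetch_quality` helpers are illustrative names, not part of any published client.

```python
import json
import urllib.request

# Base path taken from the curl example on this page.
BASE = "https://pt-edge.onrender.com/api/v1/quality/llm-tools"


def endpoint_url(owner: str, repo: str) -> str:
    """Build the per-repository endpoint, mirroring the curl example."""
    return f"{BASE}/{owner}/{repo}"


def fetch_quality(owner: str, repo: str) -> dict:
    """Fetch the quality record for a repository (anonymous tier, 100 req/day)."""
    with urllib.request.urlopen(endpoint_url(owner, repo)) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Prints the same URL the curl example hits.
    print(endpoint_url("IBM", "unitxt"))
```

Swap in any `owner/repo` pair from the catalog to retrieve that tool's record.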
Featured in
Related tools
open-compass/opencompass
OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral,...
lean-dojo/LeanDojo
Tool for data extraction and interacting with Lean programmatically.
GoodStartLabs/AI_Diplomacy
Frontier Models playing the board game Diplomacy.
google/litmus
Litmus is a comprehensive LLM testing and evaluation tool designed for GenAI Application...
salesforce/CodeT5
Home of CodeT5: Open Code LLMs for Code Understanding and Generation