v7labs/benchllm
Continuous Integration for LLM powered applications
This tool helps AI engineers and developers ensure their Large Language Models (LLMs) and AI applications are working correctly. You input your LLM's code and a set of expected responses for various prompts, and it automatically tests your application. The output is a detailed report highlighting any inaccurate or 'hallucinated' responses, so you can fix them before deployment.
254 stars. No commits in the last 6 months. Available on PyPI.
Use this if you are building applications powered by LLMs, agents, or chains (like Langchain) and need to consistently verify their accuracy and prevent incorrect outputs across different versions.
Not ideal if you are not developing with Large Language Models, or if you need a fully stable and mature solution, since this project is still in rapid development.
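The core workflow described above — run each prompt through your application, compare the output against a set of expected responses, and report mismatches — can be sketched in plain Python. Note this is an illustrative sketch, not BenchLLM's actual API: the names `TestCase`, `run_suite`, and the exact-match comparison are assumptions (real tools typically use semantic or LLM-based comparison).

```python
from dataclasses import dataclass

@dataclass
class TestCase:
    prompt: str
    expected: list[str]  # matching any of these counts as a pass

def run_suite(model, cases):
    """Run each prompt through `model` and collect outputs that match
    none of the expected responses (naive exact-match check)."""
    failures = []
    for case in cases:
        output = model(case.prompt)
        if output not in case.expected:
            failures.append((case.prompt, output))
    return failures

# A stub "model" standing in for an LLM-powered application.
def toy_model(prompt):
    return {"What is 1+1?": "2"}.get(prompt, "I don't know")

cases = [
    TestCase("What is 1+1?", ["2", "two"]),
    TestCase("Capital of France?", ["Paris"]),
]
report = run_suite(toy_model, cases)
# report -> [("Capital of France?", "I don't know")]
```

In a CI pipeline, a non-empty report would fail the build, which is the "prevent incorrect outputs across versions" guarantee the tool aims for.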
Stars
254
Forks
13
Language
Python
License
MIT
Category
llm-tools
Last pushed
Aug 11, 2023
Commits (30d)
0
Dependencies
5
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/v7labs/benchllm"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
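The same request can be made from Python with the standard library. The base URL and the `llm-tools/v7labs/benchllm` path come from the curl command above; the `X-API-Key` header name and the JSON response format are assumptions — check the API documentation for the actual authentication scheme and schema.

```python
import json
import urllib.request

BASE = "https://pt-edge.onrender.com/api/v1/quality"

def build_request(category, owner, repo, api_key=None):
    """Build the GET request for a repository's quality data.
    The `X-API-Key` header name is an assumption, not documented here."""
    url = f"{BASE}/{category}/{owner}/{repo}"
    headers = {"X-API-Key": api_key} if api_key else {}
    return urllib.request.Request(url, headers=headers)

def fetch_quality(category, owner, repo, api_key=None):
    """Fetch and decode the JSON payload (assumed to be JSON)."""
    req = build_request(category, owner, repo, api_key)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

req = build_request("llm-tools", "v7labs", "benchllm")
# req.full_url matches the curl URL shown above
```

Without a key, stay under the 100 requests/day limit noted above.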
Higher-rated alternatives
open-compass/opencompass
OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral,...
IBM/unitxt
🦄 Unitxt is a Python library for enterprise-grade evaluation of AI performance, offering the...
lean-dojo/LeanDojo
Tool for data extraction and interacting with Lean programmatically.
GoodStartLabs/AI_Diplomacy
Frontier Models playing the board game Diplomacy.
google/litmus
Litmus is a comprehensive LLM testing and evaluation tool designed for GenAI Application...