cloudguruab/modsysML
Reinforcement learning from human feedback (RLHF) framework for AI models. Evaluate and compare LLM outputs, test quality, catch regressions, and automate evaluations.
This tool helps AI engineers and product managers systematically evaluate and compare the outputs of large language models (LLMs). You feed it different prompts and test cases, and it produces a table view or structured data (such as JSON or CSV) showing how each prompt performs, so you can quickly identify the best-performing prompts and catch performance regressions.
Use this if you are a machine learning engineer or product manager who needs to rigorously test and compare different LLM prompts across many scenarios to ensure model quality and catch regressions.
Not ideal if you need a graphical user interface for visual testing and reporting, as it primarily works via command line or Python library.
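To make the workflow described above concrete, here is a minimal sketch of a prompt-comparison loop in plain Python. This is NOT modsysML's actual API: the `call_model` stub, the `exact_match` metric, and the CSV output are placeholders illustrating the general pattern (run each prompt against each test case, score the outputs, emit a comparison table) that you would replace with the library's own primitives or your own client code.

```python
import csv
from typing import Callable

def call_model(prompt: str, case: str) -> str:
    """Placeholder for an LLM call; swap in your provider's client here."""
    return f"response to: {prompt.format(input=case)}"

def exact_match(output: str, expected: str) -> float:
    """Toy metric: 1.0 if the output matches the expected answer, else 0.0."""
    return 1.0 if output.strip() == expected.strip() else 0.0

def compare_prompts(prompts: dict, test_cases: list, metric: Callable[[str, str], float]):
    """Run every prompt template against every test case and average the scores."""
    rows = []
    for name, template in prompts.items():
        scores = [metric(call_model(template, case), expected)
                  for case, expected in test_cases]
        rows.append({"prompt": name, "avg_score": sum(scores) / len(scores)})
    return rows

if __name__ == "__main__":
    prompts = {
        "terse": "Answer briefly: {input}",
        "verbose": "Think step by step, then answer: {input}",
    }
    test_cases = [("2 + 2", "4"), ("capital of France", "Paris")]
    results = compare_prompts(prompts, test_cases, exact_match)
    # Write the comparison table as CSV, one row per prompt.
    with open("prompt_comparison.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["prompt", "avg_score"])
        writer.writeheader()
        writer.writerows(results)
```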
Stars: 36
Forks: 5
Language: Python
License: Apache-2.0
Category:
Last pushed: Dec 01, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/cloudguruab/modsysML"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
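If you prefer fetching this from Python rather than curl, a short `requests` call against the same endpoint works; this sketch assumes the endpoint returns a JSON payload, so adjust the parsing if the response format differs.

```python
import requests

# Same endpoint as the curl example above.
url = "https://pt-edge.onrender.com/api/v1/quality/transformers/cloudguruab/modsysML"

resp = requests.get(url, timeout=10)
resp.raise_for_status()

data = resp.json()  # assumed JSON payload
print(data)
```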
Higher-rated alternatives
allenai/RL4LMs
A modular RL library to fine-tune language models to human preferences
emredeveloper/Mem-LLM
Mem-LLM is a Python library for building memory-enabled AI assistants that run entirely on local...
ManasVardhan/bench-my-llm
🏎️ Dead-simple LLM benchmarking CLI - latency, cost, and quality metrics
modal-labs/stopwatch
A tool for benchmarking LLMs on Modal
Mya-Mya/CBF-LLM
"CBF-LLM: Safe Control for LLM Alignment"