cloudguruab/modsysML
Reinforcement learning from human feedback (RLHF) framework for AI models. Evaluate and compare LLM outputs, test quality, catch regressions, and automate evaluations.
This tool helps AI engineers and product managers systematically evaluate and compare the outputs of large language models (LLMs). You feed it different prompts and test cases, and it produces a table view or structured data (such as JSON or CSV) showing how each prompt performs, so you can quickly identify the best-performing prompts and catch performance regressions.
Use this if you are a machine learning engineer or product manager who needs to rigorously test and compare different LLM prompts across many scenarios to ensure model quality and catch regressions.
Not ideal if you need a graphical user interface for visual testing and reporting, as it primarily works via command line or Python library.
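To make the workflow described above concrete, here is a minimal sketch of a prompt-comparison loop in plain Python. This is NOT modsysML's actual API: the `call_model` stub, the `exact_match` metric, and the CSV output are placeholders illustrating the general pattern (run each prompt against each test case, score the outputs, emit a comparison table) that you would replace with the library's own primitives or your own client code.

```python
import csv
from typing import Callable

def call_model(prompt: str, case: str) -> str:
    """Placeholder for an LLM call; swap in your provider's client here."""
    return f"response to: {prompt.format(input=case)}"

def exact_match(output: str, expected: str) -> float:
    """Toy metric: 1.0 if the output matches the expected answer, else 0.0."""
    return 1.0 if output.strip() == expected.strip() else 0.0

def compare_prompts(prompts: dict, test_cases: list, metric: Callable[[str, str], float]):
    """Run every prompt template against every test case and average the scores."""
    rows = []
    for name, template in prompts.items():
        scores = [metric(call_model(template, case), expected)
                  for case, expected in test_cases]
        rows.append({"prompt": name, "avg_score": sum(scores) / len(scores)})
    return rows

if __name__ == "__main__":
    prompts = {
        "terse": "Answer briefly: {input}",
        "verbose": "Think step by step, then answer: {input}",
    }
    test_cases = [("2 + 2", "4"), ("capital of France", "Paris")]
    results = compare_prompts(prompts, test_cases, exact_match)
    # Write the comparison table as CSV, one row per prompt.
    with open("prompt_comparison.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["prompt", "avg_score"])
        writer.writeheader()
        writer.writerows(results)
```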
Stars: 36
Forks: 5
Language: Python
License: Apache-2.0
Category:
Last pushed: Dec 01, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/cloudguruab/modsysML"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
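If you prefer fetching this from Python rather than curl, a short `requests` call against the same endpoint works; this sketch assumes the endpoint returns a JSON payload, so adjust the parsing if the response format differs.

```python
import requests

# Same endpoint as the curl example above.
url = "https://pt-edge.onrender.com/api/v1/quality/transformers/cloudguruab/modsysML"

resp = requests.get(url, timeout=10)
resp.raise_for_status()

data = resp.json()  # assumed JSON payload
print(data)
```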
Higher-rated alternatives
allenai/RL4LMs
A modular RL library to fine-tune language models to human preferences
emredeveloper/Mem-LLM
Mem-LLM is a Python library for building memory-enabled AI assistants that run entirely on local...
ManasVardhan/bench-my-llm
🏎️ Dead-simple LLM benchmarking CLI - latency, cost, and quality metrics
modal-labs/stopwatch
A tool for benchmarking LLMs on Modal
Mya-Mya/CBF-LLM
"CBF-LLM: Safe Control for LLM Alignment"