llamator and redteam-ai-benchmark

These are complementary tools: LLAMATOR-Core provides a framework for executing red team tests against chatbots and GenAI systems, while redteam-ai-benchmark supplies a structured evaluation methodology and benchmark dataset for assessing LLM vulnerabilities in offensive security contexts.

llamator
59
Established
redteam-ai-benchmark
36
Emerging
Maintenance 10/25
Adoption 10/25
Maturity 25/25
Community 14/25
Maintenance 6/25
Adoption 7/25
Maturity 13/25
Community 10/25
Stars: 201
Forks: 20
Downloads:
Commits (30d): 0
Language: Python
License:
Stars: 27
Forks: 3
Downloads:
Commits (30d): 0
Language: Python
License: MIT
No risk flags
No Package No Dependents

About llamator

LLAMATOR-Core/llamator

Red Teaming python-framework for testing chatbots and GenAI systems.

This framework helps AI product managers and security engineers systematically test their chatbots and generative AI systems for vulnerabilities. You provide it with your chatbot or GenAI system, and it outputs a test report documenting potential issues like prompt injection, data leakage, and misinformation. This is for professionals responsible for the safety and robustness of AI applications.

AI safety chatbot testing Generative AI security AI ethics risk assessment

About redteam-ai-benchmark

toxy4ny/redteam-ai-benchmark

Red Team AI Benchmark: Evaluating Uncensored LLMs for Offensive Security

This project helps red team operators and penetration testers objectively assess if an AI assistant, especially a local Large Language Model (LLM), is genuinely useful for offensive security tasks. It takes a local LLM or an API-based LLM as input and evaluates its responses to 12 targeted questions covering advanced red team techniques. The output is a clear score indicating whether the LLM is suitable for real-world penetration testing, helping security professionals choose reliable AI tools.

penetration-testing red-teaming offensive-security cybersecurity-auditing LLM-evaluation

Scores updated daily from GitHub, PyPI, and npm data. How scores work