toxy4ny/redteam-ai-benchmark

Red Team AI Benchmark: Evaluating Uncensored LLMs for Offensive Security

/ 100

Emerging

This project helps red team operators and penetration testers objectively assess if an AI assistant, especially a local Large Language Model (LLM), is genuinely useful for offensive security tasks. It takes a local LLM or an API-based LLM as input and evaluates its responses to 12 targeted questions covering advanced red team techniques. The output is a clear score indicating whether the LLM is suitable for real-world penetration testing, helping security professionals choose reliable AI tools.

Use this if you need to determine if an uncensored AI model can provide accurate, working code and technical advice for complex penetration testing scenarios, rather than generic or refused answers.

Not ideal if you are looking to evaluate LLMs for general-purpose coding assistance, creative writing, or tasks outside of offensive security and cybersecurity.

penetration-testing red-teaming offensive-security cybersecurity-auditing LLM-evaluation

No Package No Dependents

Maintenance 6 / 25

Adoption 7 / 25

Maturity 13 / 25

Community 10 / 25

How are scores calculated?

Stars

Forks

Language

Python

License

MIT

Compare

redteam-ai-benchmark and llamator

Higher-rated alternatives

LLAMATOR-Core/llamator

Red Teaming python-framework for testing chatbots and GenAI systems.

sleeepeer/PoisonedRAG

[USENIX Security 2025] PoisonedRAG: Knowledge Corruption Attacks to Retrieval-Augmented...

kelkalot/simpleaudit

Allows to red-team your AI systems through adversarial probing. It is simple, effective, and...

JuliusHenke/autopentest

CLI enabling more autonomous black-box penetration tests using Large Language Models (LLMs)

SecurityClaw/SecurityClaw

A modular, skill-based autonomous Security Operations Center (SOC) agent that monitors...

Explore RAG Tools

All categories Trending RAG directory Insights