toxy4ny/redteam-ai-benchmark
Red Team AI Benchmark: Evaluating Uncensored LLMs for Offensive Security
This project helps red team operators and penetration testers objectively assess if an AI assistant, especially a local Large Language Model (LLM), is genuinely useful for offensive security tasks. It takes a local LLM or an API-based LLM as input and evaluates its responses to 12 targeted questions covering advanced red team techniques. The output is a clear score indicating whether the LLM is suitable for real-world penetration testing, helping security professionals choose reliable AI tools.
Use this if you need to determine if an uncensored AI model can provide accurate, working code and technical advice for complex penetration testing scenarios, rather than generic or refused answers.
Not ideal if you are looking to evaluate LLMs for general-purpose coding assistance, creative writing, or tasks outside of offensive security and cybersecurity.
Stars
27
Forks
3
Language
Python
License
MIT
Category
Last pushed
Dec 25, 2025
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/rag/toxy4ny/redteam-ai-benchmark"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
LLAMATOR-Core/llamator
Red Teaming python-framework for testing chatbots and GenAI systems.
sleeepeer/PoisonedRAG
[USENIX Security 2025] PoisonedRAG: Knowledge Corruption Attacks to Retrieval-Augmented...
kelkalot/simpleaudit
Allows to red-team your AI systems through adversarial probing. It is simple, effective, and...
JuliusHenke/autopentest
CLI enabling more autonomous black-box penetration tests using Large Language Models (LLMs)
SecurityClaw/SecurityClaw
A modular, skill-based autonomous Security Operations Center (SOC) agent that monitors...