peng-gao-lab/CTIArena
The first benchmark to evaluate LLM performance on heterogeneous CTI under knowledge-augmented settings.
This project helps cybersecurity researchers and developers evaluate how well large language models (LLMs) understand and reason about cyber threat intelligence (CTI). It takes diverse CTI data (structured, unstructured, and hybrid) together with an LLM's responses and produces a performance benchmark. The primary users are cybersecurity researchers, ML engineers working in security, and AI developers building CTI analysis tools.
Use this if you need to quantitatively measure and compare the effectiveness of different LLMs in processing and understanding complex, real-world cyber threat intelligence.
Not ideal if you are looking for an off-the-shelf CTI analysis tool or a solution for direct threat detection in an operational security environment.
Stars
9
Forks
—
Language
Python
License
—
Category
—
Last pushed
Oct 15, 2025
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/rag/peng-gao-lab/CTIArena"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
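If you prefer calling the API from Python, here is a minimal sketch of the same request (it assumes the endpoint returns JSON; no key is needed at the 100 requests/day tier):

import requests

API_URL = "https://pt-edge.onrender.com/api/v1/quality/rag/peng-gao-lab/CTIArena"

resp = requests.get(API_URL, timeout=30)  # anonymous access, up to 100 requests/day
resp.raise_for_status()                   # fail loudly on HTTP errors

data = resp.json()                        # assumed JSON payload
print(data)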
Higher-rated alternatives
modelscope/evalscope
A streamlined and customizable framework for efficient large model (LLM, VLM, AIGC) evaluation...
izam-mohammed/ragrank
🎯 Your free LLM evaluation toolkit helps you assess the accuracy of facts, how well it...
Kareem-Rashed/rubric-eval
Independent framework to test, benchmark, and evaluate LLMs & AI agents locally.
justplus/llm-eval
A large language model evaluation platform supporting multiple evaluation benchmarks, custom datasets, and performance testing. Also supports RAG evaluation on custom datasets.
relari-ai/continuous-eval
Data-Driven Evaluation for LLM-Powered Applications