peng-gao-lab/CTIArena
The first benchmark to evaluate LLM performance on heterogeneous CTI under knowledge-augmented settings.
This project helps cybersecurity researchers and developers evaluate how well large language models (LLMs) understand and reason about cyber threat intelligence (CTI). It takes diverse CTI data (structured, unstructured, and hybrid) together with an LLM's responses and produces a performance benchmark. The primary users are cybersecurity researchers, ML engineers working in security, and AI developers building CTI analysis tools.
Use this if you need to quantitatively measure and compare the effectiveness of different LLMs in processing and understanding complex, real-world cyber threat intelligence.
Not ideal if you are looking for an off-the-shelf CTI analysis tool or a solution for direct threat detection in an operational security environment.
Stars
9
Forks
—
Language
Python
License
—
Category
—
Last pushed
Oct 15, 2025
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/rag/peng-gao-lab/CTIArena"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
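If you prefer calling the API from Python, here is a minimal sketch of the same request (it assumes the endpoint returns JSON; no key is needed at the 100 requests/day tier):

import requests

API_URL = "https://pt-edge.onrender.com/api/v1/quality/rag/peng-gao-lab/CTIArena"

resp = requests.get(API_URL, timeout=30)  # anonymous access, up to 100 requests/day
resp.raise_for_status()                   # fail loudly on HTTP errors

data = resp.json()                        # assumed JSON payload
print(data)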
Higher-rated alternatives
modelscope/evalscope
A streamlined and customizable framework for efficient large model (LLM, VLM, AIGC) evaluation...
izam-mohammed/ragrank
🎯 Your free LLM evaluation toolkit helps you assess the accuracy of facts, how well it...
Kareem-Rashed/rubric-eval
Independent framework to test, benchmark, and evaluate LLMs & AI agents locally.
justplus/llm-eval
A large language model evaluation platform supporting multiple evaluation benchmarks, custom datasets, and performance testing. Also supports RAG evaluation on custom datasets.
relari-ai/continuous-eval
Data-Driven Evaluation for LLM-Powered Applications