AUCOHL/RTL-Repo
RTL-Repo: A Benchmark for Evaluating LLMs on Large-Scale RTL Design Projects - IEEE LAD'24
This benchmark helps hardware design engineers evaluate how well large language models (LLMs) generate Verilog code completions within complex, multi-file projects. It takes a trained LLM and a dataset of code samples drawn from real-world Verilog projects as input, then reports metrics such as Edit Similarity and Exact Match to quantify the model's performance. The ideal end user is a hardware design engineer or an LLM researcher focused on hardware description languages.
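For intuition, the sketch below shows one common way these line-level completion metrics are computed. It assumes the python-Levenshtein package, and the function names and whitespace handling are illustrative choices, not the repository's actual API.

import Levenshtein  # python-Levenshtein package

def exact_match(prediction: str, target: str) -> bool:
    # Exact Match: the predicted line equals the reference line after trimming whitespace.
    return prediction.strip() == target.strip()

def edit_similarity(prediction: str, target: str) -> float:
    # Edit Similarity: 1 - (Levenshtein distance / length of the longer string), in [0, 1].
    pred, tgt = prediction.strip(), target.strip()
    if not pred and not tgt:
        return 1.0
    return 1.0 - Levenshtein.distance(pred, tgt) / max(len(pred), len(tgt))

print(exact_match("assign y = a & b;", "assign y = a & b;"))     # True
print(edit_similarity("assign y = a & b;", "assign y = a | b;")) # ~0.94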
No commits in the last 6 months.
Use this if you need to assess how accurately an LLM can generate logically consistent and syntactically correct Verilog code within the context of large digital design projects.
Not ideal if you are looking for an LLM to generate entire RTL designs from high-level specifications, as the benchmark focuses on code completion within existing projects.
Stars: 34
Forks: 2
Language: Python
License: Apache-2.0
Category:
Last pushed: Jun 05, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/AUCOHL/RTL-Repo"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
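The same endpoint can be called from Python; a hedged sketch follows. Only the URL comes from this page, and the response schema is not documented here, so the script simply prints whatever JSON the server returns.

import requests

url = "https://pt-edge.onrender.com/api/v1/quality/transformers/AUCOHL/RTL-Repo"
resp = requests.get(url, timeout=10)
resp.raise_for_status()  # surfaces rate-limit (HTTP 429) and other errors
print(resp.json())       # inspect the returned fields rather than assuming a schema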
Higher-rated alternatives
stanfordnlp/axbench
Stanford NLP Python library for benchmarking the utility of LLM interpretability methods
aidatatools/ollama-benchmark
LLM Benchmark for Throughput via Ollama (Local LLMs)
LarHope/ollama-benchmark
Ollama-based benchmark with detailed input/output tokens-per-second reporting; Python, with a DeepSeek R1 example.
qcri/LLMeBench
Benchmarking Large Language Models
THUDM/LongBench
LongBench v2 and LongBench (ACL '25 & '24)