Humanity-s-Last-Code-Exam/HLCE
(EMNLP 2025 Findings) Source code and evaluation scripts for Humanity's Last Code Exam
This project helps researchers and developers evaluate the advanced code-generation capabilities of large language models (LLMs). It takes LLM-generated solutions to extremely difficult programming-competition problems and reports their correctness and performance. It is used by AI researchers, LLM developers, and academic institutions working on cutting-edge language models for complex coding tasks.
No commits in the last 6 months.
Use this if you need to rigorously test and benchmark advanced LLMs against the most challenging competitive programming problems from contests like ICPC World Finals and IOI.
Not ideal if you are looking to evaluate LLMs on everyday coding tasks or standard, less complex programming benchmarks.
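For context, the sketch below illustrates the general idea behind this kind of evaluation: run a generated solution against held-out test cases under a time limit and compare its output to the expected output. This is only an illustration under assumed conventions, not the repository's actual scripts; the file name, function, and test-case format are all hypothetical.

# Hypothetical illustration of judging one generated solution against test cases;
# not HLCE's actual evaluation code, whose interface is not described on this page.
import subprocess

def judge(solution_path: str, test_cases: list[tuple[str, str]],
          time_limit: float = 2.0) -> bool:
    """Return True if the solution's output matches every expected output."""
    for stdin_text, expected in test_cases:
        try:
            run = subprocess.run(
                ["python", solution_path],
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=time_limit,   # enforce a per-test time limit
            )
        except subprocess.TimeoutExpired:
            return False              # time limit exceeded
        if run.returncode != 0:
            return False              # runtime error
        if run.stdout.strip() != expected.strip():
            return False              # wrong answer
    return True

# Example: one tiny test case for a hypothetical solution.py that echoes a sum.
print(judge("solution.py", [("1 2\n", "3")]))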
Stars: 95
Forks: 7
Language: Python
License: —
Category: —
Last pushed: Aug 21, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/Humanity-s-Last-Code-Exam/HLCE"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
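If you prefer to fetch the same data from Python rather than curl, here is a minimal sketch using only the standard library. The endpoint is copied from the curl example above; the response schema is not documented here, so the field names below are assumptions and are looked up defensively.

import json
import urllib.request

# Endpoint from the curl example above; the open tier needs no API key.
URL = (
    "https://pt-edge.onrender.com/api/v1/quality/llm-tools/"
    "Humanity-s-Last-Code-Exam/HLCE"
)

with urllib.request.urlopen(URL, timeout=30) as resp:
    data = json.load(resp)

# Field names are assumptions, since the response schema is not shown here.
if isinstance(data, dict):
    for field in ("stars", "forks", "language", "last_pushed"):
        print(f"{field}: {data.get(field, 'n/a')}")
else:
    print(data)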
Higher-rated alternatives
EvolvingLMMs-Lab/lmms-eval
One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks
vibrantlabsai/ragas
Supercharge Your LLM Application Evaluations 🚀
open-compass/VLMEvalKit
Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks
EuroEval/EuroEval
The robust European language model benchmark.
Giskard-AI/giskard-oss
🐢 Open-Source Evaluation & Testing library for LLM Agents