LLM Evaluation & Benchmarking: Transformer Models

We track 30 LLM evaluation and benchmarking projects in this category. The highest-rated is allenai/RL4LMs, scoring 45/100 with 2,382 stars.

Fetch all 30 projects as JSON:

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=transformers&subcategory=llm-evaluation-benchmarking&limit=30"

The API is open to everyone: 100 requests/day without a key, or 1,000/day with a free API key.

| # | Model | Description | Score | Tier |
|---|-------|-------------|-------|------|
| 1 | allenai/RL4LMs | A modular RL library to fine-tune language models to human preferences | 45 | Emerging |
| 2 | emredeveloper/Mem-LLM | Mem-LLM is a Python library for building memory-enabled AI assistants that... | 44 | Emerging |
| 3 | cloudguruab/modsysML | Human reinforcement learning (RLHF) framework for AI models. Evaluate and... | 41 | Emerging |
| 4 | ManasVardhan/bench-my-llm | 🏎️ Dead-simple LLM benchmarking CLI - latency, cost, and quality metrics | 38 | Emerging |
| 5 | modal-labs/stopwatch | A tool for benchmarking LLMs on Modal | 36 | Emerging |
| 6 | Mya-Mya/CBF-LLM | "CBF-LLM: Safe Control for LLM Alignment" | 35 | Emerging |
| 7 | lpalbou/AbstractLLM | A unified interface for Large Language Models with memory, reasoning, and... | 33 | Emerging |
| 8 | IIT-DM/BattleofLLMs | Benchmarks of LLMs with conversational QA datasets | 33 | Emerging |
| 9 | JonnoB/training_lms_with_synthetic_data | A repo for training language models to correct errors in OCR text | 33 | Emerging |
| 10 | kanchengw/cnllm | A unified adapter library for Chinese LLMs that wraps mainstream Chinese model API outputs in the OpenAI format, working seamlessly with openai, langchain, and most other OpenAI-compatible Python libraries | 32 | Emerging |
| 11 | RAravindDS/CharLLMs | Implementing easy-to-use "character-level language models" 🕺🏽 | 30 | Emerging |
| 12 | seclab-yonsei/mia-ko-lm | Performing membership inference attacks (MIA) against Korean language models (LMs) | 29 | Experimental |
| 13 | anto18671/lumenspark | Lumenspark is a lightweight Linformer-based language model trained from scratch | 26 | Experimental |
| 14 | Adora-Foundation/llm-energy-lab | Web application for benchmarking and comparing LLM behaviour, energy and... | 25 | Experimental |
| 15 | ossirytk/llm_resources | Information and resources on everything related to running large language... | 23 | Experimental |
| 16 | MukundaKatta/ModelMux | ModelMux: Multi-Model Router. Intelligent multi-model routing and fallback... | 22 | Experimental |
| 17 | khansavaleria/likelihoodlum | Detect if a GitHub repo's code was likely generated by an LLM using commit... | 22 | Experimental |
| 18 | gmelli/llm-connectivity | Unified Python interface for multiple Large Language Model providers... | 22 | Experimental |
| 19 | gmelli/llm-judge | A robust Python library for evaluating content using Large Language Models as judges | 22 | Experimental |
| 20 | alextra-lab/slm_server | Unified LLM server with nginx reverse proxy and intelligent routing based on model ID | 22 | Experimental |
| 21 | D0men1c0/Benchmark-Gemma-Models | Highly customizable Python suite for LLM evaluation (Gemma, LLaMA+). Full... | 22 | Experimental |
| 22 | zenprocess/pawbench | PawBench: 4-dimensional LLM inference benchmark. Multi-turn, multi-agent,... | 22 | Experimental |
| 23 | mrconter1/PullRequestBenchmark | Evaluating LLMs' performance in PR reviews as an indicator for their... | 21 | Experimental |
| 24 | yc-w-cn/llm-leaderboard | LLM comparison leaderboard, helping users quickly compare performance metrics, pricing, and specifications across large language models | 20 | Experimental |
| 25 | ebarkhordar/voter-behavior-prediction-LLM | This project explores the predictive power of large language models (LLMs)... | 20 | Experimental |
| 26 | MChatzakis/ChatMGL | ChatMGL: a large language model fine-tuned for data science questions | 20 | Experimental |
| 27 | wa3dbk/llm-batch | LLM Inference CLI: batch inference with customizable templates | 19 | Experimental |
| 28 | glaciapag/locallm | A simple Python package that lets you interact with a large language model... | 17 | Experimental |
| 29 | madalinioana/intent-qualification | Hybrid company qualification pipeline using LLM intent parsing, vector... | 14 | Experimental |
| 30 | alok/llmvision | Visualize how LLMs tokenize text: see the world through the eyes of language models | 11 | Experimental |
