LLM Evaluation & Benchmarking: Transformer Models

We track 30 LLM evaluation and benchmarking projects in this category. The highest-rated is allenai/RL4LMs, scoring 45/100 with 2,382 stars.

Fetch all 30 projects as JSON:

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=transformers&subcategory=llm-evaluation-benchmarking&limit=30"

The API is open to everyone: 100 requests/day without a key, or 1,000/day with a free API key.

| # | Model | Description | Score | Tier |
|---|-------|-------------|-------|------|
| 1 | allenai/RL4LMs | A modular RL library to fine-tune language models to human preferences | 45 | Emerging |
| 2 | emredeveloper/Mem-LLM | Mem-LLM is a Python library for building memory-enabled AI assistants that... | 44 | Emerging |
| 3 | cloudguruab/modsysML | Human reinforcement learning (RLHF) framework for AI models. Evaluate and... | 41 | Emerging |
| 4 | ManasVardhan/bench-my-llm | 🏎️ Dead-simple LLM benchmarking CLI - latency, cost, and quality metrics | 38 | Emerging |
| 5 | modal-labs/stopwatch | A tool for benchmarking LLMs on Modal | 36 | Emerging |
| 6 | Mya-Mya/CBF-LLM | "CBF-LLM: Safe Control for LLM Alignment" | 35 | Emerging |
| 7 | lpalbou/AbstractLLM | A unified interface for Large Language Models with memory, reasoning, and... | 33 | Emerging |
| 8 | IIT-DM/BattleofLLMs | Benchmarks of LLMs with conversational QA datasets | 33 | Emerging |
| 9 | JonnoB/training_lms_with_synthetic_data | A repo for training language models to correct errors in OCR text | 33 | Emerging |
| 10 | kanchengw/cnllm | A unified adapter library for Chinese LLMs that wraps mainstream Chinese model API outputs in the OpenAI format, working seamlessly with openai, langchain, and most other OpenAI-compatible Python libraries | 32 | Emerging |
| 11 | RAravindDS/CharLLMs | Implementing easy-to-use "character-level language models" 🕺🏽 | 30 | Emerging |
| 12 | seclab-yonsei/mia-ko-lm | Performing membership inference attacks (MIA) against Korean language models (LMs) | 29 | Experimental |
| 13 | anto18671/lumenspark | Lumenspark is a lightweight Linformer-based language model trained from scratch | 26 | Experimental |
| 14 | Adora-Foundation/llm-energy-lab | Web application for benchmarking and comparing LLM behaviour, energy and... | 25 | Experimental |
| 15 | ossirytk/llm_resources | Information and resources on everything related to running large language... | 23 | Experimental |
| 16 | MukundaKatta/ModelMux | ModelMux: Multi-Model Router. Intelligent multi-model routing and fallback... | 22 | Experimental |
| 17 | khansavaleria/likelihoodlum | Detect if a GitHub repo's code was likely generated by an LLM using commit... | 22 | Experimental |
| 18 | gmelli/llm-connectivity | Unified Python interface for multiple Large Language Model providers... | 22 | Experimental |
| 19 | gmelli/llm-judge | A robust Python library for evaluating content using Large Language Models as judges | 22 | Experimental |
| 20 | alextra-lab/slm_server | Unified LLM server with nginx reverse proxy and intelligent routing based on model ID | 22 | Experimental |
| 21 | D0men1c0/Benchmark-Gemma-Models | Highly customizable Python suite for LLM evaluation (Gemma, LLaMA+). Full... | 22 | Experimental |
| 22 | zenprocess/pawbench | PawBench: 4-dimensional LLM inference benchmark. Multi-turn, multi-agent,... | 22 | Experimental |
| 23 | mrconter1/PullRequestBenchmark | Evaluating LLMs' performance in PR reviews as an indicator for their... | 21 | Experimental |
| 24 | yc-w-cn/llm-leaderboard | LLM comparison leaderboard, helping users quickly compare performance metrics, pricing, and specifications across large language models | 20 | Experimental |
| 25 | ebarkhordar/voter-behavior-prediction-LLM | This project explores the predictive power of large language models (LLMs)... | 20 | Experimental |
| 26 | MChatzakis/ChatMGL | ChatMGL: a large language model fine-tuned for data science questions | 20 | Experimental |
| 27 | wa3dbk/llm-batch | LLM Inference CLI: batch inference with customizable templates | 19 | Experimental |
| 28 | glaciapag/locallm | A simple Python package that lets you interact with a large language model... | 17 | Experimental |
| 29 | madalinioana/intent-qualification | Hybrid company qualification pipeline using LLM intent parsing, vector... | 14 | Experimental |
| 30 | alok/llmvision | Visualize how LLMs tokenize text: see the world through the eyes of language models | 11 | Experimental |
