wschella/llm-reliability
Code for the paper "Larger and more instructable language models become less reliable"
This project offers tools to evaluate how consistently large language models (LLMs) respond to instructions, especially as they become larger and more instructable. It takes benchmark datasets and LLM outputs as input and produces graded results showing how reliable the models are on tasks such as addition, anagram solving, and geographical locality questions (see the grading sketch below). LLM developers, researchers, and product managers can use it to assess and improve the dependability of their models.
No commits in the last 6 months.
Use this if you need to rigorously test and understand the reliability of large language models across different tasks and identify persistent issues like prompt sensitivity.
Not ideal if you are looking for an off-the-shelf solution for fine-tuning or deploying LLMs, as this is primarily an evaluation and research toolkit.
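The paper behind this repo grades each model response as correct, avoidant, or incorrect. Below is a minimal Python sketch of that grading idea for the addition task; the grade_addition helper and the avoidance marker phrases are hypothetical illustrations, not the repo's actual API.

import re

# Hypothetical illustration of the grading idea (not the repo's actual code):
# each response is classified as correct, avoidant, or incorrect.
AVOIDANCE_MARKERS = ("i cannot", "i can't", "as an ai", "i'm not able")

def grade_addition(operands: tuple[int, int], response: str) -> str:
    """Classify one LLM response to an addition prompt."""
    expected = sum(operands)
    text = response.strip().lower()
    # Avoidant: the model declines rather than attempts an answer.
    if any(marker in text for marker in AVOIDANCE_MARKERS):
        return "avoidant"
    # Extract the first integer in the response (commas stripped).
    match = re.search(r"-?\d+", text.replace(",", ""))
    if match and int(match.group()) == expected:
        return "correct"
    return "incorrect"

if __name__ == "__main__":
    print(grade_addition((12345, 67890), "The sum is 80,235."))     # correct
    print(grade_addition((12345, 67890), "I cannot compute that."))  # avoidant
    print(grade_addition((12345, 67890), "80234"))                   # incorrect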
Stars: 31
Forks: 2
Language: Jupyter Notebook
License: MIT
Category: (not listed)
Last pushed: Oct 09, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/wschella/llm-reliability"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
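For scripted access, here is a minimal Python sketch of the same request. The URL comes from the curl command above; the response schema is an assumption, so the sketch just pretty-prints whatever JSON comes back.

import json
import urllib.request

# Same endpoint as the curl example above (unauthenticated tier,
# 100 requests/day per the note above).
URL = "https://pt-edge.onrender.com/api/v1/quality/transformers/wschella/llm-reliability"

with urllib.request.urlopen(URL, timeout=10) as resp:
    data = json.load(resp)  # assumes the endpoint returns JSON

print(json.dumps(data, indent=2))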
Higher-rated alternatives
PaddlePaddle/PaddleNLP
Easy-to-use and powerful LLM and SLM library with awesome model zoo.
meta-llama/llama-cookbook
Welcome to the Llama Cookbook! This is your go-to guide for building with Llama: getting started...
arcee-ai/mergekit
Tools for merging pretrained large language models.
changyeyu/LLM-RL-Visualized
100+ original LLM/RL principle diagrams, from the author of the book 《大模型算法》 (100+ LLM/RL Algorithm Maps)
mindspore-lab/step_into_llm
MindSpore online courses: Step into LLM