haesleinhuepf/human-eval-bia
Benchmarking Large Language Models for Bio-Image Analysis Code Generation
This tool helps bio-image analysis researchers, lab managers, and scientists evaluate how well different large language models (LLMs) can write Python code for their specific image analysis tasks. You provide a set of bio-image analysis problems together with their human-written solutions, and the tool automatically tests the code each model generates, showing you which models solve the tasks correctly. The output is a report on the accuracy of each model's generated code.
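To make the workflow concrete, here is a minimal, hypothetical sketch of what a task pair could look like: a documented function acts as the prompt, its human-written body serves as the reference solution, and a small check verifies any candidate implementation. The names (count_labels, check_count_labels) and the use of scikit-image are invented for illustration and do not reflect the repository's actual test-case format.

# Hypothetical example only -- not the repository's own test-case format.
import numpy as np
from skimage.measure import label

def count_labels(binary_image):
    """Count the connected components (objects) in a binary image."""
    # Human-written reference solution; a model would be asked to
    # regenerate this body from the docstring alone.
    labeled = label(binary_image)
    return labeled.max()

def check_count_labels(candidate):
    """Verify a candidate implementation on a tiny image with two objects."""
    image = np.zeros((10, 10), dtype=bool)
    image[1:3, 1:3] = True   # first object
    image[6:9, 6:9] = True   # second object
    assert candidate(image) == 2

check_count_labels(count_labels)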
No commits in the last 6 months.
Use this if you need to compare different AI code-generation models to find the most reliable one for automating bio-image analysis scripting, or if you're developing new test cases for bio-image analysis programming challenges.
Not ideal if you are looking for a tool to generate bio-image analysis code directly, or if you only need to run existing analysis scripts.
Stars: 25
Forks: 14
Language: Jupyter Notebook
License: MIT
Category:
Last pushed: Nov 21, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/haesleinhuepf/human-eval-bia"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
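A minimal Python sketch of the same request, assuming the endpoint returns JSON (the response schema is not documented here):

# Same endpoint as the curl example above; a JSON response is an assumption.
import requests

url = "https://pt-edge.onrender.com/api/v1/quality/transformers/haesleinhuepf/human-eval-bia"
response = requests.get(url, timeout=30)
response.raise_for_status()
print(response.json())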
Higher-rated alternatives
eth-sri/matharena
Evaluation of LLMs on latest math competitions
tatsu-lab/alpaca_eval
An automatic evaluator for instruction-following language models. Human-validated, high-quality,...
HPAI-BSC/TuRTLe
TuRTLe: A Unified Evaluation of LLMs for RTL Generation 🐢 (MLCAD 2025)
nlp-uoregon/mlmm-evaluation
Multilingual Large Language Models Evaluation Benchmark
ShuntaroOkuma/adapt-gauge-core
Measure LLM adaptation efficiency — how fast models learn from few examples