haesleinhuepf/human-eval-bia
Benchmarking Large Language Models for Bio-Image Analysis Code Generation
This tool helps bio-image analysis researchers, lab managers, and scientists evaluate how well different large language models (LLMs) can write Python code for their specific image analysis tasks. You provide a set of bio-image analysis problems together with their human-written solutions, and the tool automatically tests the code each model generates, showing you which models solve the tasks correctly. The output is a report on the accuracy of each model's generated code.
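To make the workflow concrete, here is a minimal, hypothetical sketch of what a task pair could look like: a documented function acts as the prompt, its human-written body serves as the reference solution, and a small check verifies any candidate implementation. The names (count_labels, check_count_labels) and the use of scikit-image are invented for illustration and do not reflect the repository's actual test-case format.

# Hypothetical example only -- not the repository's own test-case format.
import numpy as np
from skimage.measure import label

def count_labels(binary_image):
    """Count the connected components (objects) in a binary image."""
    # Human-written reference solution; a model would be asked to
    # regenerate this body from the docstring alone.
    labeled = label(binary_image)
    return labeled.max()

def check_count_labels(candidate):
    """Verify a candidate implementation on a tiny image with two objects."""
    image = np.zeros((10, 10), dtype=bool)
    image[1:3, 1:3] = True   # first object
    image[6:9, 6:9] = True   # second object
    assert candidate(image) == 2

check_count_labels(count_labels)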
No commits in the last 6 months.
Use this if you need to compare different AI code-generation models to find the most reliable one for automating bio-image analysis scripting, or if you're developing new test cases for bio-image analysis programming challenges.
Not ideal if you are looking for a tool to generate bio-image analysis code directly, or if you only need to run existing analysis scripts.
Stars: 25
Forks: 14
Language: Jupyter Notebook
License: MIT
Category:
Last pushed: Nov 21, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/haesleinhuepf/human-eval-bia"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
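A minimal Python sketch of the same request, assuming the endpoint returns JSON (the response schema is not documented here):

# Same endpoint as the curl example above; a JSON response is an assumption.
import requests

url = "https://pt-edge.onrender.com/api/v1/quality/transformers/haesleinhuepf/human-eval-bia"
response = requests.get(url, timeout=30)
response.raise_for_status()
print(response.json())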
Higher-rated alternatives
eth-sri/matharena
Evaluation of LLMs on latest math competitions
tatsu-lab/alpaca_eval
An automatic evaluator for instruction-following language models. Human-validated, high-quality,...
HPAI-BSC/TuRTLe
TuRTLe: A Unified Evaluation of LLMs for RTL Generation 🐢 (MLCAD 2025)
nlp-uoregon/mlmm-evaluation
Multilingual Large Language Models Evaluation Benchmark
ShuntaroOkuma/adapt-gauge-core
Measure LLM adaptation efficiency — how fast models learn from few examples