KID-22/Cocktail
Cocktail: A Comprehensive Information Retrieval Benchmark with LLM-Generated Documents Integration
This project helps researchers and developers evaluate how well their information retrieval (IR) systems find relevant documents when part of the corpus is generated by AI. It provides a collection of datasets that mix human-written and LLM-generated texts across various domains and tasks. The output is a performance assessment of IR models on these mixed corpora, helping users understand model biases and effectiveness in the era of large language models.
No commits in the last 6 months.
Use this if you need to rigorously test how well your information retrieval model handles a mix of human-written and AI-generated content, especially its handling of recent, previously unseen information.
Not ideal if you are looking for a ready-to-deploy retrieval system for end users; this is a benchmark for evaluating such systems, not a system itself.
Stars: 15
Forks: —
Language: Python
License: MIT
Category: —
Last pushed: Jun 04, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/KID-22/Cocktail"
Open to everyone: 100 requests/day with no key required. A free key raises the limit to 1,000 requests/day.
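The curl call above can be wrapped in a short Python script. Only the URL pattern comes from the example; the JSON schema returned by the endpoint is not documented here, so this sketch just prints whatever the server sends back.

```python
# Minimal sketch of querying the quality API for a repository.
# The base URL and path pattern are taken from the curl example above;
# the shape of the JSON response is an unknown, so we print it raw.
import json
import urllib.request

BASE = "https://pt-edge.onrender.com/api/v1/quality/llm-tools"

def quality_url(owner: str, repo: str) -> str:
    """Build the per-repository quality endpoint URL."""
    return f"{BASE}/{owner}/{repo}"

if __name__ == "__main__":
    url = quality_url("KID-22", "Cocktail")
    with urllib.request.urlopen(url) as resp:  # unauthenticated network call
        print(json.dumps(json.load(resp), indent=2))
```

With an API key, you would presumably pass it as a header or query parameter; the page does not document that, so it is omitted here.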
Higher-rated alternatives
sierra-research/tau2-bench
τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment
xlang-ai/OSWorld
[NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
bigcode-project/bigcodebench
[ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI
THUDM/AgentBench
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
scicode-bench/SciCode
A benchmark that challenges language models to code solutions for scientific problems