TJ-Neary/AI_Eval
A comprehensive LLM evaluation framework that compares local and cloud models with hardware-aware benchmarking. It evaluates models across code generation, document analysis, and structured output using pass@k, LLM-as-Judge, and RAG metrics (see the pass@k sketch below). Supports Ollama, Google Gemini, Anthropic, and OpenAI.
Stars: —
Forks: —
Language: Python
License: MIT
Category: —
Last pushed: Mar 06, 2026
Commits (30d): 0
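For context on the metrics listed above: pass@k is usually computed with the unbiased estimator from Chen et al. (2021). A minimal Python sketch of that estimator follows; it illustrates the metric itself and is not taken from this repository.

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples
    drawn from n generations passes, given that c of them passed."""
    if n - c < k:
        # fewer than k failures exist, so every k-subset contains a pass
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 generations per task, 3 pass the unit tests
print(pass_at_k(10, 3, 1))   # 0.3
print(pass_at_k(10, 3, 5))   # ~0.9167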
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/rag/TJ-Neary/AI_Eval"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
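The same endpoint can be called from Python instead of curl. A minimal sketch using requests; the shape of the JSON payload is not documented here, so inspect it before relying on any field names:

import requests

url = "https://pt-edge.onrender.com/api/v1/quality/rag/TJ-Neary/AI_Eval"
resp = requests.get(url, timeout=10)
resp.raise_for_status()   # 4xx/5xx responses (e.g. rate limiting) raise here
data = resp.json()        # payload schema is undocumented here; print and inspect it
print(data)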
Higher-rated alternatives
- FastBuilderAI/memory: FastMemory is a topological representation of text data using concepts as the primary input. It...
- syncreus/syncreus-eval: Evaluate your LLM apps with one function call. Hallucination detection, RAG scoring, and agent...
- bevinkatti/rag-harness: ⚡ CLI to Evaluate and Compare RAG systems with RAGAS-style scoring
- verifywise-ai/verifywise-eval-action: GitHub Action & Python SDK to evaluate LLMs in CI/CD — gate PRs on correctness, faithfulness,...
- masaakisakamoto/memory-os: Deterministic continuity for AI systems. Detect and repair inconsistencies across sessions — not...