Hey HN, I built this because most LLM eval tools assume single-machine execution. When you need to evaluate against millions of examples (customer tickets, documents, etc.), they don't scale without significant duct-taping.
spark-llm-eval runs natively on Spark - not "Spark as an afterthought" but distributed evaluation as the primary design goal.
Key features:
- Distributed inference via Pandas UDFs, scales linearly with executors (see the sketch below the list)
- Statistical rigor by default: bootstrap CIs, paired t-tests, effect sizes
- Multi-provider: OpenAI, Anthropic, Gemini, vLLM
- Delta Lake integration for versioned results with lineage
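Roughly, the Pandas UDF pattern looks like this (a simplified sketch, not the library's actual code - the OpenAI client, model name, and column names are placeholders, and you'd need an API key set in the environment):

    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("llm-eval-sketch").getOrCreate()

    @pandas_udf(StringType())
    def generate(prompts: pd.Series) -> pd.Series:
        # Each executor gets Arrow batches of prompts; the client is created
        # inside the UDF so nothing non-serializable ships from the driver.
        from openai import OpenAI
        client = OpenAI()
        outputs = []
        for prompt in prompts:
            resp = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{"role": "user", "content": prompt}],
            )
            outputs.append(resp.choices[0].message.content)
        return pd.Series(outputs)

    df = spark.createDataFrame([("Summarize: ...",)], ["prompt"])
    results = df.withColumn("completion", generate("prompt"))

Because the work is just a column transformation, adding executors adds throughput without changing the evaluation code.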
pip install spark-llm-eval
The main gap I'm filling: "I have 2M labeled examples and need to know if Model A is statistically significantly better than Model B." Most frameworks give you point estimates; this gives you confidence intervals and significance tests.
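For context, here's a standalone sketch of the kind of statistics I mean - a paired bootstrap CI on the accuracy difference between two models, plus a paired t-test and Cohen's d. This uses numpy/scipy on synthetic scores and isn't the library's internal code:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    # Stand-in per-example correctness scores for models A and B
    scores_a = rng.integers(0, 2, size=2000).astype(float)
    scores_b = rng.integers(0, 2, size=2000).astype(float)

    def paired_bootstrap_ci(a, b, n_resamples=10_000, alpha=0.05):
        # Resample example indices so A and B stay paired on the same examples.
        n = len(a)
        idx = rng.integers(0, n, size=(n_resamples, n))
        diffs = a[idx].mean(axis=1) - b[idx].mean(axis=1)
        lo, hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
        return a.mean() - b.mean(), (lo, hi)

    diff, (lo, hi) = paired_bootstrap_ci(scores_a, scores_b)
    t_stat, p_value = stats.ttest_rel(scores_a, scores_b)  # paired t-test
    deltas = scores_a - scores_b
    cohens_d = deltas.mean() / deltas.std(ddof=1)          # paired effect size

    print(f"diff {diff:.3f}, 95% CI [{lo:.3f}, {hi:.3f}], p={p_value:.3g}, d={cohens_d:.3f}")

The point is that with 2M examples you get an interval and a p-value on the A-vs-B difference, not just two accuracy numbers.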
Blog post with architecture details: https://subhadipmitra.com/blog/2025/building-spark-llm-eval/
Happy to answer questions about the implementation - rate limiting in distributed contexts was surprisingly tricky.
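For the curious, one common pattern (a simplified illustration, not necessarily exactly what the library does) is to split a global requests-per-minute budget across partitions and run a local token bucket inside each UDF:

    import time

    class TokenBucket:
        """Per-executor token bucket; rate is requests per second."""
        def __init__(self, rate: float, capacity: float):
            self.rate = rate
            self.capacity = capacity
            self.tokens = capacity
            self.last = time.monotonic()

        def acquire(self):
            while True:
                now = time.monotonic()
                self.tokens = min(self.capacity,
                                  self.tokens + (now - self.last) * self.rate)
                self.last = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
                # Sleep just long enough to accrue the missing fraction of a token.
                time.sleep((1 - self.tokens) / self.rate)

    # e.g. a 600 RPM provider budget spread over 20 partitions -> 0.5 req/s each
    bucket = TokenBucket(rate=600 / 60 / 20, capacity=5)
    bucket.acquire()  # call before each API request inside the UDF

The tricky parts are the ones this sketch ignores: partitions don't all start at once, retries after 429s eat into the budget, and the partition count isn't always known up front.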