1 comment

  • subhadipmitra 8 hours ago
    Hey HN, I built this because most LLM eval tools assume single-machine execution. When you need to evaluate against millions of examples (customer tickets, documents, etc.), they don't scale without significant duct-taping.

      spark-llm-eval runs natively on Spark - not "Spark as an afterthought" but distributed evaluation as the primary design goal.
    
      Key features:
      - Distributed inference via Pandas UDFs; scales linearly with the number of executors (see the sketch after this list)
      - Statistical rigor by default: bootstrap CIs, paired t-tests, effect sizes
      - Multi-provider: OpenAI, Anthropic, Gemini, vLLM
      - Delta Lake integration for versioned results with lineage
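    
      To give a feel for the Pandas UDF pattern, here is a minimal sketch in plain PySpark - this is the general approach, not the library's internals, and the OpenAI client and model name are placeholder choices:
    
        import pandas as pd
        from pyspark.sql import SparkSession
        from pyspark.sql.functions import pandas_udf
        from pyspark.sql.types import StringType
    
        spark = SparkSession.builder.appName("llm-eval-sketch").getOrCreate()
    
        @pandas_udf(StringType())
        def generate(prompts: pd.Series) -> pd.Series:
            # Runs once per batch on each executor; the client is built inside
            # the UDF so nothing non-serializable is shipped from the driver.
            from openai import OpenAI  # placeholder provider choice
            client = OpenAI()
            outputs = []
            for prompt in prompts:
                resp = client.chat.completions.create(
                    model="gpt-4o-mini",  # placeholder model name
                    messages=[{"role": "user", "content": prompt}],
                )
                outputs.append(resp.choices[0].message.content)
            return pd.Series(outputs)
    
        df = spark.createDataFrame([("Summarize: ...",)], ["prompt"])
        df.withColumn("completion", generate("prompt")).show(truncate=False)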
    
      pip install spark-llm-eval
    
      The main gap I'm filling: "I have 2M labeled examples and need to know if Model A is statistically significantly better than Model B." Most frameworks give you point estimates; this gives you confidence intervals and significance tests.
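
      To make the statistics concrete, here is a rough sketch of that kind of comparison using numpy/scipy directly (not the library's API); the scores are synthetic stand-ins for per-example correctness of two models on the same eval set:

        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(0)
        # Stand-ins for per-example correctness on the same labeled examples.
        scores_a = rng.binomial(1, 0.80, size=2000)
        scores_b = rng.binomial(1, 0.83, size=2000)

        # Bootstrap a 95% CI for the accuracy gap by resampling per-example diffs.
        diffs = scores_b - scores_a
        boot_means = [rng.choice(diffs, size=diffs.size, replace=True).mean()
                      for _ in range(10_000)]
        lo, hi = np.percentile(boot_means, [2.5, 97.5])

        # Paired t-test over the same examples.
        t_stat, p_value = stats.ttest_rel(scores_b, scores_a)

        print(f"gap={diffs.mean():.3f}  95% CI=[{lo:.3f}, {hi:.3f}]  p={p_value:.4f}")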
    
      Blog post with architecture details: https://subhadipmitra.com/blog/2025/building-spark-llm-eval/
    
      Happy to answer questions about the implementation - rate limiting in distributed contexts was surprisingly tricky.
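
      One common way to handle that - shown here only as a toy sketch, not the library's code - is a per-partition token bucket, where each task takes a slice of the global request budget:

        import time

        class TokenBucket:
            """Simple local rate limiter; each partition gets its own bucket."""
            def __init__(self, rate_per_sec: float, capacity: int):
                self.rate = rate_per_sec
                self.capacity = capacity
                self.tokens = float(capacity)
                self.last = time.monotonic()

            def acquire(self) -> None:
                # Refill based on elapsed time, then block until a token is free.
                while True:
                    now = time.monotonic()
                    self.tokens = min(self.capacity,
                                      self.tokens + (now - self.last) * self.rate)
                    self.last = now
                    if self.tokens >= 1:
                        self.tokens -= 1
                        return
                    time.sleep((1 - self.tokens) / self.rate)

        # e.g. a global budget of 600 requests/min split across 20 partitions:
        bucket = TokenBucket(rate_per_sec=600 / 60 / 20, capacity=5)
        bucket.acquire()  # call before each API request inside the UDF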