A. Sloan

Papers on Lattice

Total citations

Topics

h-index

Publication activity (papers/week, last 8 weeks)

Research focus

Eval Frameworks & Benchmarks (1), Tool Use & Agents (1)

Frequent co-authors

Sunishchal Dev (1), Andrew Sloan (1), Joshua Kavner (1), Nicholas Kong (1)

Papers (1)

Mar 5, 2026

1w ago

Judge Reliability Harness: Stress Testing the Reliability of LLM Judges

LLM judges, widely used in AI benchmarks, can be surprisingly unreliable, with simple text formatting changes or paraphrasing leading to inconsistent judgments.

Sunishchal Dev, A. Sloan, Andrew Sloan +3

Eval Frameworks & Benchmarks, Tool Use & Agents
