This paper introduces RubricsTree, a scalable evaluation framework for personal health agents built on a hierarchical taxonomy of over 100 clinically verifiable Boolean rubrics. The taxonomy is curated through an iterative human-in-the-loop process informed by 4,000 real user queries. In the authors' meta-evaluation, RubricsTree aligns with expert judgment substantially better than a strong large-scale evaluation baseline on challenging open-ended queries and reliably penalizes contextually degraded responses. When used as structured instructions, text feedback, or training rewards, it yields relative gains of up to ~66% on HealthBench.
RubricsTree offers expert-aligned, scalable evaluation for personal health agents and delivers sizable performance gains, addressing the evaluation bottleneck that constrains clinical deployment.
LLM-empowered personal health agents that draw on user health (sensor) metrics offer a promising pathway to alleviate global disparities in healthcare access. However, large-scale clinical deployment remains constrained by an open-ended evaluation bottleneck: physician annotation is reliable but costly and unscalable, while LLM-as-a-judge evaluators are scalable but subjective, inconsistent, and sometimes clinically misaligned. We introduce RubricsTree, a scalable evaluation framework with an expert-aligned hierarchical taxonomy of over 100 atomic, clinically verifiable Boolean rubrics. The taxonomy is derived from insights from 4,000 real user queries through an iterative human-in-the-loop curation protocol with an expert panel led by an experienced physician. A context-aware adaptive router activates only the relevant, auto-weighted rubric subset for each query, providing the throughput needed for scalable evaluation with expert-aligned quality. Through a systematic meta-evaluation, we show that RubricsTree (i) substantially exceeds a strong large-scale evaluation baseline in expert alignment on challenging open-ended queries; (ii) reliably penalizes contextually degraded responses; and (iii) when used as structured instructions, text feedback, or training rewards for performance optimization, yields relative gains of up to ~66% on HealthBench for the Gemini, GPT, and Qwen model families. RubricsTree thus provides the scalable, auditable, and evolving evaluation infrastructure required for the continuous optimization of product-level personal healthcare AI.
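To make the routing-and-scoring idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each rubric is a weighted Boolean check attached to a taxonomy category, that the router selects rubrics by category membership, and that the final score is the weighted fraction of active rubrics satisfied. The names `Rubric`, `route`, and `score`, the categories, the weights, and the keyword-based checks are all hypothetical; the abstract does not specify how rubrics are judged or how weights are assigned.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Set


@dataclass(frozen=True)
class Rubric:
    """One atomic Boolean rubric (hypothetical structure)."""
    rubric_id: str
    category: str                  # assumed taxonomy node, e.g. "sleep"
    weight: float                  # assumed importance weight
    check: Callable[[str], bool]   # stand-in for whatever judges the rubric


def route(rubrics: Iterable[Rubric], query_categories: Set[str]) -> List[Rubric]:
    """Activate only the rubrics whose category is relevant to the query."""
    return [r for r in rubrics if r.category in query_categories]


def score(response: str, active: List[Rubric]) -> float:
    """Weighted fraction of active rubrics that the response satisfies."""
    total = sum(r.weight for r in active)
    if total == 0:
        return 0.0  # no applicable rubrics: define the score as zero
    passed = sum(r.weight for r in active if r.check(response))
    return passed / total


# Toy rubrics with keyword checks standing in for real rubric judgments.
rubrics = [
    Rubric("cites_user_data", "sleep", 3.0, lambda r: "7.5 hours" in r),
    Rubric("advises_followup", "sleep", 1.0, lambda r: "doctor" in r.lower()),
    Rubric("mentions_heart_rate", "cardio", 1.0, lambda r: "heart rate" in r.lower()),
]

active = route(rubrics, {"sleep"})  # the cardio rubric is not activated
response = "You averaged 7.5 hours of sleep this week."
result = score(response, active)    # 3.0 of 4.0 weight satisfied
```

Because every rubric is a yes/no check with an explicit weight, the resulting score can be traced back to the individual rubrics that passed or failed, which is one plausible reading of the auditability the abstract describes.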