The paper introduces a surrogate-based prevalence measurement framework for estimating how often users are exposed to specific content attributes in A/B tests, addressing the cost and latency of labeling content separately for every experiment. The authors calibrate a surrogate signal (a model score discretized into buckets) against offline labeled data, then estimate prevalence for each experiment arm and segment from impression logs alone. Validated across multiple large-scale A/B tests, the surrogate estimates closely match reference estimates for both arm-level prevalence and treatment-control deltas, enabling scalable, low-latency prevalence measurement.
Stop running expensive labeling jobs for every A/B test: this framework lets you estimate content prevalence at scale using only impression logs and an offline calibration.
Online media platforms often need to measure how frequently users are exposed to specific content attributes in order to evaluate trade-offs in A/B experiments. A direct approach is to sample content, label it using a high-quality rubric (e.g., an expert-reviewed LLM prompt), and estimate impression-weighted prevalence. However, repeatedly running such labeling for every experiment arm and segment is too costly and slow to serve as a default measurement at scale. We present a scalable surrogate-based prevalence measurement framework that decouples expensive labeling from per-experiment evaluation. The framework calibrates a surrogate signal to reference labels offline and then uses only impression logs to estimate prevalence for arbitrary experiment arms and segments. We instantiate this framework using score bucketing as the surrogate: we discretize a model score into buckets, estimate bucket-level prevalences from an offline labeled sample, and combine these calibrated bucket-level prevalences with the bucket distribution of impressions in each arm to obtain fast, log-based estimates. Across multiple large-scale A/B tests, we validate that the surrogate estimates closely match the reference estimates for both arm-level prevalence and treatment-control deltas. This enables scalable, low-latency prevalence measurement in experimentation without requiring per-experiment labeling jobs.
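The score-bucketing estimator described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the bucket edges, the function names (`bucket_of`, `calibrate`, `estimate_prevalence`), and the toy data are all hypothetical, and the labeled sample is assumed to be drawn so that its within-bucket label rate reflects impression-weighted prevalence. The arm-level estimate is the sum over buckets of (share of the arm's impressions in the bucket) times (calibrated prevalence of the bucket).

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical bucket edges over a model score in [0, 1]; the abstract does
# not specify the actual bucketing scheme.
BUCKET_EDGES = [0.25, 0.5, 0.75]


def bucket_of(score):
    """Map a model score to a bucket index using the edges above."""
    return bisect_right(BUCKET_EDGES, score)


def calibrate(labeled_sample):
    """Estimate per-bucket prevalence from offline (score, label) pairs.

    label is 1 if the reference rubric flags the content, else 0.
    """
    positives, totals = Counter(), Counter()
    for score, label in labeled_sample:
        b = bucket_of(score)
        totals[b] += 1
        positives[b] += label
    return {b: positives[b] / totals[b] for b in totals}


def estimate_prevalence(impression_scores, bucket_prevalence):
    """Weight calibrated bucket prevalences by an arm's impression shares.

    Raises KeyError if the arm has impressions in a bucket that the
    calibration sample never covered.
    """
    counts = Counter(bucket_of(s) for s in impression_scores)
    n = sum(counts.values())
    return sum(counts[b] / n * bucket_prevalence[b] for b in counts)


# Toy data, invented purely for illustration.
labeled = [(0.1, 0), (0.2, 0), (0.3, 0), (0.4, 1),
           (0.6, 1), (0.7, 0), (0.8, 1), (0.9, 1)]
rates = calibrate(labeled)  # offline calibration step

control_scores = [0.1, 0.3, 0.6, 0.9]    # model scores of logged impressions
treatment_scores = [0.1, 0.2, 0.3, 0.6]

control_prev = estimate_prevalence(control_scores, rates)
treatment_prev = estimate_prevalence(treatment_scores, rates)
delta = treatment_prev - control_prev
print(control_prev, treatment_prev, delta)
```

Note that only `calibrate` touches labels; `estimate_prevalence` needs just the logged scores, which is what lets the same calibration be reused across arms and segments. The sketch also assumes the within-bucket prevalence is the same in every arm, which is the condition under which a surrogate of this kind would be expected to track the reference estimate.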