$T_{\mathrm{eff}} := \max(m_{1}, m_{2}) / m_{\mathrm{fuse}} \geq 1$, which becomes larger when the two branches disagree in their margins, yielding a more conservative predictive distribution. To quantify this confidence moderation, we report the confidence-shrinkage ratio (CSR) under a Top-2 diagnostic view. Let $s(m) = 1/(1+\exp(-m))$ map a logit margin to a scalar confidence proxy (exact in the binary case; used here as a probe when the Top-2 logits are preserved). Using the effective-temperature view $m_{\mathrm{fuse}} = \max(m_{1}, m_{2}) / T_{\mathrm{eff}}$, we define

$$\mathrm{CSR}(x) := \frac{s\!\big(\max(m_{1}, m_{2}) / T_{\mathrm{eff}}(x)\big)}{s\!\big(\max(m_{1}, m_{2})\big)}. \qquad (6)$$

Empirically, $\mathrm{CSR}(x) < 1$ is most pronounced on disagreement subsets (Fig. 3), indicating more conservative predictions when the branches disagree. For fixed branch logits, the cross-entropy is convex in the scalar gate $g$, which makes optimization of the gate well behaved in isolation (Appendix C).

3.4 Complexity and Implementation

The Knob family is designed for efficiency: the gating mechanism introduces only $\mathcal{O}(D)$ overhead. The target displacement network $u^{*}(x)$ is a lightweight MLP, and the physical parameters $\zeta$ and $\omega_{n}$ are scalars, so the increases in parameters, GFLOPs, and latency are negligible (Table 3). The lightweight ODE-Lite variant, which applies a simple first-order exponential moving average (EMA) to the gate, offers a particularly attractive trade-off between performance and cost.

4 Experiments

Our experiments aim to: (1) compare the Knob framework's overall performance with established baselines on a benchmark featuring standard distribution shifts, and (2) validate our theoretical claims through targeted empirical probes.

4.1 Experimental Setup and Metrics

Datasets and Model.
We use CIFAR-10 (Krizhevsky, 2009) for training and CIFAR-10-C (Hendrycks and Dietterich, 2019) for evaluation under distribution shift. CIFAR-10-C applies 19 corruption types (e.g., Gaussian noise, blur, snow) to the test set at 5 severity levels, serving as a rigorous stress test for model calibration. Our backbone is a ResNet-18 (He et al., 2016). All methods share the same architecture and training protocol for fair comparison.

Training Protocol. Models are trained for 100 epochs using the AdamW optimizer with automatic mixed precision. We employ a curriculum sampling strategy that gradually increases corruption severity, which we found stabilizes learning of the gating mechanism (see Appendix A for hyperparameter details).

Evaluation Metrics. To provide a comprehensive assessment, we use a suite of metrics covering accuracy, calibration, and efficiency:

• Avg-C (%): mean accuracy, averaged first over corruption severities and then over corruption types.
• ECE_deb: debiased Expected Calibration Error, measuring the gap between confidence and accuracy (lower is better).
• Err-C (%): mean classification error on CIFAR-10-C, defined as $\mathrm{Err\text{-}C} = 100 - \mathrm{Avg\text{-}C}$ (lower is better); reported alongside Avg-C for readability.
• GFLOPs / Latency (ms): computational efficiency.

Table 1: Terminology and Metric Definitions. Symbol/Term
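To make the CSR diagnostic of Eq. (6) concrete, the following sketch computes it from two branch margins. The fusion rule $m_{\mathrm{fuse}} = \min(m_1, m_2)$ is an illustrative assumption only (in the paper, the fused margin arises from the gated combination of branch logits), and both margins are assumed positive so that $T_{\mathrm{eff}} \geq 1$.

```python
import math


def sigmoid(m: float) -> float:
    """Confidence proxy s(m) = 1 / (1 + exp(-m))."""
    return 1.0 / (1.0 + math.exp(-m))


def csr(m1: float, m2: float) -> float:
    """Confidence-shrinkage ratio for two positive branch margins.

    T_eff = max(m1, m2) / m_fuse >= 1, so s(max / T_eff) = s(m_fuse)
    and the ratio is <= 1, with equality when the branches agree.
    The fusion rule m_fuse = min(m1, m2) is an assumption made purely
    for illustration.
    """
    m_max = max(m1, m2)
    m_fuse = min(m1, m2)      # assumed fusion rule (illustrative)
    t_eff = m_max / m_fuse    # >= 1 by construction
    return sigmoid(m_max / t_eff) / sigmoid(m_max)
```

When the branches agree (m1 = m2), T_eff = 1 and CSR = 1; the larger the disagreement, the further CSR drops below 1, matching the qualitative behavior described above.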
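For reference, a plain equal-width binned ECE estimator is sketched below; the debiased variant ECE_deb reported here additionally corrects the finite-sample bias of this binned estimate, which we omit for brevity.

```python
def binned_ece(confidences, correct, n_bins=15):
    """Equal-width binned Expected Calibration Error.

    confidences: iterable of max-class confidences in [0, 1]
    correct:     iterable of 0/1 indicators (was the prediction correct?)

    Note: ECE_deb applies a bias correction on top of this quantity;
    only the plain binned estimator is shown here.
    """
    confidences = list(confidences)
    correct = list(correct)
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        # clamp the index so that c == 1.0 falls into the last bin
        i = min(int(c * n_bins), n_bins - 1)
        bins[i].append((c, y))
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            accuracy = sum(y for _, y in b) / len(b)
            ece += (len(b) / n) * abs(avg_conf - accuracy)
    return ece
```

A perfectly calibrated model (e.g., confidence 0.8 with 80% accuracy in that bin) yields an ECE of 0, while systematic overconfidence inflates the per-bin confidence-accuracy gaps.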