Mario Trapp

Technical University of Munich, Fraunhofer IKS

$$\mathcal{L}_{\text{acl}}=\frac{1}{M}\sum_{i=1}^{M}\max(0,\textit{conf}_{i}-\textit{conf}).$$ (7)

Appendix F More Ablation Studies

Parameter Sensitivity. We evaluate the sensitivity of our framework to two key hyperparameters using the HMDB51 dataset. First, the maximum swapping dimension $n_{max}$ for Multimodal Feature Swapping (MFS) was varied among 128, 256, and 512, with results presented in Table 10. An $n_{max}$ value of 256 yielded the optimal balance, achieving robust performance across all evaluation metrics. Subsequently, with $n_{max}$ fixed at 256, the weight $\lambda_{\text{acl}}$ for Adaptive Confidence Loss (ACL) was evaluated over the set 0.2, 0.5, 1.0, and 2.0 (detailed in Table 10). A value of $\lambda_{\text{acl}}=2.0$ consistently delivered the strongest FD performance. Importantly, the framework's performance remained stable across both parameter sweeps, underscoring its robustness to variations in these hyperparameters.

n_max   AURC↓   AUROC↑   FPR95↓   ACC↑
128     29.08   88.27    49.57    86.66
256     25.11   90.55    46.22    86.43
512     25.34   90.98    43.90    85.97

Abstract

The deployment of multimodal models in high-stakes domains, such as self-driving vehicles and medical diagnostics, demands not only strong predictive performance but also reliable mechanisms for detecting failures. In this work, we address the largely unexplored problem of failure detection in multimodal contexts. We propose Adaptive Confidence Regularization (ACR), a novel framework specifically designed to detect multimodal failures. Our approach is driven by a key observation: in most failure cases, the confidence of the multimodal prediction is significantly lower than that of at least one unimodal branch, a phenomenon we term confidence degradation. To mitigate this, we introduce an Adaptive Confidence Loss that penalizes such degradations during training.
In addition, we propose Multimodal Feature Swapping, a novel outlier synthesis technique that generates challenging, failure-aware training examples. By training with these synthetic failures, ACR learns to more effectively recognize and reject uncertain predictions, thereby improving overall reliability. Extensive experiments across four datasets, three modalities, and multiple evaluation settings demonstrate that ACR achieves consistent and robust gains. The source code will be available at https://github.com/mona4399/ACR.

1 Introduction

Multimodal models are increasingly adopted in safety-critical domains such as autonomous driving and medical diagnostics [14, 57, 20]. By integrating complementary cues from diverse modalities (e.g., video, audio), they often achieve superior robustness and generalization over unimodal approaches [18, 62]. However, even state-of-the-art models can be dangerously overconfident in their erroneous predictions [67], posing serious risks in high-stakes applications. In such settings, detecting untrustworthy predictions is as crucial as achieving high overall accuracy. While prior work in uncertainty estimation [35], calibration [22], and out-of-distribution (OOD) detection [41] has aimed to mitigate overconfidence, these methods often fail to reliably flag individual predictions that should be rejected. Failure detection (FD), also referred to as misclassification detection or selective classification, directly addresses this challenge by identifying unreliable predictions for potential rejection or human intervention, thereby reducing the risk of catastrophic failures [19].

While FD is well-established in unimodal settings, with methods spanning confidence-based scoring [21, 30], outlier exposure [7, 73], and confidence learning [10, 45], its extension to multimodal systems remains largely unexplored.
This gap is non-trivial, as unimodal approaches often fail to effectively leverage the complementary information across modalities or to handle failure modes unique to multimodal data, such as signal conflict and misalignment [50]. Furthermore, some works [16, 37] explore OOD detection with multiple modalities, but their settings fundamentally differ from those of FD.

To illustrate the potential benefits of utilizing multiple modalities for FD, we present empirical results on the HMDB51 dataset [34]. All models in this analysis were trained solely with a standard cross-entropy loss. As shown in Figure 1 (left), a simple fusion of video and optical flow inputs substantially improves FD performance (measured by AURC, AUROC, and FPR95) over unimodal baselines. This finding highlights the considerable potential of multimodal signals for improving FD. Concurrently, Figure 1 (right) reveals that sophisticated OOD detection methods like Energy [41], Entropy [59], and MaxLogit [25] are outperformed by a simple Maximum Softmax Probability (MSP) baseline [26]. Taken together, these findings demonstrate that merely adapting OOD techniques is insufficient and motivate the development of dedicated methods tailored for multimodal FD.

Figure 1: (Left) Multimodal models substantially enhance FD performance compared to unimodal models, without the need for complex designs. (Right) Advanced OOD detection methods underperform on FD tasks, while the simple MSP baseline surprisingly remains the most effective.

In this work, we identify and systematically characterize the phenomenon of confidence degradation, a scenario where the confidence of fused multimodal predictions undesirably falls below that of individual unimodal predictions, particularly in misclassified instances. To address this, we propose Adaptive Confidence Regularization (ACR), the first dedicated framework for detecting failures (i.e., misclassifications) in multimodal systems.
ACR comprises two key innovations: (1) an Adaptive Confidence Loss that explicitly penalizes confidence degradation during training, and (2) Multimodal Feature Swapping, a novel augmentation technique that synthesizes challenging, failure-aware training samples by swapping cross-modal embeddings. Training with the confidence penalty and failure-aware outliers improves the model's ability to detect and reject uncertain samples, yielding gains in both accuracy and FD performance. Comprehensive experiments across five datasets and five modalities demonstrate that ACR sets a new state of the art, outperforming prior best methods by up to 9.58% in AURC, 1.63% in AUROC, and 15.45% in FPR95. Further ablation studies under distribution shifts and multimodal OOD detection settings confirm the robustness and strong generalization of our approach.

The primary contributions of this work are:
• We highlight the importance of leveraging multimodal inputs for effective FD, and provide empirical evidence on the limitations of existing OOD detection approaches in this context.
• We reveal and empirically validate the phenomenon of confidence degradation in multimodal models, showing its strong correlation with prediction failures.
• We propose ACR, the first dedicated framework tailored to the complex task of multimodal FD. ACR integrates a novel Adaptive Confidence Loss, addressing the issue of confidence degradation, and introduces Multimodal Feature Swapping to further enhance confidence reliability.
• We perform extensive evaluations across diverse datasets and modalities, demonstrating the robustness and effectiveness of ACR in a wide range of scenarios.

2 Methodology

2.1 Problem Setup

Multimodal Failure Detection aims to detect misclassified samples using multiple modalities. We consider a training set $\mathbb{D}=\{(\mathbf{x}_{i},y_{i})\}_{i=1}^{n}$ drawn i.i.d.
from the joint data distribution $P_{\mathcal{X}\mathcal{Y}}$, where $\mathcal{X}$ is the input space and $\mathcal{Y}=\{1,2,\dots,C\}$ is the label space. Each sample $\mathbf{x}_{i}$ is composed of $M$ modalities, denoted as $\mathbf{x}_{i}=\{x_{i}^{k}\mid k=1,\cdots,M\}$. Let $f:\mathcal{X}\mapsto\mathbb{R}^{C}$ be a neural network trained on samples in $P_{\mathcal{X}\mathcal{Y}}$ that predicts the label of each input sample. In multimodal failure detection, $f$ comprises $M$ feature extractors $g_{k}(\cdot)$ and a classifier $h(\cdot)$. Each feature extractor $g_{k}(\cdot)$ extracts an embedding $\mathbf{E}^{k}$ for its corresponding modality $k$, and the classifier $h(\cdot)$ takes the combined embeddings from all modalities as input and outputs a prediction probability $\hat{p}$:

$$\begin{split}\hat{p}=\delta(f(\mathbf{x}))&=\delta(h([g_{1}(x^{1}),\dots,g_{M}(x^{M})]))\\ &=\delta(h([\mathbf{E}^{1},\dots,\mathbf{E}^{M}])),\end{split}$$ (1)

where $\delta(\cdot)$ is the softmax function. We further include a classifier $h_{k}(\cdot)$ for each modality $k$ to get predictions from each modality separately, with the prediction probability from the $k$-th modality given by $\hat{p}^{k}=\delta(h_{k}(g_{k}(x^{k})))$.

To safely deploy classifier $f$ in real-world applications, it should not only be able to make accurate predictions but also distinguish and reject incorrect ones. Formally, let $\kappa(\cdot)$ be a confidence-scoring function that quantifies the model's confidence in its prediction. With a predefined threshold $\tau\in\mathbb{R}^{+}$, the misclassified samples can be detected based on a decision function $G$ such that for a given input $\mathbf{x}$:

$$G(\mathbf{x})=\left\{\begin{aligned} &\text{correct} && \text{if}~~\kappa(\mathbf{x})\geq\tau,\\ &\text{misclassified} && \text{otherwise}.\end{aligned}\right.$$
(2)

For example, we can easily use MSP [26] as the confidence-scoring function for a given input $\mathbf{x}$ as $\kappa(\mathbf{x})=\max_{y\in\mathcal{Y}}\hat{p}$. Similarly, other confidence-scoring functions can be adapted from the OOD detection literature, such as MaxLogit [25], Energy [41], and Entropy [5].

2.2 Confidence Degradation: A Failure Indicator in Multimodal Systems

We begin by investigating the relationship between multimodal and unimodal prediction confidences to identify systematic patterns that distinguish correct classifications from errors. Our analysis, which uses MSP for confidence scoring, spans four diverse action recognition datasets: HMDB51 [34], EPIC-Kitchens [11], HAC [15], and Kinetics-600 [31]. We consistently observe a specific failure pattern where the confidence of the multimodal prediction $\hat{p}$ falls below that of an individual modality $\hat{p}^{k}$. We formalize this phenomenon as follows:

Definition 1 (Confidence Degradation). A sample is considered to exhibit confidence degradation if the confidence of the fused multimodal prediction is strictly lower than that of at least one of its unimodal counterparts:

$$\exists\,k\in\{1,\dots,M\}\quad\text{s.t.}\quad\max_{y\in\mathcal{Y}}\hat{p}<\max_{y\in\mathcal{Y}}\hat{p}^{k}.$$

Figure 2: Misclassified samples exhibit a significantly higher proportion of confidence degradation compared to correctly classified ones.

Figure 2 illustrates the central finding: confidence degradation is strongly associated with prediction failures. Across all datasets, misclassified samples consistently exhibit a markedly higher rate of degradation than correct predictions, with increases of 32.4% on HMDB51, 23.1% on EPIC-Kitchens, 52.4% on HAC, and 22.0% on Kinetics-600. This suggests that failures in multimodal systems frequently coincide with such confidence degradation.
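Definition 1 can be checked per sample with a few lines of code. The following is a minimal sketch rather than the authors' implementation: it assumes the fused and unimodal branches expose raw logits as plain Python lists, and the function names (`msp`, `has_confidence_degradation`, `degradation_rate`) and the example logits are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def msp(logits):
    """Maximum softmax probability, used here as the confidence score."""
    return max(softmax(logits))


def has_confidence_degradation(fused_logits, unimodal_logits):
    """Definition 1: fused confidence is strictly below at least one unimodal confidence."""
    fused_conf = msp(fused_logits)
    return any(fused_conf < msp(branch) for branch in unimodal_logits)


def degradation_rate(samples):
    """Fraction of (fused_logits, unimodal_logits) pairs that exhibit degradation."""
    flags = [has_confidence_degradation(fused, branches) for fused, branches in samples]
    return sum(flags) / len(flags)


# Hypothetical logits for a 3-class problem with two modalities.
fused = [1.0, 0.8, 0.1]   # fused branch is unsure
video = [3.0, 0.2, 0.1]   # video branch is confident
flow = [0.5, 0.4, 0.3]    # optical-flow branch is unsure
degraded = has_confidence_degradation(fused, [video, flow])
```

Computing `degradation_rate` separately over correctly classified and misclassified samples would give the kind of comparison summarized in Figure 2.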
One explanation is that misclassified samples often contain conflicting or ambiguous signals across modalities, which increases uncertainty. When their unimodal outputs are fused, this uncertainty frequently causes the combined confidence to drop below that of at least one unimodal branch. In contrast, correctly classified samples typically exhibit agreement across modalities, leading to boosted or at least non-degraded fusion confidence. This directly motivates our adaptive training objective, which explicitly penalizes confidence degradation.

2.3 Proposed ACR Framework

We introduce Adaptive Confidence Regularization (ACR), a novel framework for multimodal failure detection that integrates two complementary components (Figure 3). First, motivated by the strong correlation between misclassification and confidence degradation, we propose an Adaptive Confidence Loss that directly penalizes this degradation during training. Second, we introduce Multimodal Feature Swapping, an outlier synthesis technique that generates challenging, failure-aware training samples by exchanging cross-modal embeddings. By training on these synthesized failures, ACR learns a more robust uncertainty representation, improving its ability to reject unreliable predictions.

The ACR architecture processes inputs from multiple modalities. Each input is passed through a modality-specific encoder to yield an embedding, e.g., $\mathbf{E}^{1}$ and $\mathbf{E}^{2}$ for modalities 1 and 2. These embeddings are then concatenated, $\mathbf{E}=[\mathbf{E}^{1},\mathbf{E}^{2}]$, and fed into a fusion classifier to produce the final multimodal prediction $\hat{p}$ with confidence $\textit{conf}=\max_{y\in\mathcal{Y}}\hat{p}$. In parallel, each unimodal embedding $\mathbf{E}^{k}$ is also passed through a dedicated classifier to obtain the unimodal prediction $\hat{p}^{k}$ and its confidence $\textit{conf}_{k}$.

Figure 3: Our ACR framework integrates two principal components.
The Adaptive Confidence Loss is designed to penalize the phenomenon of confidence degradation. The Multimodal Feature Swapping serves to generate challenging, failure-aware training instances. This process enables the model to learn to more effectively identify and reject uncertain samples.

2.4 Adaptive Confidence Loss

Ideally, effective multimodal fusion should achieve synergy, where the confidence of a fused prediction surpasses that of any single modality, assuming all modalities provide predictive information for the target [63]. This reflects the successful integration of complementary information to reduce uncertainty and reinforce the decision. However, as we observe in Section 2.2, misclassifications are strongly correlated with confidence degradation, a phenomenon where the fused confidence falls below that of a unimodal counterpart. Such degradation often arises from conflicting or unreliable signals and serves as a strong indicator of prediction failure.

Motivated by this observation, we introduce the Adaptive Confidence Loss (ACL), which encourages the fused confidence to be at least as high as that of any individual modality. For a two-modality case, ACL is defined as:

$$\mathcal{L}_{\text{acl}}=\frac{1}{2}\left(\max(0,\textit{conf}_{1}-\textit{conf})+\max(0,\textit{conf}_{2}-\textit{conf})\right).$$ (3)

The ACL imposes no penalty when the fused confidence is at least as high as both unimodal confidences; however, it increasingly penalizes instances where the fused confidence is lower than that of either individual modality. Consequently, ACL encourages the fusion mechanism to learn improved information integration, such that combined evidence from different modalities leads to a more confident prediction. By effectively integrating complementary information from different modalities, ACL enhances prediction reliability.
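The hinge structure of Eq. (3) can be made concrete with a small numeric sketch. This is an illustration on plain floats, not the training code: in practice the same expression would be applied to differentiable confidence tensors. The function names are hypothetical, and averaging the per-sample loss over a batch is an assumption not stated in the text.

```python
def adaptive_confidence_loss(fused_conf, unimodal_confs):
    """Mean hinge penalty max(0, conf_k - conf) over the unimodal branches.

    With two branches this matches Eq. (3).
    """
    penalties = [max(0.0, conf_k - fused_conf) for conf_k in unimodal_confs]
    return sum(penalties) / len(penalties)


def batch_acl(fused_confs, unimodal_confs_per_sample):
    """Average the per-sample loss over a batch (batch averaging is an assumption)."""
    losses = [
        adaptive_confidence_loss(conf, branches)
        for conf, branches in zip(fused_confs, unimodal_confs_per_sample)
    ]
    return sum(losses) / len(losses)


# Fused confidence above both branches: no penalty.
no_penalty = adaptive_confidence_loss(0.9, [0.7, 0.8])

# Fused confidence below the first branch only: (0.4 + 0.0) / 2.
penalized = adaptive_confidence_loss(0.5, [0.9, 0.4])
```

The second call shows that only the branches whose confidence exceeds the fused confidence contribute, and that the penalty grows linearly with the size of the gap.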
Furthermore, ACL mitigates unimodal overconfidence by penalizing the model when a high-confidence prediction from one modality conflicts with another. To minimize this cross-modal penalty during training, the model learns to reduce the confidence of the unreliable unimodal stream itself. This process effectively regularizes the unimodal networks, forcing them to become better calibrated and less prone to being "confidently wrong". As a result, the model can integrate information more effectively and produce more reliable multimodal predictions. Additional discussion on ACL is provided in the Appendix.

2.5 Multimodal Feature Swapping

Figure 4: Visualization of outliers generated by Multimodal Feature Swapping with different $n_{\text{swap}}$ (96, 128, 256, 512). Small swaps produce hard negatives that lie near the in-distribution manifold, while larger swaps create more distinct outliers further away.

While Outlier Exposure (OE) is an effective technique for improving OOD detection [27, 69], it has been shown to be ineffective for FD [73]. This is because OE regularizes the decision boundary by compressing the confidence distribution of in-distribution (ID) samples, which inadvertently makes it harder to distinguish correct ID predictions from incorrect ones. A related challenge, particularly in multimodal settings, is the lack of training data that realistically emulates system failures, such as conflicting modality cues or sensor corruption. Although approaches like OpenMix [73] attempt to address these issues by interpolating between ID and outlier data, they have two critical shortcomings for multimodal tasks. First, they depend on large, auxiliary outlier datasets that are often impractical or unavailable. Second, as a fundamentally unimodal method, OpenMix cannot synthesize the complex failure modes that arise from cross-modal interactions.

To generate challenging, failure-aware outliers without external data, we propose Multimodal Feature Swapping (MFS).
MFS operates by dynamically swapping multimodal feature embeddings and assigning them corresponding soft labels (as illustrated in Figure 3). By generating outliers directly in feature space, MFS ensures computational efficiency and compatibility with various modalities. MFS is designed to ensure that the synthesized features remain distinct from ID features while preserving semantic consistency.

Given ID features $\mathbf{E}=[\mathbf{E}^{1},\mathbf{E}^{2}]$, where $\mathbf{E}^{1}$ represents features from modality 1 and $\mathbf{E}^{2}$ from modality 2, MFS randomly selects a subset of $n_{\text{swap}}\sim\mathcal{U}(n_{\text{min}},n_{\text{max}})$ continuous feature dimensions from each modality. These selected dimensions are then swapped to obtain new feature representations $\widetilde{\mathbf{E}}^{1}$ and $\widetilde{\mathbf{E}}^{2}$, which are subsequently concatenated to form the multimodal outlier features $\mathbf{E}_{o}=[\widetilde{\mathbf{E}}^{1},\widetilde{\mathbf{E}}^{2}]$. A prediction $\hat{p}_{o}$ is then obtained from $\mathbf{E}_{o}$ as $\hat{p}_{o}=\delta(h(\mathbf{E}_{o}))$.

To supervise these synthesized outliers, we generate soft labels by interpolating between the original ground-truth one-hot label $\mathbf{y}_{\text{true}}$ and an additional class designated for outliers (e.g., $\mathbf{y}_{\text{outlier}}=C+1$). The weight $\lambda$ for this label interpolation reflects the proportion of features swapped:

$$\mathbf{y}_{\text{swapped}}=(1-\lambda)\mathbf{y}_{\text{true}}+\lambda\mathbf{y}_{\text{outlier}},\quad\text{where}\quad\lambda=\frac{n_{\text{swap}}}{n_{\text{max}}}.$$ (4)

MFS generates failure-aware outliers by partially swapping cross-modal features. Such swapping preserves intra-modality semantics while disrupting cross-modal consistency, capturing a critical and common failure mode in multimodal systems.
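The swapping step and the soft label of Eq. (4) can be sketched as follows. This is a hedged illustration, not the released implementation, and it rests on several assumptions that the text does not pin down: embeddings are plain Python lists, the swapped dimensions form one contiguous block, both modalities use the same start offset, and classes are zero-indexed so that index `num_classes` plays the role of the outlier class $C+1$. The function name `multimodal_feature_swap` is hypothetical.

```python
import random


def multimodal_feature_swap(e1, e2, true_class, num_classes, n_min, n_max, rng=random):
    """Swap a block of n_swap dimensions between two embeddings and build the soft label.

    Returns (outlier_features, soft_label, n_swap).
    """
    limit = min(len(e1), len(e2))
    if not 0 < n_min <= n_max <= limit:
        raise ValueError("need 0 < n_min <= n_max <= embedding length")

    # n_swap ~ U(n_min, n_max), sampled as an integer with both ends inclusive.
    n_swap = rng.randint(n_min, n_max)

    # Assumption: one contiguous block with a shared start offset in both modalities.
    start = rng.randint(0, limit - n_swap)
    stop = start + n_swap

    e1_swapped = e1[:start] + e2[start:stop] + e1[stop:]
    e2_swapped = e2[:start] + e1[start:stop] + e2[stop:]
    outlier_features = e1_swapped + e2_swapped  # concatenation [E~1, E~2]

    # Eq. (4): interpolate between the true class and the extra outlier class.
    lam = n_swap / n_max
    soft_label = [0.0] * (num_classes + 1)
    soft_label[true_class] = 1.0 - lam
    soft_label[num_classes] = lam
    return outlier_features, soft_label, n_swap


# Toy embeddings with constant values so the swapped entries are easy to spot.
e1 = [1.0] * 8
e2 = [2.0] * 8
features, label, n_swap = multimodal_feature_swap(
    e1, e2, true_class=1, num_classes=3, n_min=2, n_max=4, rng=random.Random(0)
)
```

With $n_{\text{swap}}=n_{\text{max}}$ the label places all of its mass on the outlier class, while small swaps keep most of the mass on the true class, which matches the role of small swaps as hard negatives.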
Figure 4 illustrates a t-SNE visualization of the embedding space under different $n_{\text{swap}}$ values. For small $n_{\text{swap}}$, the generated outliers (red) lie close to the ID clusters (blue), acting as hard negatives. As $n_{\text{swap}}$ increases, the outliers gradually move farther from the ID manifold, confirming that MFS provides a controllable mechanism for generating diverse and realistic failure cases. This property is particularly valuable for training models that must remain sensitive to subtle misclassification signals, especially in multimodal scenarios where errors often stem from partial or conflicting evidence. By introducing corrupted or ambiguous multimodal outliers, MFS encodes the prior knowledge of what is uncertain and should be assigned low confidence, thereby teaching the model to recognize broader patterns of uncertainty and enhancing its robustness in detecting real-world misclassifications. Additional discussion on MFS is provided in the Appendix.

Overall, MFS offers a simple, generalizable, and computationally efficient approach to simulating realistic failure cases for multimodal failure detection without requiring external data. The loss for the synthetic outliers is defined as:

$$\mathcal{L}_{\text{outlier}}=\mathrm{CE}(\hat{p}_{o},\mathbf{y}_{\text{swapped}}),$$ (5)

where $\mathrm{CE}$ denotes the cross-entropy loss. The final training objective integrates all components:

$$\mathcal{L}_{\text{total}}=\mathcal{L}_{\text{cls}}+\mathcal{L}_{\text{outlier}}+\lambda_{\text{acl}}\mathcal{L}_{\text{acl}},$$ (6)

where $\mathcal{L}_{\text{cls}}$ is the cross-entropy loss for the original training samples, and $\lambda_{\text{acl}}$ is a hyperparameter that balances the influence of $\mathcal{L}_{\text{acl}}$.

2.6 Inference

Our method focuses on detecting misclassified samples within known classes.
Therefore, during the test phase, evaluation is performed exclusively on the original $C$ classes. Specifically, for a given input $\mathbf{x}$, the predicted label is $\hat{y}=\mathop{\mathrm{argmax}}_{y\in\mathcal{Y}}\hat{p}$, and the corresponding confidence is determined using the common MSP score, i.e., $\kappa(\mathbf{x})=\max_{y\in\mathcal{Y}}\hat{p}$.

3 Experiments

3.1 Experimental Setup

Datasets. We evaluate our proposed framework on four action recognition datasets sourced from the MultiOOD benchmark [16]: HMDB51 [34], Kinetics-600 [31], HAC [15], and EPIC-Kitchens [11]. Each of these datasets incorporates video and optical flow modalities. For the HAC dataset, we also include evaluations utilizing the audio modality. Further details on each dataset are in the Appendix.

Implementation. We conduct experiments across three modalities: video, audio, and optical flow. The MMAction2 [9] toolkit is adopted for all experiments. To encode visual information, we utilize the SlowFast network [17], initialized with weights pre-trained on the Kinetics-400 dataset [31]. For the audio encoder, we employ a ResNet-18 architecture [24], with weights initialized from the VGGSound pre-trained checkpoint [6]. Similarly, the optical flow encoder uses the SlowFast network, configured with a slow-only pathway and also leveraging pre-trained weights from Kinetics-400 [31]. The Adam optimizer [33] is used for model training, with a learning rate of 0.0001 and a batch size of 16. The hyperparameters for our proposed method are set as follows: $\lambda_{\text{acl}}=2.0$, $n_{\text{min}}=32$, $n_{\text{max}}=256$. We train the models for 50 epochs on an NVIDIA RTX 3090 GPU and select the model with the best performance on the validation dataset.

Baselines. We compare our approach against several standard confidence-scoring functions, including MSP [26], MaxLogit [25], Energy [41], and Entropy [5].
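For reference, the four confidence-scoring baselines named above, together with the threshold rule of Eq. (2), can be sketched as follows. These are the commonly used textbook forms of each score, written so that a higher value always means higher confidence; they are not taken from the paper's code. The function names, the temperature default, and the example logits and threshold are assumptions for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def msp_score(logits):
    """Maximum softmax probability."""
    return max(softmax(logits))


def max_logit_score(logits):
    """Largest raw logit."""
    return max(logits)


def energy_score(logits, temperature=1.0):
    """Negative free energy T * logsumexp(logits / T); higher means more confident."""
    m = max(logits)
    return m + temperature * math.log(
        sum(math.exp((z - m) / temperature) for z in logits)
    )


def neg_entropy_score(logits):
    """Negative Shannon entropy of the softmax distribution."""
    return sum(p * math.log(p) for p in softmax(logits) if p > 0.0)


def decide(score, tau):
    """Eq. (2): accept the prediction as correct when the score reaches the threshold."""
    return "correct" if score >= tau else "misclassified"


# Hypothetical logits: one peaked prediction and one nearly flat prediction.
confident = [6.0, 0.5, 0.2]
ambiguous = [1.0, 0.9, 0.8]
scorers = [msp_score, max_logit_score, energy_score, neg_entropy_score]
all_agree = all(score(confident) > score(ambiguous) for score in scorers)
```

Because the scores live on different scales, the threshold $\tau$ has to be chosen per scoring function; threshold-free metrics such as AUROC avoid that choice when comparing them.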
Additionally, we adapt unimodal FD methods for our framework, including DOCTOR [21] and OpenMix [73], along with the outlier synthesis techniques Mixup [68] and RegMixup [48]. We also include established training strategies, namely CRL [45] and A

NVIDIA Research

Mar 2, 2026·also ETH, NVIDIA

Adaptive Confidence Regularization for Multimodal Failure Detection

Multimodal models often exhibit lower confidence than their unimodal counterparts when they're about to fail, and this work leverages that insight to build a better failure detector.

Moru Liu, Mario Trapp

Eval Frameworks & Benchmarks Multimodal Models Red-Teaming & Adversarial Robustness
