Inhyeok Choi

Ulsan National Institute of Science and Technology, South Korea

$\mathbf{f}_{\text{out}}=\text{BatchNorm}\left(\frac{1}{C}\sum_{i=1}^{C}\mathbf{F}_{i}\right)$ (19)

Gated Linear Unit (GLU) Fusion: A gating mechanism that controls information flow by learning to suppress noisy or irrelevant feature dimensions while preserving informative ones, acting as a learned noise filter.

$\mathbf{c}=\mathbf{x}_{s}+\mathbf{x}_{t}$ (20)

$\mathbf{g}=\sigma\left(\mathbf{W}_{g}[\mathbf{x}_{s}\|\mathbf{x}_{t}]+\mathbf{b}_{g}\right)\in\mathbb{R}^{B\times D}$ (21)

$\mathbf{f}_{\text{out}}=\text{BatchNorm}(\mathbf{c}\odot\mathbf{g})$ (22)

where $\sigma$ is the sigmoid function, $\odot$ denotes element-wise multiplication, and $\mathbf{W}_{g}\in\mathbb{R}^{D\times 2D}$

D tensor of shape $(C,F,T)$ per trial. For the purposes of our comparative analysis, we refer to the standalone Spectral Encoder Network as SPEN, while ASPEN represents the full hybrid framework utilizing multiplicative fusion. These power spectrograms are fed into the SPEN convolutional blocks, where the chosen window and hop sizes control the trade-off between temporal resolution $\Delta t=(n_{\mathrm{perseg}}-n_{\mathrm{overlap}})/f_{s}$ and frequency resolution $\Delta f=f_{s}/n_{\mathrm{fft}}$, allowing ASPEN to capture task-relevant harmonic structure for SSVEP, evoked components for P300, and $\mu/\beta$ band dynamics for MI in a unified spectral representation.

2.4 Model Architecture

Figure 3: Detailed view of temporal stream, spectral stream, and multiplicative fusion components.

The high-level architecture is shown in Figure 1. ASPEN consists of two primary components: a Temporal Stream and a Spectral Stream.
Given a raw EEG trial $\mathbf{X}\in\mathbb{R}^{C\times T}$, the two complementary inputs are the raw signal $\mathbf{X}_{\text{time}}$ for the temporal stream and the per-channel STFT magnitude spectrograms $\mathbf{X}_{\text{spec}}\in\mathbb{R}^{C\times F\times T^{\prime}}$ for the spectral stream. A detailed view of the two streams and the fusion mechanism is illustrated in Figure 3.

The spectral stream extracts frequency-time patterns through a two-stage CNN with squeeze-and-excitation (SE) attention (Hu et al., 2018) and residual blocks. SE modules adaptively recalibrate channel responses by learning to emphasize informative spectral patterns, while residual connections improve gradient flow. After two stages of convolution, SE attention, and pooling, features are projected and averaged across EEG channels to yield $\mathbf{x}_{s}\in\mathbb{R}^{d}$.

The temporal stream follows an EEGNet-inspired design (Lawhern et al., 2018) in which the temporal convolution learns frequency-specific filters analogous to bandpass filtering, the depthwise spatial convolution learns channel combinations analogous to CSP (Ang et al., 2008), and the separable convolutions efficiently extract higher-order features. The output is projected to $\mathbf{x}_{t}\in\mathbb{R}^{d}$.

We combine the stream representations via element-wise multiplication after learned linear projections:

$\mathbf{z}=(\mathbf{W}_{s}\mathbf{x}_{s})\odot(\mathbf{W}_{t}\mathbf{x}_{t})$ (1)

where $\mathbf{W}_{s},\mathbf{W}_{t}\in\mathbb{R}^{d\times d}$ are learnable matrices and $\odot$ denotes the Hadamard product. This multiplicative interaction acts as cross-modal gating. Dimension $z_{i}$ is large only when both streams produce strong activations, effectively requiring agreement between spectral and temporal evidence. Features that appear prominently in only one view, which is often indicative of artifacts or noise, are suppressed.
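To make the gating behaviour of Equation 1 concrete, the following is a minimal pure-Python sketch, not the authors' implementation. The toy dimension ($d=3$), the identity projections, and the feature values are hypothetical and chosen only so the arithmetic stays visible; in the model, $\mathbf{W}_{s}$ and $\mathbf{W}_{t}$ are learned.

```python
def matvec(W, x):
    """Multiply a d x d matrix (given as a list of rows) by a length-d vector."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]


def multiplicative_fusion(W_s, x_s, W_t, x_t):
    """Equation 1: z = (W_s x_s) * (W_t x_t), taken element-wise."""
    p_s = matvec(W_s, x_s)
    p_t = matvec(W_t, x_t)
    return [a * b for a, b in zip(p_s, p_t)]


# Hypothetical toy inputs; identity projections stand in for learned matrices.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
x_s = [2.0, 0.0, 3.0]  # spectral features: active in dimensions 0 and 2
x_t = [1.5, 4.0, 0.0]  # temporal features: active in dimensions 0 and 1

z = multiplicative_fusion(identity, x_s, identity, x_t)
additive = [a + b for a, b in zip(x_s, x_t)]  # additive fusion, for contrast

print(z)         # [3.0, 0.0, 0.0]
print(additive)  # [3.5, 4.0, 3.0]
```

Only dimension 0, where both streams are active, survives the product. The additive sum instead passes the single-stream activations in dimensions 1 and 2 through unchanged.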
The fused representation passes through batch normalization and a linear classifier. The model is optimized using task-specific loss functions. Binary cross-entropy with logits ($\mathcal{L}_{BCE}$) is used for two-class paradigms such as P300 and includes automated positive-weight scaling to mitigate class imbalance. For multi-class tasks such as SSVEP and Motor Imagery, standard cross-entropy loss ($\mathcal{L}_{CE}$) is employed. The learned weights $w_{s}$ and $w_{t}$ provide an interpretable measure of each modality’s contribution, enabling an analysis of which representation the model prioritizes for different paradigms and individual trials.

3 Experiments

3.1 Baselines

To evaluate the performance of our proposed method, we benchmarked against five baselines that emphasize cross-subject generalization and novel data representations. EEGNet (Lee et al., 2019) serves as a compact convolutional baseline, leveraging depthwise and separable convolutions to efficiently extract spatial and frequency-specific features with minimal parameters. To model global dependencies, EEGConformer (Song et al., 2022) adopts a hybrid design that combines CNNs for local feature extraction with Transformer modules for long-range temporal modeling. CTNet (Zhao et al., 2024) is included for its emphasis on cross-task and cross-subject robustness, utilizing domain-invariant representations to mitigate EEG non-stationarity. TSformer-SA (Li et al., 2025) integrates temporal and spectral features through cross-view self-attention, enabling joint modeling of time-domain signals and wavelet-based time-frequency representations for improved cross-subject decoding. Finally, MultiDiffNet (Zhang et al., 2025) incorporates multi-scale differential transformations of the input signal to better capture complex distributions and enhance training stability in noisy data environments.
All models were evaluated using identical training schedules and hyperparameters unless otherwise specified by architectural constraints. The performance of these baselines is shown in Table 2.

3.2 Ablations

Table 1: Ablation study summary. Best STFT parameters and fusion strategy per dataset, selected by unseen-subject accuracy. Best Acc = best fusion accuracy on held-out subjects (%), Mult Acc = multiplicative fusion accuracy (%), $\Delta$ = absolute difference. Global Attn = Global Attention, Bilinear = Low-rank Bilinear.

| Dataset | nperseg | noverlap | nfft | Best Fusion | Best Acc | Mult Acc | $\Delta$ |
|---|---|---|---|---|---|---|---|
| Wang2016 SSVEP | 256 | 128 | 256 | Global Attn | 72.76 | 69.47 | -3.3 |
| Lee2019 SSVEP | 256 | 128 | 1024 | Bilinear | 86.68 | 85.71 | -1.0 |
| BI2014b P300 | 32 | 16 | 512 | Bilinear | 73.52 | 66.12 | -7.4 |
| BNCI2014-009 P300 | 128 | 120 | 256 | Multiplicative | 89.82 | 89.82 | – |
| BNCI2014-001 MI | 512 | 256 | 512 | Multiplicative | 30.73 | 30.73 | – |
| Lee2019 MI | 32 | 30 | 32 | Multiplicative | 75.70 | 75.70 | – |

STFT Settings: Since SPEN operates on time-frequency representations, we first optimized the STFT frontend through a controlled ablation study. For each task, we evaluated 27 configurations by sweeping three values per parameter: window length (nperseg), overlap ratio (0.50, 0.75, 0.9375), and FFT size (nfft). Window lengths were defined in a task-aware manner (half, default, and maximum resolution), constrained to never exceed trial length; noverlap was derived from the overlap ratio subject to noverlap < nperseg; and nfft was drawn from a pool containing the smallest power of two $\geq$ nperseg and the task default, with nfft $\geq$ nperseg enforced. For each configuration, we trained the same SPEN backbone with the same training protocol to isolate the effect of STFT settings, then evaluated on validation and test splits using accuracy and F1/recall (plus ROC-AUC and PR-AUC for imbalanced binary P300 tasks).
For binary tasks, we re-optimized the decision threshold on the validation set (F1-maximizing sweep) and applied it to test evaluation; for P300 tasks we additionally used a WeightedRandomSampler to balance training batches. The best STFT setting was selected per task (F1 for P300, accuracy otherwise), and the top three STFT configurations were carried forward into subsequent fusion ablations to avoid confounding fusion comparisons with suboptimal preprocessing. Full details are provided in Appendix B.1.

Fusion Strategies: We evaluated seven fusion strategies for combining the temporal and spectral streams, drawing from foundational methods in multimodal learning (Liang et al., 2024a): (1) static equal weighting, (2) global attention with learned trial-level weights, (3) spatial attention with per-channel weighting, (4) gated linear units (GLU) for noise suppression, (5) element-wise multiplicative fusion, (6) low-rank bilinear pooling, and (7) multi-head cross-attention between streams. Each strategy was evaluated across all benchmark tasks using the three best-performing STFT configurations from our spectrogram ablation. While the optimal fusion varied by dataset, multiplicative fusion achieved the highest unseen-subject accuracy on three of six tasks (BNCI2014-009 P300, BNCI2014-001 MI, Lee2019 MI) and remained competitive on two others (within 3.3 percentage points on Wang2016 SSVEP and 1.0 on Lee2019 SSVEP). Bilinear fusion outperforms multiplicative fusion on BI2014b P300 by 7.4 percentage points, but we prioritize cross-paradigm consistency over peak single-task performance. Given multiplicative fusion’s stable cross-paradigm performance and its alignment with our cross-modal gating hypothesis (Equation 1), we adopt it as the unified fusion strategy for ASPEN. This selection also determined which STFT configuration to use for final evaluation. The best fusion method and STFT parameters are summarized in Table 1. Full mathematical details and per-task results are provided in Appendix C.1.
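The per-dataset gaps between the best fusion strategy and multiplicative fusion can be recomputed directly from the accuracies in Table 1. The short script below is only a worked check of that arithmetic; the dictionary layout and variable names are chosen for this example.

```python
# (best fusion accuracy, multiplicative fusion accuracy) on held-out subjects,
# copied from Table 1 (values in percent).
table1 = {
    "Wang2016 SSVEP": (72.76, 69.47),
    "Lee2019 SSVEP": (86.68, 85.71),
    "BI2014b P300": (73.52, 66.12),
    "BNCI2014-009 P300": (89.82, 89.82),
    "BNCI2014-001 MI": (30.73, 30.73),
    "Lee2019 MI": (75.70, 75.70),
}

# Gap in percentage points; zero means multiplicative fusion was itself the best.
gaps = {name: round(best - mult, 2) for name, (best, mult) in table1.items()}
wins = [name for name, gap in gaps.items() if gap == 0.0]

for name, gap in gaps.items():
    print(f"{name}: {gap:.2f}")
print(len(wins))  # 3
```

The nonzero gaps come out to 3.29 (Wang2016 SSVEP), 0.97 (Lee2019 SSVEP), and 7.40 (BI2014b P300) percentage points, matching the rounded $\Delta$ column.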
3.3 Results

Table 2: Final results across tasks and models. Cross-subject generalization accuracy (%). Bold indicates best cross-subject performance per dataset. Mean ± STD across three seeds.

| Task | Dataset | Cross- | EEGNet | EEGConf. | MultiDiff. | TSformer-SA | CTNet | SPEN | ASPEN |
|---|---|---|---|---|---|---|---|---|---|
| SSVEP | Wang2016 | Session | 81.96±6.02 | 56.32±4.15 | 91.74±1.62 | 47.96±4.16 | 88.37±4.20 | 85.71±1.57 | 73.98±1.46 |
| | | Subject | 74.25±5.27 | 49.95±2.81 | **87.95±2.56** | 39.93±5.25 | 83.60±0.82 | 78.82±6.41 | 67.20±4.83 |
| | Lee2019 | Session | 95.04±0.35 | 92.00±0.94 | 93.67±0.67 | 79.98±6.60 | 95.36±0.61 | 70.58±0.80 | 95.50±0.33 |
| | | Subject | 86.51±0.09 | 81.99±1.05 | 85.04±0.76 | 63.38±6.43 | 87.25±0.69 | 58.51±0.98 | **87.53±0.29** |
| P300 | BI2014b | Session | 64.74±2.71 | 78.84±3.46 | 81.46±4.56 | 82.25±3.04 | 76.06±6.45 | 84.15±0.93 | 77.95±5.52 |
| | | Subject | 62.64±1.01 | 77.55±4.59 | 80.95±3.91 | **83.13±0.35** | 74.55±8.41 | 82.96±0.64 | 77.01±7.16 |
| | BNCI2014 | Session | 84.65±0.62 | 84.25±1.09 | 85.42±0.89 | 86.75±0.68 | 86.26±0.38 | 78.31±4.40 | 89.65±0.48 |
| | | Subject | 84.05±1.84 | 83.20±1.29 | 84.91±1.68 | 86.92±1.06 | 85.28±0.90 | 77.97±4.66 | **88.57±0.76** |
| MI | BNCI2014 | Session | 61.21±3.27 | 56.35±3.05 | 58.53±4.46 | 35.91±6.48 | 57.34±7.68 | 33.63±3.32 | 51.59±8.40 |
| | | Subject | **37.29±4.45** | 33.91±7.33 | 29.05±4.00 | 28.99±4.82 | 36.29±9.82 | 26.91±3.69 | 32.00±0.53 |
| | Lee2019 | Session | 78.89±0.38 | 76.28±0.91 | 76.54±2.47 | 71.90±0.74 | 77.15±0.77 | 55.28±0.45 | 77.93±0.52 |
| | | Subject | 75.88±1.98 | 74.55±1.39 | 74.67±1.95 | 71.40±2.61 | 74.98±2.13 | 53.50±1.69 | **76.27±0.73** |

The performance of SPEN, ASPEN, and five baselines across six benchmark datasets is summarized in Table 2. Our results demonstrate that ASPEN achieves the best cross-subject generalization on three of the six evaluated datasets: Lee2019 SSVEP (87.53%), BNCI2014 P300 (88.57%), and Lee2019 MI (76.27%).
Notably, on the BNCI2014 P300 task, ASPEN outperforms TSformer-SA by 1.65 percentage points (88.57% vs. 86.92%), despite the latter being specifically designed for evoked potential decoding. This suggests that our multiplicative gating mechanism is more effective at filtering the inter-subject noise prevalent in large-scale P300 datasets. We also observe that while TSformer-SA performs competitively on P300-like tasks (83.13% on BI2014b), its performance degrades substantially on SSVEP and MI tasks (39.93% on Wang2016 SSVEP). In contrast, ASPEN maintains competitive performance across all three tasks. This suggests that multiplicative fusion acts as a general architectural prior that adapts to the specific spectral-temporal demands of the underlying neural signal.

The standalone spectral encoder (SPEN) struggles on Motor Imagery tasks, achieving only 26.91% on BNCI2014-001 MI (barely above the 25% chance level) and 53.50% on Lee2019 MI. These results reveal that cross-subject stability and discriminative power are distinct properties. While spectral representations are more consistent across individuals, Motor Imagery classification relies on precise temporal dynamics of sensorimotor rhythms that are lost in the STFT magnitude representation. The performance recovery from SPEN to ASPEN on Lee2019 MI (53.50% to 76.27%) indicates that the spectral stream alone does not suffice and that cross-modal fusion is essential for robust generalization.

4 Analysis

Our experimental results demonstrate that ASPEN achieves superior or competitive cross-subject generalization compared to state-of-the-art baselines. To understand the drivers of this performance, we analyze the impact of multiplicative fusion, the paradigm-specific reliance on spectral versus temporal features, and the role of spectral stability.

4.1 Mechanism of Multiplicative Spectral-Temporal Gating

Figure 4: Stream contributions and feature correlation ($\rho$) across datasets. Low correlation values confirm that streams capture distinct information.
To investigate how ASPEN leverages dual-stream information, we analyze the features during inference. The fused representation is defined as:

$\mathbf{z}_{\text{fused}}=\text{proj}_{S}(\mathbf{x}_{S})\odot\text{proj}_{T}(\mathbf{x}_{T})$ (2)

where $\mathbf{x}_{S}$ and $\mathbf{x}_{T}$ denote the spectral and temporal features respectively, and $\odot$ represents element-wise multiplication. We quantify the relative spectral magnitude $w_{S}$ via the normalized L2 norm of the projected features:

$w_{S}=\frac{\|\text{proj}_{S}(\mathbf{x}_{S})\|_{2}}{\|\text{proj}_{S}(\mathbf{x}_{S})\|_{2}+\|\text{proj}_{T}(\mathbf{x}_{T})\|_{2}}$ (3)

with $w_{T}=1-w_{S}$ representing the relative temporal magnitude. Stream complementarity is measured through the feature correlation $\rho$, defined as the cosine similarity between the projected features:

$\rho=\frac{\langle\text{proj}_{S}(\mathbf{x}_{S}),\text{proj}_{T}(\mathbf{x}_{T})\rangle}{\|\text{proj}_{S}(\mathbf{x}_{S})\|_{2}\cdot\|\text{proj}_{T}(\mathbf{x}_{T})\|_{2}}$ (4)

As illustrated in Figure 4, ASPEN adaptively shifts its reliance on spectral vs. temporal features based on the task at hand. The P300 task (BI2014b) exhibits strong spectral dominance ($w_{S}=89.9\%$), while SSVEP tasks (Wang2016, Lee2019) shift toward the temporal stream, with $w_{T}$ reaching $64.8\%$ and $71.0\%$, respectively. Motor imagery datasets show a spectrally-biased distribution ($w_{S}\approx 73\%$–$79\%$). Across all datasets, the low correlation values ($0.15\leq\rho\leq 0.31$) confirm that the streams capture distinct, non-redundant information. This architecture functions as a strict cross-modal gating mechanism. Unlike additive fusion, where high-magnitude artifacts in one modality can bias the decision boundary, our multiplicative approach acts as a logical AND gate.
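The two diagnostics in Equations 3 and 4 are straightforward to compute for a single trial once the projected features are available. The following pure-Python sketch assumes non-zero projected vectors; the function names and toy values are hypothetical, and how the reported figures are aggregated over trials is not specified here.

```python
import math


def l2_norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(a * a for a in v))


def stream_statistics(p_s, p_t):
    """Return (w_S, w_T, rho) for projected spectral and temporal features.

    w_S follows Equation 3, w_T = 1 - w_S, and rho is the cosine
    similarity of Equation 4. Both inputs are assumed to be non-zero.
    """
    n_s, n_t = l2_norm(p_s), l2_norm(p_t)
    w_s = n_s / (n_s + n_t)
    rho = sum(a * b for a, b in zip(p_s, p_t)) / (n_s * n_t)
    return w_s, 1.0 - w_s, rho


# Hypothetical projected features for one trial.
p_s = [9.0, 12.0, 0.0]  # spectral projection, L2 norm 15
p_t = [0.0, 3.0, 4.0]   # temporal projection, L2 norm 5

w_s, w_t, rho = stream_statistics(p_s, p_t)
print(w_s, w_t, rho)  # 0.75 0.25 0.48
```

In this toy case the spectral projection carries three quarters of the total norm, while the cosine similarity of 0.48 reflects that the two vectors overlap in only one dimension.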
A feature is only activated in the fused representation if it receives concurrent support from both streams. Consequently, transient artifacts, such as muscle noise that appears in the temporal domain but lacks spectral consistency, are naturally suppressed. As evidenced in Table 1, forcing this cross-view agreement encourages the model to prioritize features robust to the phase shifts and amplitude variations inherent in cross-subject transfer.

4.2 Visualizing Decision Boundaries

Figure 5: Grad-CAM visualization of feature importance for P300 classification. Correct prediction (top) shows focused attention on physiologically relevant low-frequency bands. Misclassification (bottom) reveals scattered attention towards high-frequency noise artifacts.

To validate the interpretability of our framework and justify the necessity of cross-modal fusion, we visualized the learned features using Grad-CAM (Selvaraju et al., 2017). We analyzed the spectral regions contributing most to the model’s decisions in both successful and failed prediction scenarios on the P300 dataset. Fig. 5 illustrates the Grad-CAM activation maps for two representative samples. As shown in the top row of Fig. 5, when the model correctly identifies the target class with high confidence, the activation hotspot is tightly concentrated in the low-frequency band and specific temporal windows. This is consistent with neurophysiological knowledge, as P300 components are primarily characterized by low-frequency energy deflections. The model ignores high-frequency background activity in this example, suggesting that it has learned robust, physiologically valid features. Conversely, the bottom row of Fig. 5 shows a misclassified sample with low confidence. Here, the model’s attention is fragmented and scattered across high-frequency bands, likely driven by muscle artifacts or instrument noise rather than neural signals.
This distraction by high-frequency noise highlights the vulnerability of single-stream interactions where artifactual high-amplitude spikes can propagate to the decision layer.

5 Conclusion and Outlook

In this work, we introduced ASPEN, a multimodal framework designed to overcome the challenges of cross-subject generalization in EEG-based BCIs. By leveraging the inherent stability of spectral representations, ASPEN utilizes a multiplicative fusion mechanism to enforce cross-modal agreement. Our experiments across six benchmark datasets show that this approach effectively suppresses non-neural artifacts and prioritizes robust features, achieving the best unseen-subject performance on three datasets spanning SSVEP, P300, and Motor Imagery paradigms. While ASPEN significantly reduces the performance gap for new users, several avenues for future research remain. Future work will investigate automated configuration optimization, perhaps through learnable time-frequency transforms, to move toward a truly “one-size-fits-all” zero-shot model. Additionally, we aim to explore the integration of self-supervised pre-training on large multi-subject EEG corpora to further enhance the richness of the shared latent space.

Acknowledgments

We would like to thank Professor Bhiksha Raj of Carnegie Mellon University for his guidance and support throughout this project. This work was partly supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (RS-2022-00143911, AI Excellence Global Innovative Leader Education Program).

References

K. K. Ang, Z. Y. Chin, H. Zhang, and C. Guan (2008) Filter bank common spatial pattern (fbcsp) in brain-computer interface. In 2008 IEEE international joint conference on neural networks (IEEE world congress on computational intelligence), pp. 2390–2397. Cited by: §1.1, §1, §2.4. P. Aricò, F. Aloise, F. Schettini, S. Salinari, D. Mattia, and F.
Cincotti (2014) Influence of p300 latency jitter on event related potential-based brain–computer interface performance. Journal of neural engineering 11 (3), pp. 035008. Cited by: §2.1. Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. March, and V. Lempitsky (2016) Domain-adversarial training of neural networks. Journal of machine learning research 17 (59), pp. 1–35. Cited by: §1.1. J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7132–7141. Cited by: §2.4. T. M. Ingolfsson, M. Hersche, X. Wang, N. Kobayashi, L. Cavigelli, and L. Benini (2020) EEG-tcnet: an accurate temporal convolutional network for embedded motor-imagery brain–machine interfaces. In 2020 IEEE international conference on systems, man, and cybernetics (SMC), pp. 2958–2965. Cited by: §1.1. L. Korczowski, E. Ostaschenko, A. Andreev, G. Cattan, P. L. C. Rodrigues, V. Gautheret, and M. Congedo (2019) Brain invaders solo versus collaboration: multi-user p300-based brain-computer interface dataset (bi2014b). Ph.D. Thesis, GIPSA-lab. Cited by: §2.1. V. J. Lawhern, A. J. Solon, N. R. Waytowich, S. M. Gordon, C. P. Hung, and B. J. Lance (2018) EEGNet: a compact convolutional neural network for eeg-based brain–computer interfaces. Journal of neural engineering 15 (5), pp. 056013. Cited by: §1.1, §1, §2.4. M. Lee, O. Kwon, Y. Kim, H. Kim, Y. Lee, J. Williamson, S. Fazli, and S. Lee (2019) EEG dataset and openbmi toolbox for three bci paradigms: an investigation into bci illiteracy. GigaScience 8 (5), pp. giz002. Cited by: §2.1, §2.3, §3.1. X. Li, W. Wei, S. Qiu, and H. He (2025) A temporal–spectral fusion transformer with subject-specific adapter for enhancing rsvp-bci decoding. Neural Networks 181, pp. 106844. Cited by: §1.1, §1, §3.1. Y. Li, L. Guo, Y. Liu, J. Liu, and F. 
Meng (2021) A temporal-spectral-based squeeze-and-excitation feature fusion network for motor imagery eeg decoding. IEEE Transactions on Neural Systems and Rehabilitation Engineering 29, pp. 1534–1545. Cited by: §1.1, §1. P. P. Liang, A. Zadeh, and L. Morency (2024a) Foundations & trends in multimodal machine learning: principles, challenges, and open questions. ACM Computing Surveys 56 (10), pp. 1–42. Cited by: §3.2. S. Liang, L. Li, W. Zu, W. Feng, and W. Hang (2024b) Adaptive deep feature representation learning for cross-subject eeg decoding. BMC bioinformatics 25 (1), pp. 393. Cited by: §1. K. Liu, M. Yang, Z. Yu, G. Wang, and W. Wu (2022) FBMSNet: a filter-bank multi-scale convolutional neural network for eeg-based motor imagery decoding. IEEE Transactions on Biomedical Engineering 70 (2), pp. 436–445. Cited by: §1.1. W. Lu, X. Zhang, L. Xia, H. Ma, and T. Tan (2024) Domain adaptation spatial feature perception neural network for cross-subject eeg emotion recognition. Frontiers in Human Neuroscience 18, pp. 1471634. Cited by: §1. J. Luo, W. Cui, S. Xu, L. Wang, X. Li, X. Liao, and Y. Li (2023) A dual-branch spatio-temporal-spectral transformer feature fusion network for eeg-based visual recognition. IEEE Transactions on Industrial Informatics 20 (2), pp. 1721–1731. Cited by: §1.1. R. Mane, N. Robinson, A. P. Vinod, S. Lee, and C. Guan (2020) A multi-view cnn with novel variance layer for motor imagery brain computer interface. In 2020 42nd annual international conference of the IEEE engineering in medicine & biology society (EMBC), pp. 2950–2953. Cited by: §1.1, §1. S. Morales and M. E. Bowers (2022) Time-frequency analysis methods and their application in developmental eeg data. Developmental Cognitive Neuroscience 54, pp. 101067. External Links: Document Cited by: §1.1. Y. K. Musallam, N. I. AlFassam, G. Muhammad, S. U. Amin, M. Alsulaiman, W. Abdul, H. Altaheri, M. A. Bencherif, and M. 
Algabri (2021) Electroencephalography-based motor imagery classification using temporal convolutional network fusion. Biomedical Signal Processing and Control 69, pp. 102826. Cited by: §1.1. Y. Roy, H. Banville, I. Albuquerque, A. Gramfort, T. H. Falk, and J. Faubert (2019) Deep learning-based electroencephalography analysis: a systematic review. Journal of neural engineering 16 (5), pp. 051001. Cited by: §1. R. T. Schirrmeister, J. T. Springenberg, L. D. J. Fiederer, M. Glasstetter, K. Eggensperger, M. Tangermann, F. Hutter, W. Burgard, and T. Ball (2017) Deep learning with convolutional neural networks for eeg decoding and visualization. Human brain mapping 38 (11), pp. 5391–5420. Cited by: §1.1. R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra (2017) Grad-cam: visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, pp. 618–626. Cited by: §4.2. Y. Song, Q. Zheng, B. Liu, and X. Gao (2022) EEG conformer: convolutional transformer for eeg decoding and visualization. IEEE Transactions on Neural Systems and Rehabilitation Engineering 31, pp. 710–719. Cited by: §1.1, §1, §3.1. M. Tangermann, K. Müller, A. Aertsen, N. Birbaumer, C. Braun, C. Brunner, R. Leeb, C. Mehring, K. J. Miller, G. R. Müller-Putz, et al. (2012) Review of the bci competition iv. Frontiers in neuroscience 6, pp. 55. Cited by: §2.1. Z. Wan, R. Yang, M. Huang, N. Zeng, and X. Liu (2021) A review on transfer learning in eeg signal analysis. Neurocomputing 421, pp. 1–14. Cited by: §1. J. Wang, L. Yao, and Y. Wang (2023) IFNet: an interactive frequency convolutional neural network for enhancing motor imagery decoding from eeg. IEEE Transactions on Neural Systems and Rehabilitation Engineering 31, pp. 1900–1911. Cited by: §1.1. Y. Wang, X. Chen, X. Gao, and S. Gao (2016) A benchmark dataset for ssvep-based brain–computer interfaces. 
IEEE Transactions on Neural Systems and Rehabilitation Engineering 25 (10), pp. 1746–1752. Cited by: §2.1, §2.3. K. Zhang, N. Robinson, S. Lee, and C. Guan (2021) Adaptive transfer learning for eeg motor imagery classification with deep convolutional neural network. Neural Networks 136, pp. 1–10. Cited by: §1.1. M. Zhang, K. Shapovalenko, Y. Shao, E. Guo, and P. Pradhan (2025) MultiDiffNet: a multi-objective diffusion framework for generalizable brain decoding. Note: Preprint: arXiv:2511.18294 Cited by: §1.1, §3.1. W. Zhao, X. Jiang, B. Zhang, S. Xiao, and S. Weng (2024) CTNet: a convolutional transformer network for eeg-based motor imagery classification. Scientific reports 14 (1), pp. 20237. Cited by: §1.1, §3.1. Appendix A Dataset Specifications and Splitting To evaluate model robustness across datasets, we employed a multi-tier splitting strategy with a 60/20/20 ratio for subjects included in training. For each task, a subset of subjects was designated as “seen” and their data was partitioned into three splits. The Training split (60%) was used for model optimization, the Validation split (20%) for hyperparameter tuning, and Test 1 (20%) for cross-session evaluation. Test 1 specifically evaluates the model’s ability to generalize across different recording sessions from the same individuals, thereby assessing robustness to within-subject temporal variations. To assess zero-shot generalization capabilities, a separate cohort of subjects was entirely withheld from the training process. Test 2 (Cross-Subject) evaluates model performance on individuals with previously unencountered physiological profiles, providing a rigorous test of subject-independent performance. Table 3: EEG Dataset Specifications and Preprocessing Parameters. 
Z-score normalization was applied to all datasets to standardize signal amplitudes across different subjects and sessions, ensuring that high-voltage artifacts or individual physiological variations do not disproportionately influence model training.

| Task | Dataset | Subj. | Chan. | Cls. | Bandpass | Epoch | SR |
|---|---|---|---|---|---|---|---|
| SSVEP: Frequency-tagged visual decoding | Wang2016 | 35 | 64 | 26 | 6–90 Hz | 1.0s | 250 Hz |
| | Lee2019 | 54 | 62 | 4 | 6–90 Hz | 1.0s | 250 Hz |
| P300: Binary target vs. non-target ERP | BI2014b | 38 | 32 | 2 | 0.1–30 Hz | 1.0s | 256 Hz |
| | BNCI2014_009 | 10 | 16 | 2 | 1–24 Hz | 1.0s | 256 Hz |
| MI: Motor imagery decoding | BNCI2014_001 | 9 | 22 | 4 | 4–40 Hz | 4.0s | 250 Hz |
| | Lee2019 | 54 | 22 | 2 | 4–40 Hz | 4.0s | 250 Hz |

B STFT Ablation

B.1 STFT Methods

STFT-based Spectral Representation. For each EEG trial, we construct a spectral (time–frequency) representation by applying the Short-Time Fourier Transform (STFT) independently to each channel. Let $x_{c}[n]$ denote the preprocessed discrete-time signal of channel $c\in\{1,\dots,C\}$ with sampling rate $f_{s}$ (Hz), and let $w[m]$ be a Hann window of length $n_{\mathrm{perseg}}$. We compute the complex STFT as

$Z_{c}(f_{k},t_{\ell})=\sum_{m=0}^{n_{\mathrm{perseg}}-1}x_{c}[\ell H+m]\;w[m]\;e^{-j2\pi km/n_{\mathrm{fft}}},$ (5)

where $n_{\mathrm{fft}}$ is the FFT size, $H=n_{\mathrm{perseg}}-n_{\mathrm{overlap}}$ is the hop size (in samples), and $(f_{k},t_{\ell})$ index frequency and time frames, respectively. In practice, we use the one-sided spectrum for real-valued signals, yielding $F=\lfloor n_{\mathrm{fft}}/2\rfloor+$
