Architecture Design (Transformers, SSMs, MoE)
Novel neural network architectures including transformer variants, state space models, mixture of experts, and attention mechanisms.
Recent Papers
This paper introduces ChannelMamba, a novel end-to-end architecture for channel state information (CSI) prediction in 6G massive MIMO IoT systems, addressing the limitations of Transformers in handling high-dimensional, long-sequence channel data. ChannelMamba leverages a dual-domain input module processing both frequency-domain CSI and delay-domain CIR data, a cross-path parameter-sharing strategy for Mamba modules, and a bidirectional Mamba module with lightweight attention for cross-feature modeling. Experimental results demonstrate that ChannelMamba achieves state-of-the-art performance in channel prediction accuracy, robustness, generalization, and computational efficiency compared to existing methods.
Introduces ChannelMamba, a specialized Mamba-based architecture incorporating dual-domain input, cross-path parameter sharing, and bidirectional Mamba modules with attention, to achieve state-of-the-art performance in channel prediction for 6G MIMO-IoT.
This paper introduces a hybrid Mamba-Transformer (MT) framework for remote sensing image super-resolution, aiming to overcome the limitations of CNNs and transformers in capturing long-range dependencies and maintaining computational efficiency. MT combines a focused Mamba block (FMB) with a snake vision state-space module (SVSSM) for global feature modeling and a pixel-adaptive block (PAB) for pixel-level multiscale enhancement. Experiments on benchmark datasets demonstrate that MT outperforms state-of-the-art methods, achieving a better trade-off between performance and computational cost, specifically reducing parameters and FLOPs compared to MambaIRv2 while improving PSNR.
Introduces a novel hybrid Mamba-Transformer architecture that leverages a snake vision state-space module within a Mamba block to improve long-range dependency modeling and reduce computational redundancy for remote sensing image super-resolution.
This paper introduces a model-hardware co-design framework for CNN-based SAR ATR that jointly optimizes adversarial robustness, model compression, and FPGA accelerator design. The framework uses hardware-guided structured pruning, informed by a hardware performance model, to explore robustness-efficiency trade-offs. Experiments on MSTAR and FUSAR-Ship datasets show the framework produces models up to 18.3x smaller with 3.1x fewer MACs while preserving robustness, and the FPGA implementation achieves significant latency and energy efficiency improvements compared to CPU/GPU baselines.
Develops a model-hardware co-design framework that unifies robustness-aware model compression and FPGA accelerator design for CNN-based SAR ATR, enabling exploration of robustness-efficiency trade-offs.
This paper reviews deep learning (DL) approaches for hepatocellular carcinoma (HCC) prediction, highlighting the need for efficient architectures to overcome computational limitations hindering real-world deployment. It discusses lightweight models like MobileNet and EfficientNet, model compression techniques, and data-efficient methods, as well as hybrid approaches to reduce computational load. The review emphasizes the importance of rigorous validation, bias audits, privacy-preserving strategies, and seamless integration into clinical workflows for safe and scalable clinical translation of DL-based HCC prediction.
Synthesizes current advances in efficient deep learning for HCC prediction, identifies persistent challenges, and provides guidance for developing clinically relevant and broadly deployable systems.
This paper introduces Hadamard Linear Attention (HLA), a novel linear attention mechanism designed to more accurately approximate softmax attention. HLA applies a nonlinearity after the computation of pairwise similarities, unlike existing linear attention methods that apply nonlinear kernel functions independently to queries and keys. The authors demonstrate that this approach results in a higher-degree rational function approximation of softmax and show its effectiveness in a large diffusion transformer model for video generation.
Introduces Hadamard Linear Attention (HLA), a linear attention variant that applies a nonlinearity after pairwise similarity computation to better approximate softmax.
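To make the distinction concrete, here is a minimal numpy sketch (not the paper's actual HLA construction — in particular it ignores whatever factorization keeps the cost linear) contrasting kernel-based linear attention with a variant that applies a nonlinearity after the pairwise similarities; the particular kernel and nonlinearity here are assumptions.

```python
import numpy as np

def kernel_linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    # Conventional linear attention: the nonlinearity phi is applied to queries and
    # keys independently, so the n x n similarity matrix is never materialized.
    Qp, Kp = phi(Q), phi(K)
    kv = Kp.T @ V                       # (d, d_v) summary of keys and values
    norm = Qp @ Kp.sum(axis=0)          # (n,) normalizer
    return (Qp @ kv) / norm[:, None]

def post_similarity_attention(Q, K, V, g=np.square):
    # Variant in the spirit described above: the nonlinearity g acts on the pairwise
    # similarities Q K^T themselves (materialized here in O(n^2) for clarity).
    S = g(Q @ K.T)
    S = S / S.sum(axis=1, keepdims=True)
    return S @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(8, 4)) for _ in range(3))
print(kernel_linear_attention(Q, K, V).shape, post_similarity_attention(Q, K, V).shape)
```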
The paper introduces Seq2Seq2Seq, a novel lossless compression method using a T5 language model architecture trained with reinforcement learning to compress data into discrete token sequences. This approach preserves the token-based structure of the original data, unlike autoencoders that use continuous latent spaces, leading to improved compression ratios. The model is trained using an off-policy reinforcement learning algorithm to optimize sequence length for minimal redundancy.
Introduces Seq2Seq2Seq, a lossless compression method that leverages reinforcement learning to train a T5 language model to compress data into discrete token sequences, preserving the original token structure.
The paper introduces Moonshine v2, an ergodic streaming encoder ASR model designed for latency-critical speech applications, particularly on resource-constrained edge devices. It addresses the latency issues of full-attention Transformer encoders by employing sliding-window self-attention, enabling bounded, low-latency inference while maintaining strong local context. Experiments demonstrate that Moonshine v2 achieves state-of-the-art word error rates on standard benchmarks, matching the accuracy of models six times larger while running significantly faster.
Introduces an ergodic streaming encoder ASR model, Moonshine v2, that uses sliding-window self-attention to achieve low-latency and high accuracy for on-device speech recognition.
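A minimal sketch of the sliding-window self-attention pattern described here — a generic banded causal mask, not Moonshine v2's actual encoder; window size and names are assumptions:

```python
import numpy as np

def sliding_window_attention(Q, K, V, w):
    # Each position attends only to itself and the previous w - 1 positions, so the
    # cost per emitted frame stays bounded no matter how long the stream runs.
    n, d = Q.shape
    scores = (Q @ K.T) / np.sqrt(d)
    idx = np.arange(n)
    mask = (idx[None, :] <= idx[:, None]) & (idx[:, None] - idx[None, :] < w)
    scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(16, 8)) for _ in range(3))
print(sliding_window_attention(Q, K, V, w=4).shape)  # (16, 8)
```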
This paper investigates the impact of differential privacy (DP) mechanisms, namely gradient clipping and noise injection, on firing rate statistics within federated spiking neural networks (SNNs). The study demonstrates that DP significantly perturbs firing rates, leading to rate shifts, attenuated aggregation, and unstable client selection in a speech recognition task under non-IID data. The authors further link these rate shifts to sparsity and memory usage, providing insights into the trade-offs between privacy and performance in rate-based federated neuromorphic learning.
Quantifies the sensitivity of firing rate-based federated spiking neural networks to differential privacy mechanisms, revealing specific impacts on rate statistics, aggregation, and client selection.
This paper introduces a continuous learning architecture for edge-based malware detection that leverages LoRA adapters to enable local adaptation and global knowledge sharing in resource-constrained environments. The approach fine-tunes lightweight transformer models (DistilBERT, DistilGPT-2, TinyT5) locally on edge devices and aggregates/redistributes only the LoRA modules, avoiding the exchange of raw data. Experiments on Edge-IIoTset and TON-IoT datasets demonstrate that this LoRA-based exchange improves accuracy by 20-25% when encountering unseen attacks, while maintaining stable performance and adding minimal overhead to model size.
Proposes a parameter-efficient continuous learning framework for edge-based malware detection that uses LoRA to facilitate knowledge sharing between edge devices without transmitting raw data.
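A hedged sketch of exchanging only LoRA modules between devices; the aggregation rule shown (plain averaging of the low-rank A/B factors, with hypothetical layer names) is an assumption, not necessarily the paper's protocol.

```python
import torch

def aggregate_lora_adapters(client_adapters):
    """Average per-layer LoRA factors collected from edge devices.

    client_adapters: list of dicts, layer name -> {"A": Tensor, "B": Tensor}.
    Only these small matrices travel over the network; raw traffic data never leaves a device.
    """
    merged = {}
    for name in client_adapters[0]:
        merged[name] = {
            "A": torch.stack([c[name]["A"] for c in client_adapters]).mean(dim=0),
            "B": torch.stack([c[name]["B"] for c in client_adapters]).mean(dim=0),
        }
    return merged

# Hypothetical usage: two clients sharing rank-8 adapters for one attention projection.
clients = [
    {"attn.q_proj": {"A": torch.randn(8, 768), "B": torch.randn(768, 8)}}
    for _ in range(2)
]
print(aggregate_lora_adapters(clients)["attn.q_proj"]["A"].shape)  # torch.Size([8, 768])
```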
This paper introduces an enhanced anonymity architecture based on the Loopix mix-network, tailored for the challenges of LEO satellite constellations and mixed-trust environments. The architecture incorporates a multi-path transport protocol using (n, k) erasure codes for reliability, a computationally efficient Private Information Retrieval (PIR) protocol for route discovery, and adaptive, centrality-based delay strategies to mitigate topological bias. Packet-level simulations validate the architecture, demonstrating near-zero message loss with the multi-path transport and quantifying the overhead of the PIR protocol, showing a practical anonymity-to-latency trade-off.
Introduces a novel anonymity architecture for LEO satellite constellations that integrates multi-path transport, PIR-based route discovery, and adaptive delay strategies to enhance reliability and privacy.
This paper presents a production-grade architecture for a distributed rate limiting system using Redis and Lua scripting, focusing on the trade-offs between accuracy and memory cost. It compares the Rolling Window algorithm against Token Bucket and Fixed Window algorithms, showing higher accuracy at a manageable memory overhead. The system employs a three-layer architecture for managing and updating rate-limiting rules, deployed on a Redis Cluster for availability and scalability.
Quantifies the accuracy and memory cost trade-off of the Rolling Window rate limiting algorithm compared to Token Bucket and Fixed Window algorithms within a production system.
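As a rough illustration of the Rolling Window idea, here is a common Redis sorted-set pattern written in Python with redis-py; the paper's production system implements the check atomically in Lua and layers rule management on top, which this sketch omits. It also makes the memory cost visible: one sorted-set member is kept per request inside the window.

```python
import time
import uuid

import redis  # assumes a reachable Redis instance

def allow_request(r, key, limit, window_seconds):
    """Rolling-window check: one sorted-set member per request, scored by timestamp."""
    now = time.time()
    pipe = r.pipeline()
    pipe.zremrangebyscore(key, 0, now - window_seconds)  # evict requests outside the window
    pipe.zcard(key)                                      # count what remains
    _, count = pipe.execute()
    if count >= limit:
        return False
    r.zadd(key, {uuid.uuid4().hex: now})                 # record this request
    r.expire(key, int(window_seconds) + 1)               # bound memory for idle keys
    return True

# Hypothetical usage: at most 100 requests per 60 s for one client key.
# allow_request(redis.Redis(), "rate:client42", limit=100, window_seconds=60)
```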
The paper introduces RI-Mamba, a rotation-invariant state-space model for text-to-shape retrieval that addresses the limitations of existing methods in handling objects with arbitrary orientations and diverse categories. RI-Mamba disentangles pose from geometry using global and local reference frames and Hilbert sorting to create rotation-invariant token sequences. The model incorporates orientational embeddings via feature-wise linear modulation and employs cross-modal contrastive learning with automated triplet generation for scalable training, achieving state-of-the-art results on the OmniObject3D benchmark.
Introduces a novel rotation-invariant state-space model, RI-Mamba, for robust text-to-shape retrieval by disentangling pose from geometry and incorporating orientational embeddings.
The paper introduces ULTRA, a transformer-based recommendation architecture for Urdu, a low-resource language, to improve personalized news retrieval. ULTRA employs a dual-embedding architecture with a query-length-aware routing mechanism that directs queries to either a title/headline-level or a full-content pipeline depending on query length. Experiments on a large Urdu news corpus demonstrate that ULTRA achieves over 90% precision, outperforming single-pipeline baselines and improving recommendation relevance.
Introduces a query-adaptive dual-embedding architecture for semantic content recommendation in low-resource languages, dynamically routing queries based on length to optimize retrieval relevance.
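A minimal sketch of length-aware routing in the spirit described; the token threshold and pipeline names are hypothetical.

```python
def route_query(query: str, short_threshold: int = 4) -> str:
    """Send short queries to the title/headline pipeline, longer ones to full content."""
    n_tokens = len(query.split())
    return "title_pipeline" if n_tokens <= short_threshold else "content_pipeline"

print(route_query("cricket score"))                                              # title_pipeline
print(route_query("detailed analysis of the new budget and its effect on prices"))  # content_pipeline
```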
The paper introduces Multi-Level Compression Cross Networks (MLCC) and its multi-channel extension (MC-MLCC) to efficiently model high-order feature interactions in recommender systems. MLCC uses hierarchical compression and dynamic composition to capture feature dependencies with favorable computational complexity, while MC-MLCC decomposes feature interactions into parallel subspaces for efficient horizontal scaling. Experiments on public and industrial datasets demonstrate that MLCC and MC-MLCC outperform DLRM-style baselines, achieving up to 0.52 AUC improvement and up to 26x reduction in parameters and FLOPs, and the approach has been adopted in Bilibili's advertising system.
Introduces a novel feature interaction architecture, MLCC, that uses hierarchical compression and dynamic composition to efficiently capture high-order feature interactions, along with its multi-channel extension, MC-MLCC, for improved scalability.
The paper introduces a novel parameter-efficient fine-tuning (PEFT) method called \methodname{} that adapts large pretrained models by learning per-neuron thresholds and gains in activation space, inspired by neuromodulation. This approach aims to change the mode of computation by selecting and rescaling existing computations rather than rewriting weights, offering improved interpretability. Experiments on MNIST and rotated MNIST demonstrate that \methodname{} can improve accuracy over a frozen baseline with significantly fewer trainable parameters than LoRA, while also enabling neuron-level attribution and conditional computation.
Introduces \methodname{}, a parameter-efficient fine-tuning method that learns per-neuron thresholds and gains in activation space to adapt pretrained models by changing the mode of computation.
This paper introduces Hierarchical Sparse Autoencoders (HSAEs) to explicitly model the hierarchical relationships between features extracted from LLMs, addressing the limitation of standard SAEs that treat features in isolation. HSAEs incorporate a structural constraint loss and random feature perturbation to encourage alignment between parent and child features in the learned hierarchy. Experiments across various LLMs and layers demonstrate that HSAEs recover semantically meaningful hierarchies while preserving reconstruction fidelity and interpretability.
Introduces Hierarchical Sparse Autoencoders (HSAEs) to learn and represent the hierarchical relationships between features extracted from LLMs.
This paper addresses the instability issues in Rectified Flow (RF) inversion, which arise from accumulated approximation errors during the inversion process. They introduce Proximal-Mean Inversion (PMI), a training-free gradient correction method that stabilizes the velocity field by guiding it towards a running average of past velocities within a theoretically motivated spherical Gaussian constraint. The authors further propose mimic-CFG, a velocity correction scheme for editing tasks that interpolates between the current velocity and its projection onto the historical average.
Introduces Proximal-Mean Inversion (PMI) and mimic-CFG, two novel, training-free methods to stabilize Rectified Flow inversion and improve image reconstruction and editing fidelity.
This paper extends crosscoder model diffing to cross-architecture comparisons, enabling the unsupervised discovery of behavioral differences between LLMs with different architectures. They introduce Dedicated Feature Crosscoders (DFCs), an architectural modification to improve the isolation of unique features in one model compared to another. Applying this technique, they identify features such as CCP alignment in Qwen3-8B and Deepseek-R1-0528-Qwen3-8B, American exceptionalism in Llama3.1-8B-Instruct, and a copyright refusal mechanism in GPT-OSS-20B.
Introduces Dedicated Feature Crosscoders (DFCs), an architectural modification to enhance crosscoder model diffing for isolating features unique to individual models in cross-architecture comparisons.
This paper investigates the use of local vision-language models (VLMs) to improve fine-grained activity recognition in newborn resuscitation videos, comparing them to a TimeSformer baseline. The authors explored zero-shot VLM strategies and fine-tuned VLMs with LoRA on a simulated dataset of 13.26 hours of video. Fine-tuning a local VLM with LoRA achieved an F1 score of 0.91, outperforming the TimeSformer baseline (0.70), suggesting the potential of VLMs for this task.
Demonstrates that fine-tuning local vision-language models with LoRA can significantly improve activity recognition in newborn resuscitation videos compared to a TimeSformer baseline.
This paper introduces Microarchitecture Cliffs, a benchmark generation methodology to identify and attribute microarchitectural mismatches between architectural simulators and RTL implementations for model calibration. The Cliff methodology generates benchmarks that isolate individual microarchitectural features, enabling precise attribution of behavioral differences. Applying this methodology to calibrate XS-GEM5 against XS-RTL, the authors reduced performance error on Cliff benchmarks from 59.2% to 1.4% and improved performance prediction accuracy on SPEC2017 benchmarks.
Introduces a novel benchmark generation methodology, Microarchitecture Cliffs, for isolating and attributing microarchitectural discrepancies between simulators and RTL implementations, significantly improving simulator calibration accuracy.
This paper introduces the Task-Amortized Variational Autoencoder (TAVAE), a generative model of V1 activity, to investigate how task-specific priors are learned and deployed in the visual cortex. TAVAE extends the VAE framework to efficiently acquire new tasks by reusing previously learned representations, allowing for flexible adaptation of priors. By comparing TAVAE's posterior distributions with large-scale V1 recordings from mice performing a discrimination task, the study demonstrates that the visual system can rapidly learn and utilize task-specific contextual priors, reflected in bimodal response profiles when task statistics are violated.
Introduces the Task-Amortized Variational Autoencoder (TAVAE), a novel VAE architecture that enables efficient learning of task-specific priors by amortizing learning across tasks.
This paper addresses the challenge of unreliable read/write operations in Antiferromagnetic Tunnel Junction (AFMTJ) memories due to their ultrafast dynamics and low tunnel magnetoresistance (TMR). They propose a device-circuit co-design approach, specifically an asymmetric pulse driver (PD) for write operations and a self-timed sense amplifier (STSA) with dynamic trip-point tuning for read operations. Simulation results demonstrate improved read/write yield under process, voltage, and temperature (PVT) variations and 3D integration parasitics compared to standard MRAM front-ends, while preserving AFMTJ latency and energy benefits.
Introduces a device-circuit co-designed read/write interface, comprising an asymmetric pulse driver and a self-timed sense amplifier with dynamic trip-point tuning, to enhance the robustness of AFMTJ memories under realistic operating conditions.
The paper introduces WaveFormer, a transformer architecture tailored for biomedical signal classification, addressing limitations of standard transformers in capturing multi-scale frequency patterns in long sequences. WaveFormer incorporates wavelet decomposition in both the embedding construction via multi-channel DWT and positional encoding via Dynamic Wavelet Positional Encoding (DyWPE). Experiments across eight datasets for human activity recognition and brain signal analysis demonstrate WaveFormer's competitive performance by effectively integrating frequency-domain information.
Introduces a novel transformer architecture, WaveFormer, that integrates wavelet decomposition into both the embedding and positional encoding stages to improve biomedical signal classification.
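Not WaveFormer's embedding module itself, but a sketch of the multi-level DWT it builds on, using PyWavelets; the wavelet family and decomposition level are assumptions.

```python
import numpy as np
import pywt

def dwt_channel_features(signal, wavelet="db4", level=3):
    """Decompose one signal channel into multi-scale wavelet coefficients.

    Returns [cA_level, cD_level, ..., cD_1]; an embedding layer could project each
    band and combine it with (or substitute it for) raw-sample embeddings.
    """
    return pywt.wavedec(signal, wavelet=wavelet, level=level)

x = np.sin(np.linspace(0, 8 * np.pi, 512)) + 0.1 * np.random.default_rng(0).normal(size=512)
print([c.shape for c in dwt_channel_features(x)])
```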
This paper introduces a reciprocal-space generative pipeline for crystalline materials, representing crystals via a truncated Fourier transform of the species-resolved unit-cell density. This Fourier representation inherently handles periodic boundary conditions and crystallographic symmetries, while also supporting variable atomic multiplicities. The pipeline is instantiated using a transformer variational autoencoder and a latent diffusion model, demonstrating effective reconstruction and unconditional generation of crystal structures.
Introduces a novel reciprocal-space generative pipeline using Fourier transforms to represent and generate crystalline materials, inherently addressing periodicity, symmetry, and variable atomic multiplicities.
This paper investigates in-context learning in LLMs by framing it as Gaussian Process (GP) regression, using controlled experiments with function samples drawn from known GP priors. They compare LLM prediction error against empirical GP-regression (lower bound) and 1-NN (upper bound) baselines, finding that LLM learning curves approach the GP lower bound with increasing demonstrations. The authors also analyze LLM inductive biases via likelihood analysis, revealing a preference for less smooth GP kernels, and demonstrate that post-training can shift these biases to improve sample efficiency on smoother kernels.
Quantifies the extent to which LLMs behave like GP learners and provides methods for steering their inductive biases for continuous function learning tasks.
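A minimal sketch of the bracketing baselines described, on synthetic data drawn from a known RBF prior; the kernel choice and split are illustrative, and the LLM itself is of course not reproduced here.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 40)[:, None]
K = RBF(length_scale=1.0)(X) + 1e-8 * np.eye(len(X))
y = rng.multivariate_normal(np.zeros(len(X)), K)         # one function sampled from the GP prior

idx = rng.permutation(len(X))
tr, te = idx[:30], idx[30:]
gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0)).fit(X[tr], y[tr])
nn = KNeighborsRegressor(n_neighbors=1).fit(X[tr], y[tr])

gp_mse = np.mean((gp.predict(X[te]) - y[te]) ** 2)       # "lower bound" reference
nn_mse = np.mean((nn.predict(X[te]) - y[te]) ** 2)       # "upper bound" reference
print(f"GP MSE {gp_mse:.4f}  vs  1-NN MSE {nn_mse:.4f}")
```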
The paper introduces LRBTC, a modular LLM and VLM-driven architecture for quality control in pharmaceutical content, addressing the need for scalable and verifiable validation in regulated domains. LRBTC employs a Student-Teacher dual model architecture combined with a human-in-the-loop workflow and waterfall rule filtering. The approach achieves significant improvements on AIReg-Bench (83.0% F1, 97.5% recall) and CSpelling (26.7% accuracy improvement), demonstrating its effectiveness in reducing missed violations and improving content quality.
Introduces LRBTC, a novel LLM and VLM-driven quality control architecture that leverages a Student-Teacher dual model and HITL workflow for pharmaceutical content optimization.
This paper introduces a Collaborative Intrusion Detection System (CIDS) framework that dynamically optimizes the allocation of intrusion detectors across nodes in a layered network based on available resources and data types. The framework adapts to changing operational scenarios by reconfiguring detectors to maintain an optimal configuration without requiring heavy computation, making it suitable for edge device deployment. The evaluation, conducted using distributed datasets including a novel dataset based on a cyberattack targeting a ground drone, demonstrates the framework's ability to achieve adaptive and efficient intrusion detection.
Introduces a resource-aware CIDS framework that dynamically optimizes detector allocation in layered networks for efficient intrusion detection in resource-constrained environments.
The authors extend the Puzzle post-training neural architecture search framework to optimize the gpt-oss-120B model, creating gpt-oss-puzzle-88B, by combining heterogeneous MoE expert pruning, selective attention replacement, FP8 quantization, and post-training reinforcement learning. This optimized model achieves significant per-token throughput speedups (up to 2.82X on a single H100 GPU) while maintaining or slightly exceeding the parent model's accuracy across various benchmarks. The paper advocates for request-level efficiency metrics to account for varying token counts and demonstrates that gpt-oss-puzzle-88B improves request-level efficiency by up to 1.29X.
Introduces a pipeline combining heterogeneous MoE expert pruning, selective attention replacement, FP8 quantization, and post-training reinforcement learning within the Puzzle framework to optimize large language models for inference.
The paper introduces DeepGen 1.0, a 5B parameter unified multimodal model for image generation and editing, designed to be lightweight and efficient compared to larger models. To enhance semantic understanding in the compact model, they propose Stacked Channel Bridging (SCB) to extract and fuse hierarchical features from VLMs with learnable 'think tokens'. They also employ a three-stage data-centric training strategy, including alignment pre-training, joint supervised fine-tuning, and reinforcement learning with MR-GRPO, achieving state-of-the-art performance on benchmarks like WISE and UniREditBench while using only 50M training samples.
Introduces Stacked Channel Bridging (SCB), a novel deep alignment framework that extracts hierarchical features from multiple VLM layers and fuses them with learnable 'think tokens' to improve the generative backbone's semantic understanding and fine-grained control.
The paper introduces SParse Expert Synchronization (SPES), a decentralized training framework for Mixture-of-Experts (MoE) LLMs that reduces memory footprint by training only a subset of experts per node and periodically synchronizing them. This approach addresses the GPU memory limitations of existing decentralized training methods, which still require training the entire model on each node. The authors demonstrate that SPES enables training of 2B, 7B, and 9B parameter MoE models on resource-constrained hardware, achieving performance comparable to centrally trained LLMs with similar computational budgets.
Introduces SParse Expert Synchronization (SPES), a memory-efficient decentralized training framework that enables pretraining large MoE language models on distributed GPUs with limited memory.
The paper introduces LUVE, a latent-cascaded framework for ultra-high-resolution (UHR) video generation that tackles challenges in motion modeling, semantic planning, and detail synthesis. LUVE uses a three-stage architecture: low-resolution motion generation, latent upsampling, and high-resolution content refinement with dual frequency experts. Experiments demonstrate that LUVE achieves superior photorealism and content fidelity in UHR video generation compared to existing methods.
Introduces a novel latent-cascaded architecture with dual-frequency experts for generating ultra-high-resolution videos, improving both photorealism and content fidelity.
The paper introduces Variance Minimisation Policy Optimisation (VMPO) for diffusion alignment, framing the process as Sequential Monte Carlo and minimizing the variance of log importance weights instead of using a KL divergence objective. This approach is motivated by the SMC interpretation of diffusion alignment where the denoising model acts as a proposal and reward guidance induces importance weights. The authors demonstrate that minimizing the variance objective leads to the reward-tilted target distribution and recovers existing KL-based alignment methods under specific conditions, while also suggesting novel alignment strategies.
Introduces Variance Minimisation Policy Optimisation (VMPO) as a novel objective for diffusion alignment, minimizing the variance of log importance weights within an SMC framework.
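Schematically, and following only the summary above (notation ours), the objective minimizes the variance of the log importance weights between the denoising proposal $q_\theta$ and the reward-tilted target:

$$ \mathcal{L}(\theta) \;=\; \mathrm{Var}_{x \sim q_\theta}\!\big[\log w_\theta(x)\big], \qquad w_\theta(x) \;\propto\; \frac{p_{\text{base}}(x)\,\exp\!\big(\beta\, r(x)\big)}{q_\theta(x)}, $$

where $r$ is the reward and $\beta$ a tilt strength; the variance vanishes exactly when $q_\theta$ matches the tilted target, since the log weights are then constant.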
The paper introduces Categorical Flow Maps, a flow-matching method designed for fast, few-step generation of categorical data using self-distillation. By defining a continuous flow map towards the simplex, the method transports probability mass to a predicted endpoint, enabling the use of distillation techniques and a novel endpoint consistency objective. Experiments demonstrate state-of-the-art few-step generation performance across images, molecular graphs, and text, even achieving strong results in single-step generation.
Introduces a continuous flow-matching formulation for categorical data generation that enables self-distillation and endpoint consistency training, leading to accelerated sampling.
The paper introduces a novel approach for irregular time series modeling by replacing Neural ODEs with a linear damped harmonic oscillator analogy that admits a closed-form solution, thereby avoiding computationally expensive numerical solvers. Keys and values are modeled as damped, driven oscillators, and the query is expanded in a sinusoidal basis, with attention modeled as a resonance phenomenon. The method is proven to maintain the universal approximation property of continuous-time attention and achieves state-of-the-art performance on irregular time series benchmarks with significant speedups.
Introduces a computationally efficient irregular time series model based on damped harmonic oscillators with closed-form solutions, demonstrating state-of-the-art performance and theoretical guarantees.
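For reference, the closed form that a damped-oscillator parameterization can exploit instead of a numerical ODE solver (this is the standard underdamped linear oscillator result, given as background rather than the paper's exact model):

$$ \ddot{x} + 2\gamma\,\dot{x} + \omega_0^2\,x = f(t), \qquad x(t) = e^{-\gamma t}\big(A\cos\omega_d t + B\sin\omega_d t\big) + x_p(t), \quad \omega_d = \sqrt{\omega_0^2 - \gamma^2}, $$

with $A, B$ fixed by the initial conditions and $x_p$ a particular solution for the drive $f$. The state at any query time can then be evaluated directly, with no step-by-step integration between irregular observation times.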
This paper introduces an ML-driven physical synthesis framework for RF circuits that addresses limitations of prior ML approaches by incorporating EM-accurate component models and routing capabilities. They trained a neural network on a large dataset of inductor geometries to predict Q-factor with high accuracy, enabling gradient-based layout optimization. The framework integrates a P-Cell optimizer and a placement/routing engine with EM spacing rules, resulting in DRC-aware GDSII layouts.
Introduces an end-to-end ML-driven framework for RF physical synthesis that generates manufacturable GDSII layouts by integrating EM-aware neural inductor modeling with intelligent placement and routing.
The paper introduces EqDeepRx, a deep-learning-aided MIMO receiver that combines linear processing with learned components for improved scaling and generalization. EqDeepRx employs a shared-weight DetectorNN operating on individual spatial streams to achieve near-linear complexity scaling with multiplexing order, and uses a DenoiseNN to enhance channel estimation. End-to-end simulations demonstrate that EqDeepRx achieves improved error rate and spectral efficiency compared to conventional receivers while maintaining low complexity and supporting various MIMO configurations without retraining.
Introduces a novel deep-learning-aided MIMO receiver architecture, EqDeepRx, that achieves near-linear complexity scaling with multiplexing order through a shared-weight DetectorNN and enhances generalization via a DenoiseNN.
The paper introduces U-Former ODE (UFO), a novel architecture for probabilistic forecasting of irregular time series data that combines U-Nets, Transformers, and Neural CDEs. UFO enables parallelizable computation and global receptive fields, addressing the scalability limitations of existing Neural CDE approaches. Experiments on five benchmarks demonstrate that UFO outperforms ten state-of-the-art baselines in predictive accuracy and achieves up to 15x faster inference, particularly on long and multivariate sequences.
Introduces a fully causal, parallelizable architecture, U-Former ODE (UFO), that integrates U-Nets, Transformers, and Neural CDEs for efficient and accurate probabilistic forecasting of irregular time series.
The paper introduces Trans-Chunk BiMamba (TC-BiMamba), a novel architecture for unified streaming and non-streaming automatic speech recognition (ASR) that addresses the limitations of existing BiMamba-based streaming methods which are restricted to fixed chunk sizes. TC-BiMamba employs a trans-chunk mechanism to train bidirectional sequences offline with dynamic chunk sizes, enabling a single model to handle both offline and streaming decoding with varying latency requirements. Experiments demonstrate that TC-BiMamba achieves a 1.3x training speedup, reduces memory consumption by 50%, and improves ASR performance compared to chunk-wise processing, while also outperforming U2++ and matching LC-BiMamba with a smaller model size.
Introduces the Trans-Chunk BiMamba (TC-BiMamba) architecture, enabling efficient dynamic chunk size training for unified streaming and non-streaming ASR.
This paper introduces a technical curriculum designed to enhance AI literacy within the language and translation (L&T) industry, covering vector embeddings, neural networks, tokenization, and transformer networks. The curriculum aims to cultivate computational thinking, algorithmic awareness, and agency among L&T professionals to improve their digital resilience. Evaluation in an MA course at TH Koeln suggests the curriculum's effectiveness, while also highlighting the need for additional lecturer support to maximize learning outcomes.
Proposes and evaluates a technical curriculum focused on language-oriented AI to improve AI literacy and digital resilience in the language and translation industry.
The paper analyzes Langevin dynamics with noise projected onto directions orthogonal to an isometric group action, a model relevant to understanding symmetry effects in stochastic gradient descent for over-parameterized models. The key finding is that when initial and target densities are group-invariant, this projected Langevin dynamics is equivalent in law to standard Langevin dynamics with isotropic diffusion but with an additional drift term related to the negative log volume of the group orbit. This equivalence is proven through a coupling argument involving a third process on the group, identifying the drift as the mean curvature of the orbits, thus revealing a novel form of implicit regularization.
Establishes an equivalence between Langevin dynamics with projected noise and standard Langevin dynamics with an additional drift term proportional to the negative log volume of the group orbit, revealing a novel form of implicit regularization.
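In symbols, schematically following the summary above (our notation and our reading of the sign conventions): the projected dynamics

$$ dX_t = -\nabla U(X_t)\,dt + \sqrt{2}\,P_{\perp}(X_t)\,dW_t, $$

with $P_{\perp}$ the projection onto directions orthogonal to the group orbit, is claimed to be equal in law, for $G$-invariant initial and target densities, to isotropic Langevin dynamics with an extra orbit-volume drift,

$$ dX_t = -\nabla\Big(U(X_t) + \log \operatorname{vol}\big(G\cdot X_t\big)\Big)\,dt + \sqrt{2}\,dW_t, $$

where the additional term $-\nabla \log \operatorname{vol}(G\cdot X_t)$ is identified with the mean curvature of the orbits.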
This paper introduces an energy-aware spike budgeting framework for continual learning in spiking neural networks (SNNs) to address catastrophic forgetting while optimizing for energy efficiency. The framework combines experience replay, learnable LIF neuron parameters, and an adaptive spike scheduler to enforce dataset-specific energy constraints during training. Results show that spike budgeting acts as a sparsity-inducing regularizer on frame-based datasets, improving accuracy and reducing spike rates, while controlled budget relaxation enables accuracy gains on event-based datasets.
Introduces an energy-aware spike budgeting framework that adaptively controls spike rates during continual learning in SNNs to improve both accuracy and energy efficiency across frame-based and event-based neuromorphic vision datasets.
The paper introduces the Prototype Transformer (ProtoT), an autoregressive language model architecture that uses prototypes (parameter vectors) instead of self-attention to improve interpretability. ProtoT establishes two-way communication between the input sequence and the prototypes, causing the prototypes to capture nameable concepts during training and creating interpretable communication channels. Experiments demonstrate that ProtoT scales linearly with sequence length, performs well on text generation and downstream tasks (GLUE), and exhibits robustness to input perturbations while providing interpretable pathways for understanding robustness and sensitivity.
Introduces the Prototype Transformer, a novel autoregressive language model architecture designed for interpretability by using prototypes to capture nameable concepts and create interpretable communication channels.
This paper explores the use of Mamba-2 hybrid operators within Tiny Recursive Models (TRM) for abstract reasoning, motivated by Mamba-2's inherent iterative refinement properties. By replacing Transformer blocks in TRM with Mamba-2 hybrids while maintaining parameter parity, the authors demonstrate improved performance on the ARC-AGI-1 benchmark. Specifically, the Mamba-2 hybrid TRM achieves a +2.0% improvement in pass@2 and a +4.75% improvement in pass@100, suggesting enhanced candidate coverage.
Demonstrates that Mamba-2 hybrid operators can effectively replace Transformer blocks within Tiny Recursive Models, leading to improved performance on abstract reasoning tasks.
This paper investigates the phenomenon of "token overflow" in soft compression architectures for retrieval-augmented generation (RAG), where compressed token representations lose task-relevant information. They propose a methodology to characterize and detect token overflow, evaluating it within the xRAG framework. Their key finding is that lightweight probing classifiers, leveraging both query and context xRAG representations, achieve an average AUC-ROC of 0.72 in detecting overflow across HotpotQA, SQuADv2, and TriviaQA datasets, demonstrating the importance of query-aware detection.
Introduces a methodology using lightweight probing classifiers to detect token overflow in compressed token representations for retrieval-augmented generation by leveraging query and context information.
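A toy sketch of such a query-aware probe, on entirely synthetic data and with an assumed feature construction (concatenating query and compressed-context embeddings), just to show the shape of the approach:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d = 2000, 64
query_emb = rng.normal(size=(n, d))      # stand-in for query representations
context_emb = rng.normal(size=(n, d))    # stand-in for compressed (xRAG-style) context tokens

# Synthetic "overflow" labels that depend on both views, so a joint probe can recover them.
w_q, w_c = rng.normal(size=d), rng.normal(size=d)
overflow = ((query_emb @ w_q + context_emb @ w_c + 0.5 * rng.normal(size=n)) > 0).astype(int)

X = np.concatenate([query_emb, context_emb], axis=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, overflow, test_size=0.25, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe AUC-ROC:", roc_auc_score(y_te, probe.predict_proba(X_te)[:, 1]))
```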
The paper introduces Hi-SAM, a novel multi-modal recommendation framework designed to address limitations in semantic ID-based approaches, specifically suboptimal tokenization and architecture-data mismatch. Hi-SAM employs a Disentangled Semantic Tokenizer (DST) that uses geometry-aware alignment and coarse-to-fine quantization to separate shared and modality-specific semantics, and a Hierarchical Memory-Anchor Transformer (HMAT) that incorporates hierarchical positional encoding and anchor tokens to better model user-item interactions. Experiments on real-world datasets and a large-scale social platform demonstrate that Hi-SAM outperforms state-of-the-art baselines, particularly in cold-start scenarios, achieving a 6.55% improvement in a core online metric.
Introduces a hierarchical structure-aware multi-modal framework, Hi-SAM, that disentangles cross-modal semantics and modality-specific details during tokenization and incorporates hierarchical positional encoding within a transformer architecture for improved recommendation performance.
The paper introduces PrefillShare, an algorithm for sharing the prefill stage across multiple language models in disaggregated serving environments to reduce redundant computation and KV cache storage. PrefillShare factorizes models into prefill and decode modules, freezes the prefill module, and fine-tunes only the decode module, enabling multiple models to share a prefill module and its KV cache. Experiments demonstrate that PrefillShare achieves comparable accuracy to full fine-tuning while significantly improving latency (4.5x lower p95) and throughput (3.9x higher) in multi-model agent workloads.
Introduces PrefillShare, a novel algorithm that enables efficient sharing of the prefill stage and KV cache across multiple language models in a disaggregated serving system.
The paper introduces Empirical Gaussian Processes (GPs), a framework for constructing data-driven GP priors by empirically estimating the mean and covariance functions from historical observations. This approach overcomes limitations of handcrafted kernels, enabling the prior to reflect complex covariance structures present in the data. The authors derive an Expectation-Maximization algorithm with closed-form updates for learning the GP prior from independent datasets with heterogeneous observation locations, and demonstrate competitive performance on learning curve extrapolation and time series forecasting.
Introduces Empirical GPs, a novel method for learning GP priors directly from data by estimating the mean and covariance functions, thereby improving adaptability and reducing reliance on expert-defined kernels.
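A simplified sketch of the idea for the shared-grid case; the paper's EM algorithm additionally handles heterogeneous observation locations, which this omits.

```python
import numpy as np

def empirical_gp_prior(curves):
    """Estimate a GP prior's mean and covariance functions from historical series.

    curves: array of shape (n_series, n_locations), all observed on one shared grid.
    """
    mean = curves.mean(axis=0)
    centered = curves - mean
    cov = centered.T @ centered / (len(curves) - 1)
    return mean, cov

def posterior_mean(mean, cov, obs_idx, obs_vals, query_idx, noise=1e-6):
    # Standard GP conditioning, but with the empirically estimated prior.
    K_oo = cov[np.ix_(obs_idx, obs_idx)] + noise * np.eye(len(obs_idx))
    K_qo = cov[np.ix_(query_idx, obs_idx)]
    return mean[query_idx] + K_qo @ np.linalg.solve(K_oo, obs_vals - mean[obs_idx])

rng = np.random.default_rng(0)
history = rng.normal(size=(200, 50)).cumsum(axis=1)   # synthetic historical curves
mu, Sigma = empirical_gp_prior(history)
new_curve = rng.normal(size=50).cumsum()
print(posterior_mean(mu, Sigma, obs_idx=list(range(5)), obs_vals=new_curve[:5], query_idx=[49]))
```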
The paper introduces AssetFormer, an autoregressive Transformer model for generating modular 3D assets from text descriptions, addressing the need for high-quality, diverse assets in the digital industry. AssetFormer models the generation of 3D assets as a sequence of primitives with constrained design parameters, adapting module sequencing and decoding techniques from language models. Experiments using real-world modular assets demonstrate the model's effectiveness in streamlining asset creation for professional development and UGC scenarios.
Introduces an autoregressive Transformer-based architecture, AssetFormer, for generating modular 3D assets from textual descriptions by modeling the asset as a sequence of primitives.
This paper addresses the sample inefficiency of off-policy reinforcement learning by constraining the initial representations of input data to alleviate distribution shift. They introduce a novel framework, CIR, incorporating a Tanh activation function in the initial layer, normalization techniques, skip connections, and convex Q-learning. Theoretical analysis demonstrates the convergence of temporal difference learning with the Tanh function under linear function approximation, and empirical results show CIR achieves strong performance on continuous control tasks.
Introduces a Constrained Initial Representations (CIR) framework that improves off-policy RL sample efficiency by constraining initial representations using a Tanh activation, normalization, skip connections, and convex Q-learning.
This paper introduces U-DAVI, an uncertainty-aware amortized variational inference framework for image reconstruction that leverages diffusion priors. By injecting spatially adaptive perturbations to measurements during training, guided by uncertainty estimates, U-DAVI focuses learning on uncertain regions, improving reconstruction quality. Experiments on deblurring and super-resolution tasks demonstrate that U-DAVI achieves competitive or superior performance compared to existing diffusion-based methods, while maintaining computational efficiency.
Introduces an uncertainty-aware training strategy for amortized variational inference with diffusion priors, enabling improved image reconstruction by focusing learning on uncertain regions.

