Tucker Attention squeezes an order of magnitude more parameter efficiency out of attention layers, while unifying and simplifying Group Query Attention, Multi-Head Latent Attention, and standard Multi-Head Attention.
Forget hand-crafted features: DistilBERT can automatically identify parallelizable loops in code with >99% accuracy, opening the door to more efficient automatic parallelization.
Quantum chemistry's density matrix approach reveals interpretable early warning signals of phase transitions in deep learning, from grokking to emergent misalignment.
Chess transformers trained solely on move sequences face a "dual-capability bottleneck" where excelling at both state tracking and decision-making requires carefully balancing data diversity and quality, a tension that simple scaling cannot resolve.
Multimodal AI models learn to be lazy, often ignoring entire modalities, and current active learning methods don't fix the problem.
Radically simpler train loading plans are now possible by implicitly modeling rehandle costs, slashing the complexity of optimization problems.
By mixing flows and using a teacher-student approach, MMAE learns to classify encrypted traffic more accurately than previous masked autoencoders.
By disentangling headers and payloads with a Mixture-of-Experts architecture, TrafficMoE achieves state-of-the-art encrypted traffic classification, proving that heterogeneity-aware modeling is crucial for extracting discriminative features from noisy, encrypted data.
Target networks don't have to be a necessary evil: aligning online and target network estimates can actually *accelerate* RL convergence.
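A minimal numpy sketch of the general idea, not the paper's method: one common way to keep target estimates aligned with the online network is a soft (Polyak) update every step rather than infrequent hard copies. The function name `polyak_align` and the toy parameter vectors are illustrative assumptions.

```python
import numpy as np

def polyak_align(target, online, tau=0.05):
    """One soft-alignment step: move the target parameters a fraction
    tau toward the online network, so the two estimates never drift far
    apart the way stale hard copies do."""
    return (1.0 - tau) * target + tau * online

# Toy "parameters": the target converges geometrically to the online net.
online = np.array([1.0, -2.0])
target = np.zeros(2)
for _ in range(200):
    target = polyak_align(target, online, tau=0.05)
```

After 200 steps the gap has shrunk by a factor of 0.95**200, so the two networks are effectively aligned.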
Forget ensembles and retraining: estimate LLM uncertainty with just a single forward-backward pass by assuming parameter covariance isotropy.
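A toy numpy sketch of the isotropy assumption, not the paper's implementation: under an isotropic Gaussian over parameters, Sigma = sigma^2 * I, the delta method gives Var[f(x)] ~= sigma^2 * ||grad_theta f||^2, so one gradient (one forward-backward pass) is all the estimate needs. The function name and the linear toy model are illustrative assumptions.

```python
import numpy as np

def isotropic_laplace_var(grad, sigma2):
    """Delta-method output variance under an isotropic parameter
    covariance Sigma = sigma2 * I: Var[f] ~= sigma2 * ||grad_theta f||^2."""
    g = np.ravel(grad)
    return sigma2 * float(g @ g)

# Toy linear model f(x; w) = w @ x, so grad_w f = x.
x = np.array([1.0, 2.0, 2.0])
var = isotropic_laplace_var(x, sigma2=0.01)  # 0.01 * (1 + 4 + 4)
```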
You can shrink a spacecraft anomaly detection model by 97% and still catch almost all the problems.
Real-time vocal denoising is now possible with deep learning, achieving significant SNR improvements at under 10ms latency.
Grokking isn't just about local circuits or optimization tricks, but a global structural collapse of redundant model manifolds, revealing a deep connection between compression and generalization.
Forget expensive finetuning: DUME dynamically combines existing expert LLMs into a powerful MoE *without* additional training, unlocking multi-domain performance at minimal cost.
LLMs can better capture human semantic similarity by predicting sets of related concepts instead of single next tokens.
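A minimal sketch of how set-valued predictions yield a similarity score, assuming a simple Jaccard overlap (the actual paper's scoring rule may differ); the concept sets below are invented examples.

```python
def set_similarity(concepts_a, concepts_b):
    """Jaccard overlap between two predicted concept sets: similarity
    comes from shared related concepts, not from a single next token."""
    a, b = set(concepts_a), set(concepts_b)
    return len(a & b) / len(a | b) if a | b else 0.0

# "cat" and "dog" share the related concepts "pet" and "animal".
sim = set_similarity({"feline", "pet", "animal"},
                     {"canine", "pet", "animal"})
```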
Now, clients can actually *verify* that their data has been removed from a federated learning model, even when the server is untrusted.
LLMs aren't the only path to vulnerability detection: a GNN-based model achieves near-parity with 100x less overhead.
Single-pixel imaging gets a deep learning boost: SISTA-Net leverages learned sparsity and hybrid CNN-VSSM architectures to achieve state-of-the-art reconstruction quality, even in noisy underwater environments.
By directly optimizing clinical dose-volume histogram (DVH) metrics, this method produces 3D dose predictions that more closely align with clinical treatment planning criteria than traditional voxel-wise approaches.
Forget expensive labels: CoRe-DA leverages contrastive learning and self-training to achieve state-of-the-art surgical skill assessment across diverse surgical environments without requiring target domain annotations.
Diffusion models can beat discriminative classifiers at facial expression recognition, but only with a dynamically adjusted margin loss that accounts for per-sample difficulty.
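One plausible reading of a difficulty-aware margin, sketched in numpy as an assumption rather than the paper's exact loss: shrink the true-class logit by a margin scaled with (1 - current confidence), so hard samples receive a larger margin than easy ones.

```python
import numpy as np

def adaptive_margin_logits(logits, labels, base_margin=1.0):
    """Subtract a per-sample margin from the true-class logit, scaled by
    difficulty (1 - softmax confidence in the true class)."""
    z = logits - logits.max(axis=-1, keepdims=True)
    p = np.exp(z); p /= p.sum(axis=-1, keepdims=True)
    conf = p[np.arange(len(labels)), labels]
    out = logits.astype(float).copy()
    out[np.arange(len(labels)), labels] -= base_margin * (1.0 - conf)
    return out

logits = np.array([[4.0, 0.0],   # easy sample: confident true class
                   [0.2, 0.0]])  # hard sample: near-uniform
out = adaptive_margin_logits(logits, labels=np.array([0, 0]))
margins = logits - out  # margin actually applied per sample
```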
Stop averaging prototypes blindly: FedDBP uses Fisher information to intelligently fuse local prototypes, significantly boosting performance in heterogeneous federated learning.
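A toy numpy sketch of Fisher-weighted fusion, assuming a per-dimension precision-weighted average (the function name and shapes are illustrative, not FedDBP's actual API): dimensions where a client's Fisher information is high dominate the fused prototype.

```python
import numpy as np

def fuse_prototypes(prototypes, fishers):
    """Fuse per-client class prototypes with per-dimension Fisher weights
    (a precision-weighted average) instead of a blind mean."""
    P = np.stack(prototypes)  # (clients, dim)
    F = np.stack(fishers)     # (clients, dim), nonnegative
    return (F * P).sum(axis=0) / (F.sum(axis=0) + 1e-12)

# Client 1 is confident about dim 0, client 2 about dim 1.
p1, p2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
f1, f2 = np.array([3.0, 1.0]), np.array([1.0, 3.0])
fused = fuse_prototypes([p1, p2], [f1, f2])
```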
Passive iFIR filters learned from just three minutes of robot data can dramatically outperform optimized PID controllers in velocity tracking tasks, offering a fast and stable alternative for robot control.
By optimizing PID gains with MPPI, this method achieves comparable performance to conventional MPPI with significantly fewer samples, offering a more sample-efficient approach to learning-based control.
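A self-contained sketch of MPPI over PID gains on a toy first-order plant, offered as an assumption about the mechanism rather than the paper's controller: sample gain perturbations, roll each out, and average them with softmax weights on negative cost. Plant, horizon, and hyperparameters are all invented.

```python
import numpy as np

def rollout_cost(gains, steps=200, dt=0.01, target=1.0):
    """Tracking cost of a PID controller on the toy plant x' = -x + u."""
    kp, ki, kd = gains
    x, integ, prev_e, cost = 0.0, 0.0, target, 0.0
    for _ in range(steps):
        e = target - x
        integ += e * dt
        u = kp * e + ki * integ + kd * (e - prev_e) / dt
        prev_e = e
        x += (-x + u) * dt
        cost += e * e
    return cost

def mppi_pid(base=(0.5, 0.0, 0.0), samples=64, sigma=0.3, lam=1.0, seed=0):
    """MPPI in gain space: weight sampled gain perturbations by
    exp(-cost / lam) and return the weighted update (clipped at 0)."""
    rng = np.random.default_rng(seed)
    eps = rng.normal(0.0, sigma, size=(samples, 3))
    costs = np.array([rollout_cost(np.maximum(base + e, 0.0)) for e in eps])
    w = np.exp(-(costs - costs.min()) / lam)
    w /= w.sum()
    return np.maximum(base + w @ eps, 0.0)

gains = mppi_pid()
```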
Get kilohertz-level dexterous hand teleoperation *with* formal safety guarantees, thanks to a new convex optimization approach.
Quantum circuit compilation, a major bottleneck, can be sped up by over 15x with minimal overhead using a new parallelization technique validated on 8000 large-scale, configurable random circuits.
Dataflow networks can achieve significant energy savings without sacrificing throughput by strategically powering down actors during idle periods, a balance efficiently discovered using a novel "Hop and Skip" exploration strategy.
Pinpointing performance bottlenecks in large-scale AI training just got 100x faster, thanks to a new system that watches the whole stack without slowing things down.
Achieve up to 4.17x speedup in DRL training by intelligently partitioning tasks across CPUs, FPGAs, and AI Engines on AMD Versal ACAP, demonstrating the power of hardware-aware algorithm design.
Unlock 600,000x faster TSV design by replacing computationally expensive full-wave simulations with physics-informed graph neural networks.
Forget the cold start: training transformers for protein structure prediction peaks at intermediate temperatures, revealing a sweet spot in the loss landscape.
Calculating excited states of molecules with thousands of atoms, previously a computational bottleneck, is now practical on a single GPU thanks to a new implementation of TDDFT-risp.
Scanning every token to focus attention is now passé: HISA prunes irrelevant context blocks *before* token-level scoring, slashing compute without sacrificing selection fidelity.
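A minimal numpy sketch of block-then-token selection under assumed details (mean-pooled block centroids as the coarse score; HISA's actual scoring may differ): score cheap block summaries first, keep only the top blocks, and run token-level attention inside the survivors.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def block_pruned_attention(q, K, V, block=4, keep=2):
    """Prune context blocks via coarse centroid scores, then attend only
    over tokens in the surviving blocks."""
    blocks = K.shape[0] // block
    Kb = K[:blocks * block].reshape(blocks, block, -1).mean(axis=1)
    top = np.argsort(Kb @ q)[-keep:]  # highest-scoring blocks
    idx = np.concatenate([np.arange(b * block, (b + 1) * block)
                          for b in sorted(top)])
    w = softmax(K[idx] @ q)           # token-level scoring, pruned set only
    return w @ V[idx]

# Two blocks of four tokens; only the second block matches the query.
q = np.array([1.0, 0.0])
K = np.vstack([np.tile([0.0, 1.0], (4, 1)), np.tile([1.0, 0.0], (4, 1))])
V = np.vstack([np.zeros((4, 2)), np.full((4, 2), 5.0)])
out = block_pruned_attention(q, K, V, block=4, keep=1)
```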
Forget backpropagation through time: recurrent networks already have temporal credit baked into their forward pass.
Forget painstaking hyperparameter tuning: this hypersphere parameterization lets you transfer a single learning rate across model sizes, depths, and even MoE architectures, slashing compute costs by 1.58x.
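A toy sketch of one hypersphere-style parameterization (an assumption, not the paper's exact scheme): after each SGD step, project every weight row back onto the unit hypersphere, so the update direction rather than the weight scale carries the learning signal.

```python
import numpy as np

def hypersphere_step(W, grad, lr):
    """Plain SGD step followed by projecting each row back onto the unit
    hypersphere; weight norms stay fixed regardless of model width."""
    W = W - lr * grad
    return W / np.linalg.norm(W, axis=1, keepdims=True)

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))
W /= np.linalg.norm(W, axis=1, keepdims=True)  # start on the sphere
W = hypersphere_step(W, rng.normal(size=(4, 8)), lr=0.1)
```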
Forget heuristics: this work gives provable conditions for *when* and *how* auxiliary data actually improve generalization in transfer learning.
Correcting errors early in the diffusion process matters more than fixing them later: Stepwise-Flow-GRPO leverages this insight to dramatically improve RL-based flow model training.
Unlock $\sqrt{N}$ regret in offline policy learning, even with complex policy classes, by trading off policy and environment complexity.
Backpropagation-free test-time adaptation can be both accurate and efficient: PACE achieves state-of-the-art accuracy while slashing runtime by over 50%.
Models can dynamically grow their own capacity during continual learning, adding parameters only when and where they're needed, without human intervention.
Actor-critic methods can achieve state-of-the-art sample complexity in linear MDPs *without* relying on computationally expensive implicit policies or strong assumptions about exploration.
Narrow ResNets can struggle to represent critical points in input-output mappings, effectively pushing them to infinity and hindering accurate function approximation.
Scaling laws work so well because they capture the essence of computation, not the specifics of implementation, leading to a persistent efficiency arms race.
Escape the tyranny of ill-conditioned optimization landscapes: Yau's Affine Normal Descent offers provably robust convergence by intrinsically adapting to anisotropic curvature through volume-preserving affine invariance.
Higher-order neural networks don't need hypergraphs: SHONNs unlock their power for general-purpose feedforward architectures by sidestepping stability and scaling issues.
Neural networks can turbocharge classical optimization for high-dimensional matrix estimation, achieving faster convergence without sacrificing theoretical guarantees.
Classical models of hydrogen storage in geological formations fall apart when applied to diverse samples, but this physics-informed neural network nails it, achieving R² = 0.9544.
Second-order federated learning can be made robust and practical: FedRCO overcomes instability issues and outperforms first-order methods in non-IID settings.
Forget smooth sailing: FI-KAN's fractal bases let neural networks conquer non-smooth functions and PDEs with up to 79x better accuracy.
LLMs can reason more accurately and concisely when RL is guided by token-level entropy, pinpointing and exploring "forks in the road" during the reasoning process.
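A minimal numpy sketch of locating the "forks" (function names and threshold are illustrative assumptions): compute the Shannon entropy of each next-token distribution and flag high-entropy positions, which is where exploration would be concentrated.

```python
import numpy as np

def token_entropies(logits):
    """Shannon entropy (nats) of each row's next-token distribution."""
    z = logits - logits.max(axis=-1, keepdims=True)
    p = np.exp(z); p /= p.sum(axis=-1, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

def fork_positions(logits, threshold=1.0):
    """Indices whose entropy exceeds the threshold: the 'forks in the
    road' where targeted exploration pays off."""
    return np.where(token_entropies(logits) > threshold)[0]

logits = np.array([[10.0, 0.0, 0.0, 0.0],   # near-deterministic token
                   [0.0, 0.0, 0.0, 0.0]])   # maximally uncertain token
forks = fork_positions(logits)
```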