46 papers from Tsinghua AI on Architecture Design (Transformers, SSMs, MoE)
Forget separate structure and fidelity models – Khala shows you can generate high-quality music with text-vocal alignment using a single acoustic-token hierarchy.
Margin loss fine-tuning of ECAPA-TDNNs slashes the error rate in spoken language identification by over 50%, highlighting the power of discriminative representation learning.
Instead of training separate video diffusion models for each multimodal task, UniVidX learns a single model that handles diverse pixel-aligned video generation problems.
Forget turn-based interactions: MiniCPM-o 4.5 lets you build AI that sees, hears, speaks, and *reacts* in real-time, all on a device with only 12GB of RAM.
Forget fully connected relation graphs: CasLayout's sparse relation modeling unlocks enhanced controllability and realism in 3D indoor scene synthesis.
Simple, artist-friendly quad meshes can now be automatically generated on 3D shapes using a diffusion model trained on a continuous surface representation, sidestepping the complexity of discrete mesh optimization.
Semantic priors in neural speech codecs hit a wall: their benefits plateau beyond 6 kbps, revealing a fundamental limit to improving intelligibility at higher bitrates.
Untangling task-solving skills from factual knowledge in PRAG adapters makes them play better together, boosting performance when you combine multiple documents.
Autonomous vehicles can now plan trajectories 10x faster without sacrificing performance, thanks to a novel architecture that learns complex driving behaviors in latent space during training.
By unifying generative and discriminative approaches, UniGenDet achieves superior image generation and detection, suggesting that these tasks benefit from a symbiotic relationship previously hindered by architectural divergence.
Autoregressive generative models, previously unsuitable for real-time target speaker extraction, can now achieve offline-level performance in streaming scenarios thanks to a novel chunk-wise splicing technique.
A custom-designed tendon-driven wrist, combined with a particle-spring model, enables precise and robust control of highly flexible objects like spinning handkerchiefs.
Ditching caches for compiler-managed data streams, Li Auto's M100 architecture achieves higher utilization than GPUs on autonomous driving tasks, hinting at a new path for efficient AI inference.
Agentic AI's fragility stems from relying on LLMs for system control, but Arbiter-K flips the script by using a deterministic kernel to govern the LLM, intercepting up to 95% of unsafe actions.
RL fine-tuning of discrete diffusion models can be made dramatically more stable and effective by treating the final denoised sample as the action and reconstructing trajectories using the forward diffusion process.
Autoregressive 3D layout generation can be both more physically plausible and significantly faster by repurposing existing 3D generative models.
Decoupling LLM prefill and decode across datacenters is now practical, unlocking independent scaling and resource elasticity, thanks to a system that combines KV-efficient models with intelligent request scheduling.
MLLMs don't just forget language; they also suffer from perceptual drift in cross-modal spaces. MAny offers a training-free merging strategy to fix both.
Simply plugging in RoTE, a lightweight temporal embedding module, can boost existing Transformer-based sequential recommendation models by over 20% on standard benchmarks.
LLM agent harnesses are surprisingly vulnerable, but weaving security directly into the agent lifecycle can slash attack success by 42% without sacrificing utility.
Finally, a model that speaks fluent Lottie: LottieGPT generates editable vector animations directly from text or images, opening up a new frontier for resolution-independent, compact, and semantically structured multimedia creation.
Achieve state-of-the-art object detection accuracy and efficiency by fusing RGB frames and event streams with a sparse hypergraph and a fine-grained mixture of experts, enabling real-time edge deployment.
Attention Sink, where Transformers fixate on seemingly irrelevant tokens, is more than just a quirk – it's a fundamental challenge impacting training, inference, and even causing hallucinations, demanding a systematic approach to understanding and mitigating its effects.
Ditch the slow per-scene optimization: SurfelSplat reconstructs surfaces from sparse views in under a second, matching state-of-the-art accuracy with a 100x speedup.
Millisecond-scale forecasting of reactor thermal-hydraulics, even with missing sensors, is now possible thanks to a physics-informed GNN-ODE digital twin that learns interpretable heat-transfer scaling.
Achieve state-of-the-art metal artifact reduction in CT images with MARMamba, a Mamba-based model that's both lightweight and preserves anatomical structure.
LLMs can generate syntactically valid software architectures from requirements, but their struggle with relational reasoning leads to structurally unsound designs.
Existing multimodal sentiment analysis models crumble under real-world noise, but QA-MoE leverages uncertainty to dynamically route inputs, achieving robust performance across a continuous spectrum of data quality.
Ditch static data paths: TENT dynamically slices and sprays LLM data across heterogeneous interconnects, self-healing in under 50ms and boosting throughput by up to 36%.
By intelligently injecting and removing noise, RaDAR significantly improves recommendation accuracy in sparse and noisy collaborative filtering environments.
Scaling LLM-based multi-agent systems needs not just better prompts or models but a whole new software engineering approach focused on managing runtime entropy.
Floor plan generation gets a major upgrade with HouseMind, a multimodal LLM that uses discrete room-instance tokens to achieve unprecedented geometric validity and controllability.
Cut sparse attention indexing costs by 75% without sacrificing quality by intelligently reusing indices across layers.
A compact 0.9B-parameter multimodal model, GLM-OCR, achieves state-of-the-art document understanding by predicting multiple tokens at once, boosting decoding throughput without blowing up memory.
By strategically increasing hash collisions, Nemo slashes write amplification in flash caches for tiny objects, a persistent bottleneck even with advanced SSDs.
Aura unlocks more accurate aviation time series forecasting by explicitly modeling how different types of external factors interact with temporal dynamics.
Ditch the optimization: MoRe achieves real-time 4D scene reconstruction from monocular video using a feedforward transformer that disentangles motion and structure.
By explicitly disentangling degradation and semantic features with wavelet attention, CWP-Net achieves superior all-in-one image restoration, outperforming previous methods hampered by spurious correlations and biased degradation estimation.
Generative recommendation gets a boost: APAO tackles the training-inference gap by intelligently optimizing for prefixes, leading to better candidate retention during beam search.
Get 10x faster generative image compression on GPUs with ProGIC, a lightweight RVQ codec that doesn't sacrifice perceptual quality.
LLMs can now handle autonomous driving tasks with greater precision and efficiency thanks to DriveCode, which replaces discrete number tokens with continuous embeddings.
Student's t priors in function-space Bayesian regularization yield more robust uncertainty estimates and handle distribution shifts better than Gaussian priors.
LLM serving can achieve 5.6x higher throughput without sacrificing latency by decoupling preemption granularity from scheduling frequency.
Unlock 1.7x throughput gains on multi-chip neural network accelerators by jointly optimizing the pipelining of multiple layers, a dimension previously overlooked.
Achieve scalable and consistent multi-reference image editing by dynamically serializing reference images into a coherent latent sequence, outperforming existing diffusion-based methods.
Constraining initial state representations with a simple Tanh activation and skip connections can significantly boost off-policy RL performance, rivaling more complex methods on continuous control tasks.