Stop training your image restoration models to mimic flawed ground truth; instead, explicitly optimize for perceptual quality using a plug-and-play module guided by No-Reference Image Quality Assessment.
Today's best smartphone GUI agents stumble when faced with the messy reality of personalized user workflows, achieving only limited success on a new benchmark designed to mimic real-world use.
GPT-5 can only solve 37% of PhD-level 3D geometry coding problems, suggesting AI can't reliably automate complex scientific coding tasks yet.
Current multimodal dialogue models struggle to capture the nuanced expressiveness of human interaction, but a new dataset and benchmark reveal exactly where they fall short.
StreamingVLA achieves a remarkable 2.4x speedup and 6.5x reduction in execution halting by asynchronously parallelizing observation, action generation, and execution stages in vision-language-action models.
Ventricular dysfunction can be surprisingly well-predicted in a zero-shot manner from ECG diagnostic probabilities, suggesting a structured encoding of cardiac function within these representations.
LLM agents controlling real-world tools are alarmingly easy to manipulate, with an 85% success rate for privilege escalation attacks, despite exhibiting basic security awareness.
LLMs can learn to generate more "organic" pull requests by distilling coding style, API usage, and architectural invariants from a project's commit history, leading to better acceptance rates.
Stop burying your agent harness logic in code: NLAHs let you express it in natural language, making it portable, editable, and analyzable.
Instruction-guided video editing can achieve impressive zero-shot performance simply by pre-training on motion-centric video restoration tasks *before* fine-tuning on paired editing data.
Forget brittle, hand-coded robot assembly routines: ATG-MoE learns complex, multi-skill manipulation directly from visual and language inputs, achieving impressive success rates in both simulation and real-world industrial tasks.
Video diffusion models can be aggressively quantized down to 6-bit precision with minimal quality loss by dynamically adapting the bit-width of each layer based on its temporal stability.
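The adaptive allocation described above reduces to a ranking rule: score each layer by how much its activations drift across diffusion timesteps, then hand the most temporally stable layers the lowest bit-widths. A minimal NumPy sketch of that rule; the scoring function, the three-tier bit budget, and the layer names are illustrative stand-ins, not the paper's allocator.

```python
import numpy as np

def assign_bitwidths(layer_acts, budget_bits=(4, 6, 8)):
    """Toy per-layer bit allocation: layers whose activations change little
    across timesteps are quantized more aggressively.
    layer_acts: dict name -> array of shape (timesteps, features)."""
    scores = {name: float(np.mean(np.std(a, axis=0)))  # temporal drift score
              for name, a in layer_acts.items()}
    ordered = sorted(scores, key=scores.get)           # most stable first
    n = len(ordered)
    bits = {}
    for i, name in enumerate(ordered):
        # split the ranking into equal tiers, one per available bit-width
        tier = min(i * len(budget_bits) // max(n, 1), len(budget_bits) - 1)
        bits[name] = budget_bits[tier]
    return bits

rng = np.random.default_rng(0)
acts = {
    "stable":   rng.normal(0, 0.01, (8, 16)) + 1.0,  # nearly constant in time
    "moderate": rng.normal(0, 0.1,  (8, 16)),
    "volatile": rng.normal(0, 1.0,  (8, 16)),
}
bits = assign_bitwidths(acts)
```

With these toy activations, the near-constant layer lands in the 4-bit tier and the volatile one keeps 8 bits, which is the shape of the trade-off the teaser describes.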
MLLMs can ace the test, but still fail to *see*—they often succeed at complex reasoning with symbols while failing at basic symbol recognition, revealing a reliance on linguistic priors over true visual perception.
Mimicking human cognition, FLAIR lets dialogue models "think while listening," boosting performance without adding latency.
LLMs can now generate Verilog code that's not just correct, but also optimized for real-world hardware constraints like power, performance, and area, thanks to a novel multi-agent system with evolving memory.
Animate 3D characters using bananas and plush toys – DancingBox turns everyday objects into motion capture proxies, making animation accessible to novices.
Achieve better compression in low-bit quantization by considering not just numerical sensitivity, but also the structural role of each layer.
LLMs struggle with code comprehension, but a simple RNN pass over their embeddings can boost accuracy by over 5%.
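The "simple RNN pass" idea is easy to prototype: keep the LLM frozen, treat its per-token embeddings as an input sequence, and sweep a small recurrent layer over them to produce a code-comprehension feature. A minimal NumPy sketch of an Elman-style pass; all weights, shapes, and names below are invented for illustration, not taken from the paper.

```python
import numpy as np

def rnn_pass(embeddings, W_h, W_x, b):
    """Single Elman-RNN sweep over frozen per-token embeddings;
    the final hidden state summarizes the whole code snippet."""
    h = np.zeros(W_h.shape[0])
    for x in embeddings:                  # one token embedding per step
        h = np.tanh(W_h @ h + W_x @ x + b)
    return h

rng = np.random.default_rng(0)
d_model, d_hidden, n_tokens = 32, 16, 10
tokens = rng.normal(size=(n_tokens, d_model))    # stand-in LLM embeddings
W_h = rng.normal(scale=0.1, size=(d_hidden, d_hidden))
W_x = rng.normal(scale=0.1, size=(d_hidden, d_model))
b = np.zeros(d_hidden)
summary = rnn_pass(tokens, W_h, W_x, b)
```

Presumably the summary vector feeds a lightweight classifier head, with only the RNN and head trained; that is what would keep a >5% gain cheap relative to fine-tuning the LLM itself.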
Human unpredictability is now a feature, not a bug: a mixed-reality testing framework leverages human interaction to generate high-quality corner cases for vehicle-infrastructure cooperation systems.
A new mixed reality testbed lets you plug real human drivers into a CAV simulation, offering unprecedented realism for testing autonomous vehicle interactions.
Forget real-world video datasets: training VLMs on just 7.7K synthetic videos with temporal primitives beats 165K real-world examples, unlocking surprisingly effective transfer learning for video reasoning.
A new RGB-T dataset and frequency-aware network expose the surprising limitations of existing UAV detectors when faced with real-world camouflage and complex backgrounds.
LLMs can exhibit surprising "strategic realism" when analyzing an ongoing geopolitical conflict, but their reasoning falters in politically ambiguous situations, revealing critical domain-specific limitations.
Jointly training audio watermarking and source separation unlocks robust multi-stream watermarking, enabling independent tracking of individual audio components within a mix.
Negative constraints offer a surprisingly robust path to AI alignment, sidestepping the sycophancy issues inherent in preference-based RLHF.
Directly modeling 3D geometry in dental scans unlocks a 9.58% accuracy boost in multi-disease diagnosis compared to methods relying on 2D or multi-view image representations.
Forget painstakingly creating 3D assets for robot training: ManiTwin automates the process, turning single images into simulation-ready objects at scale.
LLM agents can now leverage a unified memory framework that dynamically adapts to different question types, enabling more coherent and user-centric long-horizon dialogues.
SpeechLLMs can be made significantly faster and more accurate at question answering by explicitly training their attention mechanisms to focus on relevant evidence.
By aligning image and LiDAR features to event-derived spatiotemporal edges, $x^2$-Fusion achieves state-of-the-art accuracy in optical and scene flow estimation, particularly under challenging conditions where other multimodal fusion methods falter.
DriveFix tackles the "shaky camera" problem in 4D driving scene reconstruction, producing significantly more stable and coherent novel views by explicitly modeling spatio-temporal dependencies.
Achieve diffusion-level perceptual quality in monocular depth estimation at 40x the speed, by replacing the slow initial diffusion steps with a fast ViT-based depth map and refining in a compact latent space.
Achieve real-time cattle mounting pose estimation in complex environments with FSMC-Pose, a framework that outperforms existing methods while drastically reducing computational costs.
LLM-based simulations of public opinion suffer from "Diversity Collapse," but injecting explicit social identity representations into hidden states can fix it.
Multi-hop data synthesis using HopChain boosts VLM performance across a wide range of tasks, with gains of over 50 points in accuracy for ultra-long-context reasoning.
By explicitly modeling tooth relationships, TCATSeg achieves state-of-the-art accuracy in 3D dental model segmentation, even in challenging pre-orthodontic cases.
Distributional counterfactual explanations are now possible for black-box tabular models, thanks to a novel sparse search algorithm that sidesteps the need for gradients.
By intelligently injecting and removing noise, RaDAR significantly improves recommendation accuracy in sparse and noisy collaborative filtering environments.
LLMs can now scale depth more effectively: a new attention mechanism recovers diluted features in deeper layers, boosting performance with negligible overhead.
Streaming 3D reconstruction gets a free lunch: MeMix, a training-free module, slashes reconstruction errors by up to 40% by selectively updating memory patches, fighting catastrophic forgetting without extra parameters.
LLMs' true power lies in the "unexplainable" – capabilities that exceed rule-based systems, challenging the pursuit of full interpretability.
Scaling LLM-based multi-agent systems doesn't just need better prompts or models, but a whole new software engineering approach focused on managing runtime entropy.
LLMs struggle to effectively use private library APIs even when provided with the correct documentation, but PriCoder can boost their performance by over 20% through targeted training data synthesis.
Forget tedious fine-tuning: leveraging molecule identifiers as visual prompts unlocks surprisingly powerful zero-shot chemical reaction diagram parsing in VLMs.
Autonomous driving models can learn to avoid accidents *before* they happen by training on expert interventions and anticipating errors.
Can AI transform a grumpy cat meme into a beacon of positivity while keeping the cat recognizable?
Generate realistic GPS trajectories across an entire nation with TrajFlow, a new flow-matching model that leapfrogs diffusion-based approaches in scale, diversity, and efficiency.
Ditch expensive, rendering-based RL for autonomous driving: PerlAD uses offline data to train agents in a fast, vector-space pseudo-simulation, outperforming prior methods by 10% on driving score.
A 2B parameter model trained on a new 1.1M dataset can now forecast remote sensing scenes better than Gemini-2.5-Flash Image, suggesting that task-specific training data and methods can beat sheer scale.
LLMs struggle with low-resource general-purpose programming languages, and surprisingly, translating code *to* a low-resource language is harder than generating it from text.
Tool-using agents may seem capable, but they struggle to distinguish neutral actions from errors, highlighting a critical need for better step-level process understanding.
Forget training separate policies for every robot hand – this method learns one policy to control them all, slashing data needs and boosting performance by 50% in cross-embodiment manipulation.
Forget finetuning rare tokens: MoKus leverages cross-modal knowledge transfer to bind diverse textual knowledge to visual concepts, achieving high-fidelity customized generation.
Forget retraining your agent: Steve-Evolving distills execution failures into executable guardrails and successes into reusable skills, injecting them into an LLM planner for continual, parameter-free improvement.
By decoupling patch details from semantics, Cheers achieves state-of-the-art multimodal performance at 20% of the training cost of comparable models.
Achieve 92% accuracy in identifying who's commanding a robot from 34 meters away by fusing IMU and camera data, a 48% leap over prior art.
Cut sparse attention indexing costs by 75% without sacrificing quality by intelligently reusing indices across layers.
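A 75% cut follows directly from reuse: if the top-k indexing pass runs on only one layer in four and intermediate layers inherit its result, three of every four indexing passes disappear. A toy NumPy sketch of that schedule, assuming (unlike the real method) that a fixed stride and a single query vector suffice:

```python
import numpy as np

def topk_indices(q, keys, k):
    scores = keys @ q                        # the expensive indexing pass
    return np.argpartition(scores, -k)[-k:]  # k largest, unsorted

def sparse_attention_with_reuse(q, keys_per_layer, k=4, reuse_every=4):
    """Recompute top-k key indices only every `reuse_every` layers and
    reuse them in between, so indexing runs on 1/reuse_every of layers."""
    idx, selected = None, []
    for layer, keys in enumerate(keys_per_layer):
        if layer % reuse_every == 0:
            idx = topk_indices(q, keys, k)   # fresh indexing pass
        selected.append(keys[idx])           # attend only over cached indices
    return selected

rng = np.random.default_rng(0)
d, n_keys, n_layers = 8, 64, 8
q = rng.normal(size=d)
keys_per_layer = [rng.normal(size=(n_keys, d)) for _ in range(n_layers)]
selected = sparse_attention_with_reuse(q, keys_per_layer)
```

The quality question is whether the cached indices stay relevant across layers, which is exactly what a smarter-than-fixed-stride reuse policy would decide.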
Control both multi-subject identity and multi-granularity motion in video generation with DreamVideo-Omni, a framework that uses latent identity reinforcement learning to avoid identity degradation.
Forget brittle retrieval: QChunker uses a question-aware multi-agent debate to restructure RAG from retrieval-augmentation to *understanding*-retrieval-augmentation, boosting performance across diverse domains.
Scaling up LLMs boosts combinatorial creativity in code generation, but plateaus on exploratory tasks, revealing a "convergence-by-scaling" effect where larger models become less divergent.
Floor plan generation gets a major upgrade with HouseMind, a multimodal LLM that uses discrete room-instance tokens to achieve unprecedented geometric validity and controllability.
Autonomous LLM agents are riddled with vulnerabilities, as point defenses fail to address cross-temporal and multi-stage systemic risks like memory poisoning and intent drift.
Current embodied AI agents falter when faced with multi-floor complexity, as revealed by MANSION, a new language-driven framework for generating realistic, building-scale 3D environments.
Differentiable physics enables high-resolution 3D tomography of subsurface defects by enforcing thermodynamic laws as hard constraints, outperforming traditional methods and PINNs.
Forget scaling reasoning – this work shows that scaling visual perception using code-grounded data is the real key to unlocking MLLMs' STEM abilities.
Incomplete trajectory data got you down? This plug-and-play framework progressively aligns features from incomplete observations with complete ones, boosting prediction accuracy in autonomous driving scenarios.
Exploit the surprisingly stable, yet heterogeneous, sparsity patterns across attention heads to slash LLM attention latency by 2.88x without sacrificing quality.
A compact 0.9B multimodal model, GLM-OCR, achieves state-of-the-art document understanding by predicting multiple tokens at once, boosting decoding throughput without blowing up memory.
Achieve better video editing without retraining by dynamically locking background features based on a "hallucination metric" that detects when the diffusion model is about to go astray.
LLMs in collaborative coding often stumble on interaction subtleties, leading to a new class of problems called "Interaction Smells" that can now be systematically identified and mitigated.
By strategically increasing hash collisions, Nemo slashes write amplification in flash caches for tiny objects, a persistent bottleneck even with advanced SSDs.
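One way to read "strategically increasing hash collisions": force tiny objects to collide into shared, page-sized buckets so the cache writes whole flash pages instead of one tiny object at a time. A toy sketch under that reading; the slot count, the flush-on-full policy, and the class name are invented here, not Nemo's actual design.

```python
import hashlib

PAGE_SLOTS = 4          # objects per flash page (made-up granularity)

def bucket_of(key, n_buckets):
    # deliberately coarse hash: many keys collide into one bucket,
    # so co-located tiny objects get flushed together as one page
    return int(hashlib.sha1(key.encode()).hexdigest(), 16) % n_buckets

class TinyObjectCache:
    def __init__(self, n_buckets=2):
        self.buckets = [dict() for _ in range(n_buckets)]
        self.page_writes = 0        # each flush = one page-sized flash write

    def put(self, key, value):
        b = self.buckets[bucket_of(key, len(self.buckets))]
        b[key] = value
        if len(b) >= PAGE_SLOTS:    # page full: one write covers 4 objects
            self.page_writes += 1
            b.clear()

cache = TinyObjectCache(n_buckets=2)
for i in range(16):
    cache.put(f"obj{i}", b"x")
```

Writing 16 objects costs at most 4 page writes here instead of 16 object-sized writes, which is the write-amplification saving the teaser points at.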
K-means, previously relegated to offline processing, gets a 17.9x speed boost on modern GPUs thanks to Flash-KMeans' clever IO and contention optimizations.
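Flash-KMeans' actual kernels are GPU-specific, but the memory-tiling idea behind IO-friendly k-means can be sketched in NumPy: process points one tile at a time so a tile stays resident (in cache, or shared memory on a GPU) while the centroids stream past, using the standard expansion of squared distance. The function name and tile size below are illustrative.

```python
import numpy as np

def assign_tiled(points, centroids, tile=1024):
    """k-means assignment step computed tile by tile, so one tile of points
    stays resident while all centroids stream by."""
    labels = np.empty(len(points), dtype=np.int64)
    sq_c = (centroids ** 2).sum(axis=1)          # ||c||^2, reused every tile
    for start in range(0, len(points), tile):
        chunk = points[start:start + tile]
        # ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2; the ||x||^2 term is
        # constant per row, so it can be dropped without changing the argmin
        d = sq_c - 2.0 * chunk @ centroids.T
        labels[start:start + tile] = d.argmin(axis=1)
    return labels

rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0.0, 0.1, (50, 2)),
                    rng.normal(5.0, 0.1, (50, 2))])
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])
labels = assign_tiled(points, centroids, tile=16)
```

On a GPU the same tiling keeps the distance matrix from ever materializing in full, which is where most of the IO and contention savings would come from.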
RiO-DETR makes real-time oriented object detection with transformers a reality by cleverly decoupling angle estimation and injecting angular diversity into dense supervision.
Forget tweaking knobs – this new Gram-matrix-based audio representation lets you *retrieve* the perfect, editable audio effect preset, outperforming standard methods.
Event cameras can now estimate depth with significantly improved temporal consistency and accuracy thanks to a novel distillation approach from video foundation models, achieving a 53% reduction in depth error.
By learning visual representations from scene-level semantics down to pixel-level details, C2FMAE overcomes the limitations of both contrastive learning and masked image modeling.
Get 2x faster video generation from diffusion transformers without sacrificing quality, thanks to a clever parameter-free error compensation technique.
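The caching intuition behind such speedups can be sketched without the paper's specifics: evaluate the expensive transformer only on alternate steps, and on skipped steps reuse the cached output plus an estimate of its per-step drift, derived for free from the last two full evaluations. Everything below (the toy block, the half-delta rule) is an illustrative stand-in, not the paper's compensation scheme.

```python
import numpy as np

def block(x, t):
    # toy stand-in for one expensive diffusion-transformer evaluation
    return np.sin(x + 0.1 * t)

def generate(x, steps=8):
    """Evaluate `block` only on even steps; odd steps reuse the cached
    output plus half the drift observed between the last two evaluations."""
    prev, delta = None, 0.0
    outs = []
    for t in range(steps):
        if t % 2 == 0 or prev is None:
            y = block(x, t)                  # full evaluation
            if prev is not None:
                delta = 0.5 * (y - prev)     # two-step drift -> per-step estimate
            prev = y
        else:
            y = prev + delta                 # parameter-free compensation
        outs.append(y)
    return outs

x = np.zeros(3)
outs = generate(x)
```

Half the block evaluations are skipped (hence roughly 2x), and the compensated reuse tracks the true trajectory far better than naively repeating the cached output.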
Stop predicting the future, start predicting *change*: $Δ$VLA guides robotic action by modeling how world knowledge *varies* under actions, not by forecasting absolute future states.
Forget hand-tuning: PolyFormer learns to automatically simplify complex, physically-constrained optimization problems into efficient polytopic reformulations, achieving massive speedups and memory reductions.
LLMs can switch between reasoning and factual answering on the fly, without retraining, simply by conditioning on specific token prefixes.
RAMBO's instability got you down? ROMI offers a robust, value-aware model learning approach with implicitly differentiable adaptive weighting that outperforms RAMBO and other SOTA methods in offline RL benchmarks.
Token-level Mixture-of-Experts, directly ported from LLMs, can actually *hurt* autonomous driving performance in VLA models; SAMoE-VLA fixes this with scene-adaptive expert selection, achieving SOTA results with fewer parameters.
LLMs can significantly boost micro-expression recognition by reasoning about subtle facial muscle movements when guided by structured visual and relational prompts.
LLMs can automate and improve thematic analysis of qualitative data, achieving expert-level alignment in clinical domains through iterative codebook refinement.
MLLMs can now reliably interpret electromagnetic signals even in noisy environments, thanks to a new training framework and benchmark designed specifically for this challenging domain.
Adversarial training and synthetic data can significantly boost multilingual speaker verification performance, even with limited training data.
Beat the LLM inference bottleneck: SageSched's uncertainty-aware scheduling boosts efficiency by nearly 30% by predicting output length and balancing compute and memory demands.
Standard PINNs stumble in complex geometries, but MUSA-PINN leaps ahead by reformulating PDE constraints as multi-scale integral conservation laws, slashing errors by up to 93% in fluid flow simulations.
Intrinsic reward signals in unsupervised RL for LLMs inevitably collapse due to sharpening of the model's prior, but external rewards grounded in computational asymmetries offer a path to sustained scaling.
Achieve safer and more effective human-robot collaboration by decoupling task execution from human interaction using a redundant robot's null space.
Text-to-image customization can now preserve the original model's behavior, thanks to a decoupled learning objective that balances new concepts with pre-existing capabilities.
Achieve nearly 2x speedup in Stable Diffusion 3 by intelligently stitching together large and small diffusion models at both the pixel and timestep level.
LLMs under pressure to survive exhibit surprisingly frequent and diverse risky behaviors, from financial fraud to misinformation, highlighting a critical safety gap in agentic AI.
Current LLM safety measures are critically vulnerable to attacks grounded in Thai cultural nuances, as demonstrated by a new benchmark showing higher attack success rates compared to general Thai-language attacks.
RAG4CTS achieves state-of-the-art time-series forecasting by ditching static embeddings for a hierarchical, physics-informed retrieval approach that leverages raw historical regimes.
Ditch the optimization: MoRe achieves real-time 4D scene reconstruction from monocular video using a feedforward transformer that disentangles motion and structure.
Aura unlocks more accurate aviation time series forecasting by explicitly modeling how different types of external factors interact with temporal dynamics.
Current judge models for instruction-following are surprisingly unreliable, but a new benchmark exposes their flaws and offers a path to better alignment.
Group chats can be revitalized with LLM-powered agents, boosting message volume by nearly 30% in real-world deployments.
Interpolating latent representations before decoding yields a reconstruction FID (iFID) that finally aligns with the generation FID of latent diffusion models, achieving ~0.85 correlation where standard rFID fails.
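The measurement itself is simple to reproduce: encode real images, interpolate pairs of latents, decode, and score the decodes with FID against the real set. Spherical interpolation is the usual choice, since linear mixing of Gaussian-like latents shrinks their norms. A sketch of the interpolation step only; the surrounding encoder, decoder, and FID scorer are assumed, not shown.

```python
import numpy as np

def slerp(z0, z1, alpha):
    """Spherical interpolation between two latents; keeps the interpolant
    at a representative norm, unlike a straight linear mix."""
    z0f, z1f = z0.ravel(), z1.ravel()
    cos_omega = np.dot(z0f, z1f) / (np.linalg.norm(z0f) * np.linalg.norm(z1f))
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))
    if np.isclose(omega, 0.0):
        return (1 - alpha) * z0 + alpha * z1   # (near-)parallel: plain lerp
    return (np.sin((1 - alpha) * omega) * z0
            + np.sin(alpha * omega) * z1) / np.sin(omega)

rng = np.random.default_rng(0)
z0, z1 = rng.normal(size=64), rng.normal(size=64)
z_mid = slerp(z0, z1, 0.5)   # decode z_mid, then FID against reals -> iFID
```

Because `z_mid` sits slightly off the exact encoder manifold, decoding it probes the same generalization a sampler demands, which is presumably why iFID tracks generation FID where plain reconstruction rFID does not.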