Tsinghua University's AI research group. Leading Chinese institution in NLP, knowledge graphs, and large language models.
Adversarial clothing with non-overlapping visible-thermal patterns can reliably evade RGB-T detectors, even transferring across different fusion architectures.
Ethereum builder centralization isn't just about who has the best order flow, but also about how network effects let incumbents stop depending on exclusive deals.
Tabular data synthesis no longer needs to sacrifice privacy for quality: pretraining on diverse datasets lets models generalize from limited context, breaking the traditional tradeoff.
Quadrupedal robots can now perform dynamic loco-manipulation in the real world, matching human teleoperation, using only onboard ego-centric vision and a low-frequency (5 Hz) open-vocabulary detector.
Margin loss fine-tuning of ECAPA-TDNNs slashes the error rate in spoken language identification by over 50%, highlighting the power of discriminative representation learning.
Forget separate structure and fidelity models – Khala shows you can generate high-quality music with text-vocal alignment using a single acoustic-token hierarchy.
Instead of training separate video diffusion models for each multimodal task, UniVidX learns a single model that handles diverse pixel-aligned video generation problems.
Simple, artist-friendly quad meshes can now be automatically generated on 3D shapes using a diffusion model trained on a continuous surface representation, sidestepping the complexity of discrete mesh optimization.
Forget manual skill annotation: Ctx2Skill lets language models teach themselves to master complex contexts, unlocking better reasoning without human intervention.
Ignoring the nuanced interplay between services and hosts in microservice architectures leaves nearly 50% of root causes undiscovered.
Semantic priors in neural speech codecs hit a wall: their benefits plateau beyond 6 kbps, revealing a fundamental limit to improving intelligibility at higher bitrates.
Untangling task-solving skills from factual knowledge in PRAG adapters makes them play better together, boosting performance when you combine multiple documents.
Multimodal perception is no longer just an add-on: GLM-5V-Turbo bakes it directly into the core of reasoning, planning, and action.
Fine-tune massive LLMs like Qwen3-235B with 31K context on a single 8x RTX 4090 server, thanks to a novel pipeline schedule that eliminates the weight binding bottleneck.
Robots can now navigate complex outdoor environments using only high-level human instructions and readily available GPS/map data, bypassing the need for expensive HD maps or limited short-horizon policies.
Decentralized debate among LLM agents doesn't just select the best solution for optimization modeling; it structurally enables agents to refine flawed candidates and even recover correct formulations through interaction.
Ditch the pixel-perfect edits: letting multimodal models fully *reimagine* images based on semantic understanding yields massive quality gains in refinement tasks.
Multi-agent LLM systems are leaving performance on the table by treating structured agent interactions as generic traffic; Pythia shows how to unlock substantial gains by exploiting workflow semantics at the serving layer.
Imagine specifying complex 3D articulations with just a few 2D sketches – Sketch2Arti makes it a reality.
LLMs can bootstrap their understanding of private APIs by autonomously learning from their own coding attempts, outperforming retrieval-augmented generation by 16% on code generation tasks.
Explicitly enumerating skills in-context doesn't scale for agentic LLMs, but retrieving skills on demand can substantially improve performance – if the LLM can figure out when and which skill to load.
LLMs can now generate driving rules from traffic laws with significantly improved accuracy by grounding their reasoning in structured traffic scenarios.
Finding similar analog circuits across netlists, schematics, and descriptions just got way easier: a new model achieves 75% recall, unlocking better circuit design automation.
By unifying generative and discriminative approaches, UniGenDet achieves superior image generation and detection, suggesting that these tasks benefit from a symbiotic relationship previously hindered by architectural divergence.
Point-VLMs can learn to see the world as it really is: targeted reward assignment and cross-modal verification nearly close the reality gap in 3D reasoning.
MLLMs often *hallucinate* the referent of a pointing gesture, latching onto nearby or salient objects instead of truly understanding spatial semantics.
Unlock higher-capacity covert communication with LLMs: a new steganography scheme uses list decoding to substantially outperform existing methods without sacrificing security or efficiency.
Achieve superhuman dexterity: ALAS unlocks robust long-horizon task completion by decoupling environment understanding from motor control, enabling generalization across diverse human-scene interaction scenarios.
MLLMs still struggle to integrate diverse data for clinical reasoning, as evidenced by their poor performance on a new ophthalmology benchmark spanning image quality assessment to diagnosis.
Pocket-sized VLA models can now achieve state-of-the-art robot manipulation performance by pre-training on a curated multimodal dataset and injecting manipulation-relevant representations into the action space.
LLMs can reason more effectively by directly tracking their own belief in the correct answer throughout the reasoning process, enabling more targeted policy updates.
TurboQuant's claimed advantages over RaBitQ in quantization don't hold up under rigorous, reproducible comparison, raising questions about its practical utility.
Stop fragmented land cover predictions: SSDM leverages global geospatial embeddings to guide local feature extraction, achieving state-of-the-art performance in high-resolution remote sensing mapping.
Freezing a Stable Diffusion backbone and injecting CLIP and BLIP features lets you beat the state-of-the-art in zero-shot sketch-based 3D shape retrieval, without any costly retraining.
Training-free diffusion models can now harmonize satellite imagery across diverse domains, enabling scalable remote-sensing synthesis without retraining.
LLMs don't see cities neutrally; their perception is skewed towards a culturally uneven baseline, favoring Western perspectives.
LLMs can fix 26% more bugs when given access to intermediate runtime states during program repair, proving that even the best models struggle to infer root causes from just failure symptoms.
Time-to-collision metrics miss critical collision risk information, but a new 2D acceleration-based metric anticipates collisions far better.
MV-HGNN achieves superior 3D shape retrieval by effectively leveraging geometric dependencies and semantic alignment, outperforming existing methods in zero-shot settings.
LLMs can significantly boost their emotional intelligence simply by role-playing conversations with themselves, iteratively refining their ability to both recognize and express emotions.
LLMs can reason better over noisy and distributed information when you break down RAG into specialized agent roles for summarization, extraction, and reasoning.
Agentic AI's fragility stems from relying on LLMs for system control, but Arbiter-K flips the script by using a deterministic kernel to govern the LLM, achieving up to 95% unsafe action interception.
VLAs can learn to adapt to new environments at test time without any fine-tuning, achieving significant performance gains on robotic manipulation and Atari games.
Autoregressive video diffusion gets a 2x speed boost with minimal quality loss, thanks to a clever speculative decoding approach that uses an image-quality router to verify proposed video blocks.
Targeted neuron fine-tuning can unlock superior image translation capabilities in multimodal large language models, outperforming traditional methods by preserving pre-trained knowledge.
Autoregressive 3D layout generation can be both more physically plausible and significantly faster by repurposing existing 3D generative models.
LLMs don't just reflect gender bias in public vs. private spaces; they encode nuanced, micro-level mappings that substantially exceed real-world distributions, shaping spatial gender narratives in unexpected ways.
MLLMs still struggle to reason about everyday situations when they require identifying and using visual clues, despite excelling at tasks relying on pre-existing knowledge.
RL can teach LLMs to be better interviewers, adaptively prompting users to reveal hidden information in dialogue.
MLLMs don't just forget language; they also suffer from perceptual drift in cross-modal spaces. MAny offers a training-free merging strategy to fix both.
Achieve photorealistic, identity-consistent facial video edits from text prompts without video training data, rivaling traditional rendering software.
Simply plugging in RoTE, a lightweight temporal embedding module, can boost existing Transformer-based sequential recommendation models by over 20% on standard benchmarks.
Synthesizing realistic anomaly images for industrial assembly is now possible thanks to a diffusion model that respects component pose and assembly relationships.
Extracting agricultural parcels from satellite imagery gets a whole lot harder (and more realistic) with a new dataset focused on the complex, irregular, and heterogeneous terrain of terraced farms.
LLMs underperform traditional ML methods in software fairness tasks, challenging the assumption that they offer a silver bullet solution for bias mitigation.
Continual learning just got a turbo boost: C-Flat Turbo cuts training time by up to 25% without sacrificing accuracy, thanks to a clever gradient-skipping trick.
Current Chinese AI-generated text detection benchmarks are too homogeneous; C-ReD fixes this with real-world prompts and diverse LLMs, enabling better generalization.
Finally, a model that speaks fluent Lottie: LottieGPT generates editable vector animations directly from text or images, opening up a new frontier for resolution-independent, compact, and semantically structured multimedia creation.
Achieve state-of-the-art object detection accuracy and efficiency by fusing RGB frames and event streams with a sparse hypergraph and a fine-grained mixture of experts, enabling real-time edge deployment.
See how ideas like "democracy" or "freedom" have subtly shifted their meaning across different news sources and time periods, all within a single, comparable framework.
You can now train your capacitance extraction models on a diverse, multi-PDK dataset of open-source designs, but be ready to trade accuracy for speed when choosing between CNNs and GNNs.
Current memory systems, despite their complexity, are surprisingly worse than naive RAG when applied to continuous lifelogging scenarios, revealing a critical need for better context preservation.
Achieve superior 3D scene reconstruction from aerial images with significantly reduced transmission overhead by directly optimizing communication for rendering quality.
Forget human-annotated datasets: MathAgent synthesizes mathematical reasoning data so effectively that models trained on just 1K generated examples outperform those trained on existing datasets.
DPO might not be the only game in town: a decision-directed approach to reward modeling can outperform it in pairwise preference optimization.
Forget complex disentanglement architectures or low-quality synthetic targets: MimicLM achieves superior voice imitation by cleverly using synthetic speech as the *source* and real speech as the *target* in a pseudo-parallel training setup.
By explicitly modeling both consensus and discrepancy between RGB and IR data, this text-guided multispectral object detector significantly boosts performance on multispectral benchmarks.
A surprisingly simple VLA model, StarVLA-α, beats more complex systems on real-world robotics tasks, suggesting that VLM backbones are more critical than intricate architectures.
By unifying contrastive and reconstructive learning with targeted augmentations, CoRe-ECG extracts more robust and physiologically meaningful representations from unlabeled ECG data than existing self-supervised methods.
Unlock zero-shot generalization in robot manipulation by generating diverse, affordance-aware training data with 3D generative models and Vision Foundation Models.
Attention Sink, where Transformers fixate on seemingly irrelevant tokens, is more than just a quirk – it's a fundamental challenge impacting training, inference, and even causing hallucinations, demanding a systematic approach to understanding and mitigating its effects.
Robots can now focus on the *right* body parts for interaction, thanks to a new vision-language model that understands human motion commands and precisely localizes task-relevant 3D keypoints.
Medical MLLMs, despite their size and training data, stumble on basic image classification due to four key failure modes, revealing a disconnect between hype and clinical readiness.
Twitch developers' reliance on Discord for support creates a form of "platform labor" as they bridge the gap between formal platform support and informal community assistance.
Neural synchronization, long hypothesized to support flexible coordination in biological brains, can now be harnessed to improve the learning efficiency of Vision Transformers.
Synthesizing novel views from extrapolated poses no longer requires dense supervision, thanks to a geometry-conditioned diffusion model that explicitly learns to handle out-of-trajectory artifacts.
Achieve state-of-the-art real-world image dehazing by jointly reconstructing the clear scene and scattering variables, even with non-uniform haze and complex lighting.
Achieve state-of-the-art metal artifact reduction in CT images with MARMamba, a Mamba-based model that's both lightweight and preserves anatomical structure.
By reflecting on its own reasoning, ReflectRM achieves a +10.2 improvement in mitigating positional bias compared to leading generative reward models, making it a far more stable evaluator.
Hierarchical RL can tame the curse of dimensionality in fleet management, enabling superior maintenance and logistics decisions compared to monolithic approaches.
Humans are still way better than LLMs at trial-and-error problem solving, and this new dataset of human problem-solving trajectories shows us why.
Generating coordinated bimanual grasps on diverse objects is now possible thanks to a dataset of nearly 10 million grasps and a model that adapts to object geometry and size.
SubFLOT tackles federated learning's heterogeneity problem by cleverly using optimal transport to create personalized submodels on the server, sidestepping the computational burden of client-side pruning.
Current multimodal LLMs struggle with guideline-constrained clinical reasoning, but a simple multi-agent framework can significantly boost their performance on real-world lung cancer diagnosis and treatment.
Legged robots can now recover from sensor noise and crazy user commands with 10x greater reliability, thanks to a new method that respects the robot's competence boundaries.
Forget global context – ReAlign leverages a stronger VLM to generate *local*, reasoning-guided descriptions that boost visual document retrieval by up to 2%.
Forget fixed pipelines: training an agent to *learn* when and how to search for knowledge dramatically improves performance on knowledge-based visual question answering.
Forget tedious hyperparameter sweeps; AutoSOTA automates the *entire* research pipeline, discovering 105 new SOTA models across diverse AI tasks in just five hours per paper.
Existing multimodal sentiment analysis models crumble under real-world noise, but QA-MoE leverages uncertainty to dynamically route inputs, achieving robust performance across a continuous spectrum of data quality.
LLMs can rediscover known algorithms, but only after targeted unlearning and with the help of a generative verifier to avoid "thought collapse," revealing both the innovative potential and limitations of these models.
Achieve state-of-the-art 3D object detection in adverse weather by adaptively routing between LiDAR, radar, and fused features based on learned weather conditions.
Synthesizing realistic human mobility in data-scarce regions is now possible thanks to a dual-LLM-agent framework that learns physical constraints via reinforcement learning.
Frontier video models like Veo-3 can generate surprisingly good task-level plans for robot manipulation, but still need help with the fine details.
Current multimodal models can't handle the rapid-fire tactical analysis required for boxing commentary, as revealed by a new dataset and evaluation framework.
Finally, underwater SLAM can produce photorealistic maps thanks to a novel medium-aware Gaussian map representation.
VPNs, relied upon for secure and private browsing, are surprisingly susceptible to session manipulation attacks from co-located users on the same server.
LLMs can now recommend talent without falling prey to position bias, thanks to a new architecture that understands candidate relationships.
LLM agents get stuck in error feedback loops, but ProCeedRL's process-level critic and reflection-based demonstrations can actively break these cycles and substantially improve exploration.
LLMs, like humans, exhibit a "frequency bias," performing better when prompted and fine-tuned with more common textual expressions.
Language models are increasingly doing their real work in the "invisible" latent space, not the tokens we see.