Alibaba's global research initiative. Publishes actively on NLP, multimodal models, and AI systems.
MTP acceptance rates can be dramatically improved by addressing entropy fluctuations, leading to up to 1.8x faster RL training.
Shifting credit assignment to fine-grained decision points boosts agentic RL performance by nearly 4 points, challenging the conventional focus on tool-call boundaries.
Privilege-induced style drift can undermine reasoning model performance, but RLCSD redirects the learning signal toward task-relevant tokens.
Recursive composition of verifiable environments can boost reasoning performance in RL by up to 3.1 points while using only a fraction of the original environments.
K-Forcing accelerates token generation by 2.4-3.5x without abandoning the autoregressive backbone, which makes it well suited to high-load deployments.
FlowTracer reveals that optimizing token-level rewards based on attention-induced information flow can dramatically enhance reasoning performance in LLMs.
Bootstrapping LLM agents to co-evolve as both agent and environment can lead to significant performance gains, with an average improvement of over 4% on complex tasks.
State-of-the-art generative models struggle to maintain physical consistency and coherent interactions over time, revealing critical gaps in their world modeling capabilities.
A conflict-aware approach to decoding can triple resistance to errors in LLMs while maintaining accuracy, fundamentally changing how we handle knowledge conflicts in AI.
Z-Reward achieves 41.3% better human preference alignment in text-to-image generation by transforming complex reasoning into efficient score distributions.
Achieving a 20.24 percentage-point improvement in task success rates while slashing inference costs by 92% could revolutionize automated documentation verification in cloud environments.
CapRL++ redefines caption quality through utility, enabling models to produce high-fidelity descriptions without the constraints of traditional supervised fine-tuning.
LexRubric reveals that even state-of-the-art LLMs struggle with open-ended legal tasks, exposing critical gaps in their contextual understanding and reasoning abilities.
Leveraging historical solving traces transforms software engineering agents into self-evolving entities, achieving a 50.40% success rate on SWE-bench Verified after just three iterations.
QueryAgent-R1 achieves a remarkable 2.9% boost in query CTR and 3.1% in product conversion rates by aligning query generation with actual product retrieval.
FRAP achieves substantial improvements in performance estimation under distribution shifts by effectively merging the strengths of foundation and base models.
Streaming reasoning steps can boost multi-agent system performance by 7.3 percentage points on average, revealing a new dimension for scaling effectiveness and efficiency.
EvoTrainer reveals that co-evolving LLM policies with their training harnesses can significantly outperform static approaches, especially in complex tasks like software engineering.
Skill-RM achieves superior performance by dynamically orchestrating diverse evaluation criteria, reshaping how we approach reward modeling in AI.
Seamless transitions between speech and singing modes are now driven purely by text context, achieving state-of-the-art results in code-switching synthesis.
Harness-1's innovative use of externalized state management leads to an 11.4 point increase in retrieval performance, setting a new standard for search agents.
Video diffusion models can achieve superior human motion control by leveraging 3D mesh tokenization, revealing a deeper understanding of 3D structures than previously thought.
Tool-augmented multimodal agents may appear to excel, but they often rely on learned tool-calling patterns rather than enhanced problem-solving abilities.
Early stopping can save over 20% of compute while improving reasoning accuracy in large language models.
A single feed-forward transformer now achieves state-of-the-art performance across diverse video geometry estimation tasks, rivaling specialized architectures.
Forget complex architectures: a simple transformer can generate metric-accurate dense depth maps from sparse observations, outperforming existing methods.
LLMs can learn to synthesize data more effectively by accumulating and transferring experience across a stream of sequential synthesis tasks, opening the door to more efficient and adaptable synthetic data generation.
Freezing a Sparse Autoencoder's encoder creates a reusable "safety dictionary" that generalizes to new risks in text-to-image diffusion models, offering a more robust alternative to fixed-layer steering.
Forget static rubrics and expensive external models: EvoRubric co-evolves a single policy to generate both responses and the rubrics to evaluate them, outperforming traditional RLHF methods in open-ended generation tasks.
Unlock a treasure trove of free training data: SIGMA turns millions of unannotated image edits into high-quality pixel masks, boosting image manipulation detection by 18%.
Closing the sim-to-real gap in vision-language navigation requires benchmarks grounded in realistic 3D reconstructions, not just generated scenes.
Autonomous agents struggle to retain instructions when burdened with retrieving information from the open web, exposing a critical retrieval-reasoning trade-off.
LLM memory failures are systematic, stemming from operation-level issues like information loss and retrieval misalignment, and can be automatically corrected with prompt optimization guided by fine-grained error tracing.
Uniboost decouples complex weighting schemes in recommendation systems, enabling precise attribution of each traffic allocation plan's contribution and boosting overall efficiency.
Stop wasting compute on redundant retrieval calls: DynFrame learns the optimal frame sampling density *within* each temporal window, slashing context length and boosting performance on complex video understanding tasks.
DiT activations are far more amenable to semi-structured sparsity than weights, unlocking significant inference speedups without sacrificing generation quality.
Long video generation fails not just because of limited context length, but because of *how* that context is allocated, and ReCA's hierarchical approach shows a way to fix it.
By explicitly conditioning on the query, QGS achieves a 0.62% CTR increase in a major commercial search engine, proving that generative models can beat traditional deep learning baselines in search ranking when query context is properly handled.
StreamChar achieves real-time audio-video generation with unprecedented fidelity and synchronization, overcoming the limitations of traditional autoregressive models.
Scribble-guided image editing struggles more with understanding diverse instructions than with handling real-world images, a counterintuitive finding that unlocks significant performance gains.
Pathology foundation models excel at clinical tasks, but SpaPath-Bench reveals they differ significantly in their ability to capture the spatial organization of tissues, highlighting a critical gap for spatially-aware applications.
Forget static hyperparameters: DVAO dynamically adjusts reward weights based on variance, leading to more stable and effective multi-objective RLHF.
Knowledge injection, reasoning supervision, and preference optimization can be combined to substantially improve semantic relevance judgment, outperforming even strong LLM baselines.
Forget hand-crafted benchmarks: CUA-Gym's auto-generated training data lets computer-use agents crush existing open-source models on real-world tasks.
By distilling lookahead planning into a lightweight generator, DeGRe achieves state-of-the-art recommendation reranking with a single, efficient greedy decoding pass.
Get 5.3% more clicks by intelligently scaling your CTR model's inference depth only when it's uncertain, without retraining or increasing worst-case latency.
Overweighting easy-to-reconstruct features in generative CTR models is leaving performance on the table, especially for cold-start and long-tail users.
Frontier LLMs still struggle with preference coverage and group fairness when planning travel for multiple users, revealing a critical gap in real-world agent capabilities.
LLMs can now autonomously engineer features as executable code, outperforming traditional methods and unlocking significant gains in real-world cloud resource optimization.
Forget maps: LLMs can learn end-to-end transit route planning directly from data, even grounding GPS coordinates without explicit mapping.
Forget expensive data curation: a simple, training-free entropy metric lets you train LLMs on just 20% of your reasoning data without sacrificing performance.
DiTs are leaving performance on the table by using vanilla residual connections, and a simple timestep-adaptive routing mechanism can unlock significant gains in both training efficiency and final image quality.
Multimodal LLMs struggle to pinpoint objects from nouns alone, but SWIM training realigns vision and language to outperform visual-prompt methods.
Full-attention LLMs are intrinsically sparse and can be transformed into highly efficient sparse models with minimal training, sidestepping the need for expensive sparse pre-training.
Fine-tuning efficient few-step diffusion models no longer requires sacrificing their speed, thanks to a self-distillation approach that preserves inference capabilities.
On-device LLMs can now drive real-time recommendation improvements, unlocking faster adaptation to evolving user intent without cloud reliance.
Technical artists overwhelmingly prefer this new method for single-image head mesh reconstruction, finding it closest to industry-grade usability.
Despite impressive OCR performance on existing benchmarks, today's best LMMs still struggle with the messy realities of enterprise document processing.
Today's visual generation models are often evaluated on the wrong things, leading to inflated performance claims that mask critical failures in spatial reasoning, temporal consistency, and causal understanding.
Transfer learning can unlock scalable emission control across diverse waste incineration plants by learning transferable system-level structures that capture physical constraints, operating-regime heterogeneity, and carbon-pollutant coupling.
Multi-agent LLM systems are leaving performance on the table by treating structured agent interactions as generic traffic; Pythia shows how to unlock substantial gains by exploiting workflow semantics at the serving layer.
Multilingual MoEs upcycled from dense models can achieve best-in-class performance-to-compute ratios even with extreme sparsity, and they exhibit structured expert activation patterns across languages.
Vanilla on-policy distillation falls apart in multi-turn settings due to compounding errors, but a simple curriculum on trajectory length fixes it, even letting students beat their teachers.
Predicting pre-promotion conversions in e-commerce gets a boost with a new model that understands how users "window shop" before sales actually start.
LLMs can now directly predict geographic coordinates with high accuracy, even for vague locations and complex regions, bypassing the need for traditional geocoding pipelines.
TPGO allows multi-agent systems to learn from their own optimization history, leading to unprecedented self-improvement in performance.
Achieve state-of-the-art person re-identification with only 20% of the data by explicitly teaching the model to "think" before matching identities.
LLMs still struggle to reason in context when cultural and linguistic nuances are involved, achieving only 44% accuracy on a new grounded benchmark spanning 14 languages.
RL fine-tuning of discrete diffusion models can be made dramatically more stable and effective by treating the final denoised sample as the action and reconstructing trajectories using the forward diffusion process.
Allowing multiple support strategies in a single utterance can dramatically enhance the quality of emotional support conversations, leading to more effective dialogue outcomes.
Diffusion models are making mistakes because they're losing track of time, but a simple frequency-aware correction can get them back on track.
Skip the costly human annotations: PromptEcho distills reward signals directly from frozen VLMs to boost text-to-image RL, achieving state-of-the-art results without any reward model training.
Stop retraining your diffusion models for every device: OFA-Diffusion lets you extract the right-sized model in a single training run.
Continual learning just got a turbo boost: C-Flat Turbo cuts training time by up to 25% without sacrificing accuracy, thanks to a clever gradient-skipping trick.
Unlock geometric reasoning in MLLMs by parsing diagrams into a unified formal language that spans both 2D and 3D geometry.
Ditching the critic doesn't mean sacrificing fine-grained credit assignment: RTMC leverages overlapping states in rollout trees to estimate per-step Q-values, outperforming critic-free baselines on SWE-bench.
Unsupervised RL for text generation doesn't have to collapse into gibberish: rewarding relative information gain between specialist and generalist policies unlocks meaningful content creation.
No single AI model dominates across all professional industries, revealing distinct occupational capability profiles and highlighting the need for specialized AI development.
Decomposing 4D point cloud videos into spectral frequency bands unlocks superior geometric understanding, boosting performance on action recognition and semantic segmentation.
Unlock zero-shot medical image analysis with MedP-CLIP, a model that understands both the big picture and the critical details, outperforming baselines in tasks from recognition to segmentation.
Forget prompt engineering: E2E-REME directly generates executable Ansible playbooks from diagnosis reports, outperforming large LLMs in microservice auto-remediation accuracy and efficiency.
Human-like evaluation of long-form generative AI is now possible, thanks to a new framework that breaks down reference answers into weighted, context-aware scoring points.
Dense neural networks are choking on sparse recommendation data, but SSR's explicit sparsity unlocks continuous performance gains where dense models saturate.
RL fine-tuning of hybrid autoregressive-diffusion models can be made significantly more stable and effective by averaging gradients across multiple diffusion trajectories and filtering autoregressive tokens for consistency.
Forget global context: ReAlign leverages a stronger VLM to generate *local*, reasoning-guided descriptions that boost visual document retrieval by up to 2%.
Semantic Trimming and Auxiliary Multi-step Prediction (STAMP) slashes the computational cost of Generative Recommendation by up to 38% while simultaneously boosting performance.
LLMs, orchestrated as a team of specialized agents, can autonomously discover and verify zero-day vulnerabilities in real-world software with significantly higher success rates than existing automated exploit generation tools.
Forget multi-agent complexity: a single RL agent can outperform product-level baselines in persona-centric memory management for conversational AI.
Ditch static data paths: TENT dynamically slices and sprays LLM data across heterogeneous interconnects, self-healing in under 50ms and boosting throughput by up to 36%.
E-commerce product understanding gets a boost: MOON3.0 leverages reasoning-aware multimodal learning to outperform existing methods in zero-shot tasks by explicitly modeling fine-grained attributes.
MLLMs struggle to plan coherent interleaved text-and-image generation, often missing opportunities for tool use, revealing a critical gap in their ability to unify factuality with creativity.
Injecting carefully-selected, reverse-ordered behavioral curricula into generative recommendation models can significantly boost conversion rates, as demonstrated by a 2% lift in online advertising revenue.
VLMs struggle to simultaneously optimize for both logical accuracy and aesthetics when generating academic illustrations, a challenge that test-time scaling can significantly alleviate.
Achieve kilometer-scale regional weather forecasts that significantly outperform operational NWP and AI baselines by intelligently coupling global and regional models.
LLMs may ace synthetic benchmarks, but they fumble the efficiency test in real-world cloud service scenarios, revealing a critical gap in their readiness for customer-facing applications.
Forget brittle, overfit skills: Trace2Skill distills diverse execution experiences into transferable agent skills that boost performance by up to 57.65% on unseen tasks, even when transferring skills learned by smaller models to larger ones.
Forget hand-picked genes: Lingshu-Cell models the entire transcriptome to predict cellular responses to perturbations, opening the door to in silico biological discovery.
Forget real-world video datasets: training VLMs on just 7.7K synthetic videos with temporal primitives beats 165K real-world examples, unlocking surprisingly effective transfer learning for video reasoning.
Multi-hop data synthesis using HopChain boosts VLM performance across a wide range of tasks, with gains of over 50 points in accuracy for ultra-long-context reasoning.
Forget expensive LLM-as-judge checks: Proxy-GRM learns transferable rubrics for vision-language reward models with a lightweight proxy, achieving SOTA results with 4x less data.