Alibaba's global research initiative. Publishes actively on NLP, multimodal models, and AI systems.
CapRL++ redefines caption quality through utility, enabling models to produce high-fidelity descriptions without the constraints of traditional supervised fine-tuning.
LexRubric reveals that even state-of-the-art LLMs struggle with open-ended legal tasks, exposing critical gaps in their contextual understanding and reasoning abilities.
Achieving high-quality text-to-speech synthesis without intermediate representations, BareWave shows that direct waveform generation can rival traditional methods in intelligibility and naturalness.
Z-Reward achieves nearly 90% human preference accuracy by transforming subjective visual preferences into nuanced score distributions, outperforming traditional reward models.
Achieving a 20.24 percentage-point improvement in task success rates while slashing inference costs by 92% could revolutionize automated documentation verification in cloud environments.
Leveraging historical solving traces transforms software engineering agents into self-evolving entities, achieving a 50.40% success rate on SWE-bench Verified after just three iterations.
QueryAgent-R1 boosts query CTR by 2.9% and product conversion rate by 3.1% by aligning query generation with actual product retrieval.
FRAP achieves substantial improvements in performance estimation under distribution shifts by effectively merging the strengths of foundation and base models.
Streaming reasoning steps can boost multi-agent system performance by 7.3 percentage points on average, revealing a new dimension for scaling effectiveness and efficiency.
Skill-RM achieves superior performance by dynamically orchestrating diverse evaluation criteria, reshaping how we approach reward modeling in AI.
Video diffusion models can achieve superior human motion control by leveraging 3D mesh tokenization, revealing a deeper understanding of 3D structures than previously thought.
Harness-1's innovative use of externalized state management leads to an 11.4 point increase in retrieval performance, setting a new standard for search agents.
Seamless transitions between speech and singing modes are now driven purely by text context, achieving state-of-the-art results in code-switching synthesis.
Tool-augmented multimodal agents may appear to excel, but they often rely on learned tool-calling patterns rather than enhanced problem-solving abilities.
Forget static rubrics and expensive external models: EvoRubric co-evolves a single policy to generate both responses and the rubrics to evaluate them, outperforming traditional RLHF methods in open-ended generation tasks.
Early stopping can save over 20% of compute while improving reasoning accuracy in large language models.
A single feed-forward transformer now achieves state-of-the-art performance across diverse video geometry estimation tasks, rivaling specialized architectures.
Forget complex architectures: a simple transformer can generate metric-accurate dense depth maps from sparse observations, outperforming existing methods.
LLMs can learn to synthesize data more effectively by accumulating and transferring experience across a stream of sequential synthesis tasks, opening the door to more efficient and adaptable synthetic data generation.
Freezing a Sparse Autoencoder's encoder creates a reusable "safety dictionary" that generalizes to new risks in text-to-image diffusion models, offering a more robust alternative to fixed-layer steering.
LLM memory failures are systematic, stemming from operation-level issues like information loss and retrieval misalignment, and can be automatically corrected with prompt optimization guided by fine-grained error tracing.
Autonomous agents struggle to retain instructions when burdened with retrieving information from the open web, exposing a critical retrieval-reasoning trade-off.
Closing the sim-to-real gap in vision-language navigation requires benchmarks grounded in realistic 3D reconstructions, not just generated scenes.
Unlock a treasure trove of free training data: SIGMA turns millions of unannotated image edits into high-quality pixel masks, boosting image manipulation detection by 18%.
DiT activations are far more amenable to semi-structured sparsity than weights, unlocking significant inference speedups without sacrificing generation quality.
Uniboost decouples complex weighting schemes in recommendation systems, enabling precise attribution of each traffic allocation plan's contribution and boosting overall efficiency.
Stop wasting compute on redundant retrieval calls: DynFrame learns the optimal frame sampling density *within* each temporal window, slashing context length and boosting performance on complex video understanding tasks.
Long video generation fails not just because of limited context length, but because of *how* that context is allocated – and ReCA's hierarchical approach shows a way to fix it.
Scribble-guided image editing struggles more with understanding diverse instructions than with handling real-world images, a counterintuitive finding that unlocks significant performance gains.
By distilling lookahead planning into a lightweight generator, DeGRe achieves state-of-the-art recommendation reranking with a single, efficient greedy decoding pass.
By explicitly conditioning on the query, QGS achieves a 0.62% CTR increase in a major commercial search engine, proving that generative models can beat traditional deep learning baselines in search ranking when query context is properly handled.
Forget static hyperparameters: DVAO dynamically adjusts reward weights based on variance, leading to more stable and effective multi-objective RLHF.
Knowledge injection, reasoning supervision, and preference optimization can be combined to substantially improve semantic relevance judgment, outperforming even strong LLM baselines.
Pathology foundation models excel at clinical tasks, but SpaPath-Bench reveals they differ significantly in their ability to capture the spatial organization of tissues, highlighting a critical gap for spatially-aware applications.
Forget hand-crafted benchmarks: CUA-Gym's auto-generated training data lets computer-use agents crush existing open-source models on real-world tasks.
StreamChar achieves real-time audio-video generation with unprecedented fidelity and synchronization, overcoming the limitations of traditional autoregressive models.
Overweighting easy-to-reconstruct features in generative CTR models is leaving performance on the table, especially for cold-start and long-tail users.
Get 5.3% more clicks by intelligently scaling your CTR model's inference depth only when it's uncertain, without retraining or increasing worst-case latency.
LLMs can now autonomously engineer features as executable code, outperforming traditional methods and unlocking significant gains in real-world cloud resource optimization.
Frontier LLMs still struggle with preference coverage and group fairness when planning travel for multiple users, revealing a critical gap in real-world agent capabilities.
Forget maps: LLMs can learn end-to-end transit route planning directly from data, even grounding GPS coordinates without explicit mapping.
Forget expensive data curation: a simple, training-free entropy metric lets you train LLMs on just 20% of your reasoning data without sacrificing performance.
DiTs are leaving performance on the table by using vanilla residual connections, and a simple timestep-adaptive routing mechanism can unlock significant gains in both training efficiency and final image quality.
Multimodal LLMs struggle to pinpoint objects from nouns alone, but SWIM training realigns vision and language to outperform visual-prompt methods.
Full-attention LLMs are intrinsically sparse and can be transformed into highly efficient sparse models with minimal training, sidestepping the need for expensive sparse pre-training.
Technical artists overwhelmingly prefer this new method for single-image head mesh reconstruction, finding it closest to industry-grade usability.
On-device LLMs can now drive real-time recommendation improvements, unlocking faster adaptation to evolving user intent without cloud reliance.
Fine-tuning efficient few-step diffusion models no longer requires sacrificing their speed, thanks to a self-distillation approach that preserves inference capabilities.
Despite impressive OCR performance on existing benchmarks, today's best LMMs still struggle with the messy realities of enterprise document processing.
Today's visual generation models are often evaluated on the wrong things, leading to inflated performance claims that mask critical failures in spatial reasoning, temporal consistency, and causal understanding.
Transfer learning can unlock scalable emission control across diverse waste incineration plants by learning transferable system-level structures that capture physical constraints, operating-regime heterogeneity, and carbon-pollutant coupling.
Multilingual MoEs can achieve best-in-class performance-to-compute ratios, even with extreme sparsity, by strategically upcycling from dense models and exhibiting structured expert activation patterns across languages.
Multi-agent LLM systems are leaving performance on the table by treating structured agent interactions as generic traffic; Pythia shows how to unlock substantial gains by exploiting workflow semantics at the serving layer.
Vanilla on-policy distillation falls apart in multi-turn settings due to compounding errors, but a simple curriculum on trajectory length fixes it, even letting students beat their teachers.
Predicting pre-promotion conversions in e-commerce gets a boost with a new model that understands how users "window shop" before sales actually start.
LLMs can now directly predict geographic coordinates with high accuracy, even for vague locations and complex regions, bypassing the need for traditional geocoding pipelines.
TPGO allows multi-agent systems to learn from their own optimization history, leading to unprecedented self-improvement in performance.
LLMs still struggle to reason in context when cultural and linguistic nuances are involved, achieving only 44% accuracy on a new grounded benchmark spanning 14 languages.
Achieve state-of-the-art person re-identification with only 20% of the data by explicitly teaching the model to "think" before matching identities.
Allowing multiple support strategies in a single utterance can dramatically enhance the quality of emotional support conversations, leading to more effective dialogue outcomes.
RL fine-tuning of discrete diffusion models can be made dramatically more stable and effective by treating the final denoised sample as the action and reconstructing trajectories using the forward diffusion process.
Diffusion models are making mistakes because they're losing track of time, but a simple frequency-aware correction can get them back on track.
Stop retraining your diffusion models for every device: OFA-Diffusion lets you extract the right-sized model in a single training run.
Skip the costly human annotations: PromptEcho distills reward signals directly from frozen VLMs to boost text-to-image RL, achieving state-of-the-art results without any reward model training.
Ditching the critic doesn't mean sacrificing fine-grained credit assignment: RTMC leverages overlapping states in rollout trees to estimate per-step Q-values, outperforming critic-free baselines on SWE-bench.
Forget prompt engineering: E2E-REME directly generates executable Ansible playbooks from diagnosis reports, outperforming large LLMs in microservice auto-remediation accuracy and efficiency.
Unlock geometric reasoning in MLLMs by parsing diagrams into a unified formal language that spans both 2D and 3D geometry.
Human-like evaluation of long-form generative AI is now possible, thanks to a new framework that breaks down reference answers into weighted, context-aware scoring points.
Continual learning just got a turbo boost: C-Flat Turbo cuts training time by up to 25% without sacrificing accuracy, thanks to a clever gradient-skipping trick.
No single AI model dominates across all professional industries, revealing distinct occupational capability profiles and highlighting the need for specialized AI development.
Unlock zero-shot medical image analysis with MedP-CLIP, a model that understands both the big picture and the critical details, outperforming baselines in tasks from recognition to segmentation.
Decomposing 4D point cloud videos into spectral frequency bands unlocks superior geometric understanding, boosting performance on action recognition and semantic segmentation.
Unsupervised RL for text generation doesn't have to collapse into gibberish: rewarding relative information gain between specialist and generalist policies unlocks meaningful content creation.
Dense neural networks are choking on sparse recommendation data, but SSR's explicit sparsity unlocks continuous performance gains where dense models saturate.
RL fine-tuning of hybrid autoregressive-diffusion models can be made significantly more stable and effective by averaging gradients across multiple diffusion trajectories and filtering autoregressive tokens for consistency.
Forget global context – ReAlign leverages a stronger VLM to generate *local*, reasoning-guided descriptions that boost visual document retrieval by up to 2%.
Semantic Trimming and Auxiliary Multi-step Prediction (STAMP) slashes the computational cost of Generative Recommendation by up to 38% while simultaneously boosting performance.
LLMs, orchestrated as a team of specialized agents, can autonomously discover and verify zero-day vulnerabilities in real-world software with significantly higher success rates than existing automated exploit generation tools.
Forget multi-agent complexity: a single RL agent can outperform product-level baselines in persona-centric memory management for conversational AI.
E-commerce product understanding gets a boost: MOON3.0 leverages reasoning-aware multimodal learning to outperform existing methods in zero-shot tasks by explicitly modeling fine-grained attributes.
Ditch static data paths: TENT dynamically slices and sprays LLM data across heterogeneous interconnects, self-healing in under 50ms and boosting throughput by up to 36%.
MLLMs struggle to plan coherent interleaved text-and-image generation, often missing opportunities for tool use, revealing a critical gap in their ability to unify factuality with creativity.
Injecting carefully selected, reverse-ordered behavioral curricula into generative recommendation models can significantly boost conversion rates, as demonstrated by a 2% lift in online advertising revenue.
LLMs may ace synthetic benchmarks, but they fumble the efficiency test in real-world cloud service scenarios, revealing a critical gap in their readiness for customer-facing applications.
VLMs struggle to simultaneously optimize for both logical accuracy and aesthetics when generating academic illustrations, a challenge that test-time scaling can significantly alleviate.
Achieve kilometer-scale regional weather forecasts that significantly outperform operational NWP and AI baselines by intelligently coupling global and regional models.
Forget hand-picked genes – Lingshu-Cell models the entire transcriptome to predict cellular responses to perturbations, opening the door to in silico biological discovery.
Forget brittle, overfit skills – Trace2Skill distills diverse execution experiences into transferable agent skills that boost performance by up to 57.65% on unseen tasks, even when transferring skills learned by smaller models to larger ones.
Forget real-world video datasets: training VLMs on just 7.7K synthetic videos with temporal primitives beats 165K real-world examples, unlocking surprisingly effective transfer learning for video reasoning.
Forget expensive LLM-as-judge checks: Proxy-GRM learns transferable rubrics for vision-language reward models with a lightweight proxy, achieving SOTA results with 4x less data.
Multi-hop data synthesis using HopChain boosts VLM performance across a wide range of tasks, with gains of over 50 points in accuracy for ultra-long-context reasoning.
Forget scaling reasoning – this work shows that scaling visual perception using code-grounded data is the real key to unlocking MLLMs' STEM abilities.
Transform unstructured audio-visual signals into machine-readable structured knowledge with the Logics-Parsing-Omni model, which enforces strict alignment between high-level semantics and low-level facts.
LLMs can switch between reasoning and factual answering on the fly, without retraining, simply by conditioning on specific token prefixes.
LLMs can generate better recommendations if they pause to verify their reasoning steps, rather than reasoning in one long chain.
Multimodal models are often blind at birth: a new "Visual Attention Score" reveals they struggle to focus on visual inputs during cold-start, but a simple attention-guided fix can boost performance by 7%.
Datacenter networks are haunted by "ghosts": topology knowledge failures caused by link flaps that occur every 48 seconds at 2025 cluster scale. Existing mitigations are insufficient, but Open Atomic Ethernet offers a potential exorcism.
Finally, a CVR prediction dataset with labels from multiple attribution mechanisms, revealing that multi-attribution learning consistently boosts performance, but only with careful architecture and objective selection.
Despite achieving comparable overall scores, top-performing medical LLMs exhibit surprising differences in reasoning, evidence use, and longitudinal follow-up when evaluated on a new Chinese medical benchmark, revealing critical gaps in clinically actionable treatment planning.
Achieve dexterous hand retargeting that's both fast and generalizable by decomposing reinforcement learning policies into finger-specific modules coordinated by a residual network.