Alibaba's global research initiative, publishing actively on NLP, multimodal models, and AI systems.
Predicting pre-promotion conversions in e-commerce gets a boost with a new model that understands how users "window shop" before sales actually start.
LLMs can now directly predict geographic coordinates with high accuracy, even for vague locations and complex regions, bypassing the need for traditional geocoding pipelines.
TPGO allows multi-agent systems to learn from their own optimization history, enabling sustained self-improvement in performance.
Achieve state-of-the-art person re-identification with only 20% of the data by explicitly teaching the model to "think" before matching identities.
LLMs still struggle to reason in context when cultural and linguistic nuances are involved, achieving only 44% accuracy on a new grounded benchmark spanning 14 languages.
Allowing multiple support strategies in a single utterance can dramatically improve emotional support conversations, producing more effective dialogue outcomes.
RL fine-tuning of discrete diffusion models can be made dramatically more stable and effective by treating the final denoised sample as the action and reconstructing trajectories using the forward diffusion process.
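A minimal sketch of the trajectory-reconstruction idea, assuming a toy masking forward process; `forward_mask`, the `denoiser(xt, t) -> (batch, seq, vocab)` interface, and the plain REINFORCE weighting are illustrative stand-ins, not the paper's implementation:

```python
import torch

def forward_mask(x0, t, T, mask_id):
    """Toy forward process: independently mask each token with probability t/T."""
    keep = torch.rand(x0.shape) >= t / T
    return torch.where(keep, x0, torch.full_like(x0, mask_id))

def reinforce_loss(denoiser, x0, reward, T=10, mask_id=0):
    """Treat the final denoised sample x0 as the action; rebuild training
    trajectories by re-noising x0 with the forward process instead of
    storing reverse-process rollouts."""
    loss = 0.0
    for t in range(1, T + 1):
        xt = forward_mask(x0, t, T, mask_id)           # corrupted state at level t
        logp = torch.log_softmax(denoiser(xt, t), dim=-1)
        logp_action = logp.gather(-1, x0.unsqueeze(-1)).squeeze(-1).sum(-1)
        loss = loss - (reward * logp_action).mean()    # REINFORCE on the action x0
    return loss / T
```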
Diffusion models are making mistakes because they're losing track of time, but a simple frequency-aware correction can get them back on track.
Skip the costly human annotations: PromptEcho distills reward signals directly from frozen VLMs to boost text-to-image RL, achieving state-of-the-art results without any reward model training.
Stop retraining your diffusion models for every device: OFA-Diffusion lets you extract the right-sized model in a single training run.
Decomposing 4D point cloud videos into spectral frequency bands unlocks superior geometric understanding, boosting performance on action recognition and semantic segmentation.
No single AI model dominates across all professional industries, revealing distinct occupational capability profiles and highlighting the need for specialized AI development.
Ditching the critic doesn't mean sacrificing fine-grained credit assignment: RTMC leverages overlapping states in rollout trees to estimate per-step Q-values, outperforming critic-free baselines on SWE-bench.
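A sketch of how critic-free per-step Q estimation could work, assuming rollouts that share a (state, action) prefix simply pool their Monte Carlo returns; the structure and the `q_from_rollouts` name are hypothetical, not RTMC's API:

```python
from collections import defaultdict

def q_from_rollouts(rollouts):
    """Rollouts passing through the same tree node pool their returns, so
    overlapping (state, action) pairs get lower-variance Q estimates than
    any single trajectory would give.

    rollouts: list of (trajectory, final_reward), trajectory = [(state, action), ...]
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for traj, ret in rollouts:
        for state, action in traj:
            totals[(state, action)] += ret
            counts[(state, action)] += 1
    return {sa: totals[sa] / counts[sa] for sa in totals}

# Example: two rollouts branch after a shared first step.
rollouts = [
    ([("s0", "edit"), ("s1", "test")], 1.0),
    ([("s0", "edit"), ("s1", "commit")], 0.0),
]
q = q_from_rollouts(rollouts)
print(q[("s0", "edit")])  # 0.5 -- averaged over both rollouts through this node
```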
Forget prompt engineering: E2E-REME directly generates executable Ansible playbooks from diagnosis reports, outperforming large LLMs in microservice auto-remediation accuracy and efficiency.
Unlock geometric reasoning in MLLMs by parsing diagrams into a unified formal language that spans both 2D and 3D geometry.
Human-like evaluation of long-form generative AI is now possible, thanks to a new framework that breaks down reference answers into weighted, context-aware scoring points.
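One plausible way such a scorer could aggregate weighted scoring points; the function signature and the toy substring judge are assumptions (in practice the judge would be an LLM call), not the framework's API:

```python
def rubric_score(response, scoring_points, judge):
    """Weighted aggregation over scoring points decomposed from a reference
    answer; judge(point, response) -> 1.0 if the point is satisfied, else 0.0."""
    total = sum(w for _, w in scoring_points)
    earned = sum(w * judge(point, response) for point, w in scoring_points)
    return earned / total if total else 0.0

# Toy judge: a substring check stands in for an LLM verdict.
points = [("mentions photosynthesis", 2.0), ("names chlorophyll", 1.0)]
judge = lambda point, resp: float(point.split(" ", 1)[1] in resp.lower())
print(rubric_score("Photosynthesis uses chlorophyll.", points, judge))  # 1.0
```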
Continual learning just got a turbo boost: C-Flat Turbo cuts training time by up to 25% without sacrificing accuracy, thanks to a clever gradient-skipping trick.
Unlock zero-shot medical image analysis with MedP-CLIP, a model that understands both the big picture and the critical details, outperforming baselines in tasks from recognition to segmentation.
Unsupervised RL for text generation doesn't have to collapse into gibberish: rewarding relative information gain between specialist and generalist policies unlocks meaningful content creation.
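A minimal sketch of one common instantiation of such a reward, the average per-token log-likelihood ratio between specialist and generalist policies; it assumes HuggingFace-style models whose outputs expose `.logits`:

```python
import torch

@torch.no_grad()
def info_gain_reward(specialist, generalist, ids):
    """High reward when the specialist is confident about content the frozen
    generalist finds surprising, which discourages collapse into generic
    or degenerate text."""
    def avg_logp(model, ids):
        logits = model(input_ids=ids[:, :-1]).logits
        logp = torch.log_softmax(logits, dim=-1)
        tok = logp.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
        return tok.mean(dim=-1)  # average next-token log-prob per sequence
    return avg_logp(specialist, ids) - avg_logp(generalist, ids)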
Dense neural networks are choking on sparse recommendation data, but SSR's explicit sparsity unlocks continuous performance gains where dense models saturate.
Forget global context – ReAlign leverages a stronger VLM to generate *local*, reasoning-guided descriptions that boost visual document retrieval by up to 2%.
RL fine-tuning of hybrid autoregressive-diffusion models can be made significantly more stable and effective by averaging gradients across multiple diffusion trajectories and filtering autoregressive tokens for consistency.
Semantic Trimming and Auxiliary Multi-step Prediction (STAMP) slashes the computational cost of Generative Recommendation by up to 38% while simultaneously boosting performance.
LLMs, orchestrated as a team of specialized agents, can autonomously discover and verify zero-day vulnerabilities in real-world software with significantly higher success rates than existing automated exploit generation tools.
Forget multi-agent complexity: a single RL agent can outperform product-level baselines in persona-centric memory management for conversational AI.
Ditch static data paths: TENT dynamically slices and sprays LLM data across heterogeneous interconnects, self-healing in under 50ms and boosting throughput by up to 36%.
E-commerce product understanding gets a boost: MOON3.0 leverages reasoning-aware multimodal learning to outperform existing methods in zero-shot tasks by explicitly modeling fine-grained attributes.
MLLMs struggle to plan coherent interleaved text-and-image generation, often missing opportunities for tool use, revealing a critical gap in their ability to unify factuality with creativity.
Achieve kilometer-scale regional weather forecasts that significantly outperform operational NWP and AI baselines by intelligently coupling global and regional models.
LLMs may ace synthetic benchmarks, but they fumble the efficiency test in real-world cloud service scenarios, revealing a critical gap in their readiness for customer-facing applications.
Injecting carefully-selected, reverse-ordered behavioral curricula into generative recommendation models can significantly boost conversion rates, as demonstrated by a 2% lift in online advertising revenue.
VLMs struggle to simultaneously optimize for both logical accuracy and aesthetics when generating academic illustrations, a challenge that test-time scaling can significantly alleviate.
Forget brittle, overfit skills – Trace2Skill distills diverse execution experiences into transferable agent skills that boost performance by up to 57.65% on unseen tasks, even when transferring skills learned by smaller models to larger ones.
Forget hand-picked genes – Lingshu-Cell models the entire transcriptome to predict cellular responses to perturbations, opening the door to in silico biological discovery.
Forget real-world video datasets: training VLMs on just 7.7K synthetic videos with temporal primitives beats 165K real-world examples, unlocking surprisingly effective transfer learning for video reasoning.
Multi-hop data synthesis using HopChain boosts VLM performance across a wide range of tasks, with gains of over 50 points in accuracy for ultra-long-context reasoning.
Forget expensive LLM-as-judge checks: Proxy-GRM learns transferable rubrics for vision-language reward models with a lightweight proxy, achieving SOTA results with 4x less data.
Forget scaling reasoning – this work shows that scaling visual perception using code-grounded data is the real key to unlocking MLLMs' STEM abilities.
Transform unstructured audio-visual signals into machine-readable structured knowledge with the Logics-Parsing-Omni model, which enforces strict alignment between high-level semantics and low-level facts.
LLMs can switch between reasoning and factual answering on the fly, without retraining, simply by conditioning on specific token prefixes.
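A trivial illustration of prefix conditioning at inference time; the control tokens below are hypothetical, since the paper's exact prefixes aren't given here:

```python
# Hypothetical control prefixes, not the paper's actual tokens.
REASON_PREFIX = "<think>"
FACTUAL_PREFIX = "<answer>"

def build_prompt(question: str, mode: str) -> str:
    """Switch between step-by-step reasoning and direct factual answering
    purely by prepending a conditioning prefix -- no retraining involved."""
    prefix = REASON_PREFIX if mode == "reason" else FACTUAL_PREFIX
    return f"{prefix}{question}"

print(build_prompt("What is the capital of France?", mode="factual"))
```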
LLMs can generate better recommendations if they pause to verify their reasoning steps, rather than reasoning in one long chain.
Multimodal models are often blind at birth: a new "Visual Attention Score" reveals they struggle to focus on visual inputs during cold-start, but a simple attention-guided fix can boost performance by 7%.
Datacenter networks are haunted by "ghosts"—topology knowledge failures due to link flaps that occur every 48 seconds at 2025 cluster scale—and existing mitigations are insufficient, but Open Atomic Ethernet offers a potential exorcism.
Finally, a CVR prediction dataset with labels from multiple attribution mechanisms, revealing that multi-attribution learning consistently boosts performance, but only with careful architecture and objective selection.
Despite achieving comparable overall scores, top-performing medical LLMs exhibit surprising differences in reasoning, evidence use, and longitudinal follow-up when evaluated on a new Chinese medical benchmark, revealing critical gaps in clinically actionable treatment planning.
Achieve dexterous hand retargeting that's both fast and generalizable by decomposing reinforcement learning policies into finger-specific modules coordinated by a residual network.
An 80B model that runs like a 3B? Qwen3-Coder-Next shows you can get competitive coding agent performance with a fraction of the active parameters, thanks to smart training.
Achieve both long-term scene consistency and precise camera control in world models with UCM, a novel framework sidestepping explicit 3D reconstruction.
Classical Chinese, with its conciseness and obscurity, unlocks a surprisingly effective attack vector against LLM safety filters, and can be automatically exploited via bio-inspired optimization.
LLMs still struggle with PhD-level scanning probe microscopy tasks, but SPM-Bench offers a new automated pipeline to generate challenging scientific benchmarks and quantify model "personalities" like "Conservative" or "Gambler."
LLMs can handle basic route planning, but fall apart when user preferences enter the mix, as shown by a new benchmark based on real-world queries.
Alibaba's FuxiShuffle dynamically adapts to workload and resource fluctuations in ultra-large distributed data processing, slashing job completion times and resource consumption where prior systems falter.
Forget interaction-driven next-item prediction: SIGMA uses instruction-following and semantic grounding to create a generative recommender that adapts to evolving trends and diverse tasks on AliExpress.
Taobao's recommender system just got a 1.65% CTR boost by compressing ultra-long user behavior sequences with a hierarchical codebook and sparse attention, proving that personalized interest centers can be learned efficiently.
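A sketch of hierarchical (residual) codebook compression in the spirit described, with made-up sizes; the production system's codebooks, dimensions, and sparse-attention stage are not reproduced here:

```python
import torch

def residual_quantize(x, codebooks):
    """Each level quantizes the residual left by the previous one, turning a
    long sequence of behavior embeddings into a short stack of discrete codes.

    x: (num_behaviors, dim); codebooks: list of (codes_per_level, dim) tensors.
    Returns integer codes of shape (levels, num_behaviors)."""
    residual = x
    ids = []
    for cb in codebooks:
        d = torch.cdist(residual, cb)        # distances to every code
        idx = d.argmin(dim=-1)               # nearest code per behavior
        residual = residual - cb[idx]        # pass the remainder down a level
        ids.append(idx)
    return torch.stack(ids)

codebooks = [torch.randn(256, 64) for _ in range(3)]   # 3-level hierarchy
codes = residual_quantize(torch.randn(10_000, 64), codebooks)
print(codes.shape)  # torch.Size([3, 10000])
```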
LLMs can generate unbiased pseudo-labels for unexposed items in pre-ranking, boosting click-through rate by 3.07% in production while improving diversity.
LLM knowledge distillation and cross-user preference mining can significantly boost search relevance and CTR prediction, even for cold-start users.
LLMs can uncover previously hidden vulnerabilities in database management systems by intelligently fuzzing obscure, system-level features that traditional fuzzers miss.
Taobao's new LTV ranking framework boosts long-term user engagement by learning nuanced video influence and creator-driven re-engagement, all while fitting within existing industrial constraints.
LLMs struggle to understand nuanced values across languages, with accuracy dropping below 77% and varying by over 20% between languages, as revealed by the new X-Value benchmark.
CoT reasoning can hurt recommender performance by drowning out important ID signals – unless you compress reasoning chains and use bias-subtracted contrastive decoding to realign the inference subspace.
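A generic contrastive-decoding step consistent with that description; the mixing weight `alpha` and the choice of reasoning-only logits as the subtracted bias are assumptions, not values from the paper:

```python
import torch

def bias_subtracted_logits(logits_full, logits_cot_only, alpha=0.5):
    """Subtract a scaled copy of the logits produced *without* ID features
    (the reasoning-only bias) from the full logits, re-amplifying the
    item-ID signal that the CoT text drowned out."""
    return logits_full - alpha * logits_cot_only

full = torch.randn(1, 32000)   # logits from IDs + compressed reasoning chain
cot = torch.randn(1, 32000)    # logits from the reasoning text alone
next_token = bias_subtracted_logits(full, cot).argmax(dim=-1)
```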
Achieve diverse and stylistically consistent long-form piano accompaniments by explicitly planning style at the measure level and retrieving suitable patterns from a corpus.
By unifying contrastive learning with pose-conditioned generative modeling, BindCLIP produces interaction-aware embeddings that substantially improve virtual screening, especially in challenging out-of-distribution scenarios.
Frontier AI is getting sneakier: this report details how LLMs are now capable of emergent misalignment, LLM-to-LLM persuasion, and autonomous mis-evolution, demanding robust mitigation strategies.
Training web agents in a simulator can now match real-world performance: Qwen3-14B, fine-tuned with WebWorld-synthesized trajectories, rivals GPT-4o on WebArena.
LLM benchmark accuracy jumps 10% when evaluated on a cleaned-up version of Humanity's Last Exam, highlighting the significant impact of dataset noise on performance metrics.
A new family of GUI agents, GUI-Owl-1.5, leapfrogs existing open-source models on 20+ GUI benchmarks, proving that multi-platform, real-time GUI automation is now within reach.
Ditch the black-box reward function: this new rubric-based RL framework uses LLMs to judge responses against interpretable criteria, offering a more robust and transparent approach to alignment.
Overcome "intent myopia" in trigger-based recommendations with DAIAN, a network that adaptively learns user intent from click correlations and hybrid ID/semantic similarity, boosting CTR in e-commerce.
RynnBrain leapfrogs existing embodied foundation models, offering a unified, open-source spatiotemporal model that excels at physically grounded reasoning and planning across a wide range of benchmarks.
LLMs can overcome "tunnel vision" in multi-turn search scenarios by using information gain to guide dynamic prompting interventions, leading to more efficient and accurate reasoning.
Forget huge models: parameter-efficient fine-tuning turns tiny language models into code-generating powerhouses that outperform larger, untuned counterparts.
Failure-driven post-training, combined with a meticulously curated 10M token STEM dataset, unlocks a 4.68% performance boost in LLM reasoning, proving that strategic data synthesis around model weaknesses is a powerful path to improvement.
LLM safety guardrails are far less robust than benchmarks suggest, with accuracy dropping by as much as 57% on novel adversarial attacks, and some even generating harmful content in a "helpful mode" jailbreak.
ToolRMs drastically improve tool-use accuracy in LLMs, outperforming existing models by up to 17.94%, while also reducing output token usage by over 66% through efficient inference-time scaling.