Alibaba DAMO Academy

Predicting pre-promotion conversions in e-commerce gets a boost with a new model that understands how users "window shop" before sales actually start.

Kaiyuan Li

Natural Language Processing Recommendation & Information Retrieval

DAMO·2d ago

ReaGeo: Reasoning-Enhanced End-to-End Geocoding with LLMs

LLMs can now directly predict geographic coordinates with high accuracy, even for vague locations and complex regions, bypassing the need for traditional geocoding pipelines.

Gong Wenbin

Natural Language Processing Reasoning & Chain-of-Thought Recommendation & Information Retrieval

Apr 22, 2026

DAMO·3d ago

Learning to Evolve: A Self-Improving Framework for Multi-Agent Systems via Textual Parameter Graph Optimization

TPGO allows multi-agent systems to learn from their own optimization history, leading to unprecedented self-improvement in performance.

Shan He, Runze Wang, Zhuoyun Du +4

Natural Language Processing Tool Use & Agents Training Efficiency & Optimization

Apr 21, 2026

4d ago·also DAMO

Thinking Before Matching: A Reinforcement Reasoning Paradigm Towards General Person Re-Identification

Achieve state-of-the-art person re-identification with only 20% of the data by explicitly teaching the model to "think" before matching identities.

Quan Zhang, Jingze Wu, Xiaohua Xie +2

Computer Vision Reasoning & Chain-of-Thought

Beijing Language and Culture University·4d ago·also DAMO, ELLIS, HIT, IBM Research +4

CulturALL: Benchmarking Multilingual and Multicultural Competence of LLMs on Grounded Tasks

LLMs still struggle to reason in context when cultural and linguistic nuances are involved, achieving only 44% accuracy on a new grounded benchmark spanning 14 languages.

Wenjiang Luo, Haotian Ye, Md Mehrab Hossain +16

Eval Frameworks & Benchmarks Natural Language Processing

Apr 20, 2026

DAMO·5d ago·also HIT, Humanoid Robot (Shanghai) Co., School of Informatics, Soochow

Modeling Multiple Support Strategies within a Single Turn for Emotional Support Conversations

Allowing multiple support strategies in a single utterance can dramatically enhance the quality of emotional support conversations, leading to more effective dialogue outcomes.

Jie Zhu, Huaixia Dou, Junhui Li +5

Natural Language Processing

5d ago·also DAMO, Tsinghua AI, BUPT

UDM-GRPO: Stable and Efficient Group Relative Policy Optimization for Uniform Discrete Diffusion Models

RL fine-tuning of discrete diffusion models can be made dramatically more stable and effective by treating the final denoised sample as the action and reconstructing trajectories using the forward diffusion process.

Jiaqi Wang, Haoge Deng, Ting Pan +10

Architecture Design (Transformers, SSMs, MoE) Computer Vision RLHF & Preference Learning +1

Apr 17, 2026

DAMO·1w ago

Elucidating the SNR-t Bias of Diffusion Probabilistic Models

Diffusion models are making mistakes because they're losing track of time, but a simple frequency-aware correction can get them back on track.

Computer Vision

Apr 14, 2026

DAMO·1w ago·also ZJU

PromptEcho: Annotation-Free Reward from Vision-Language Models for Text-to-Image Reinforcement Learning

Skip the costly human annotations: PromptEcho distills reward signals directly from frozen VLMs to boost text-to-image RL, achieving state-of-the-art results without any reward model training.

Wanggui He, Mushui Liu, Hao Jiang +1

Computer Vision Multimodal Models RLHF & Preference Learning

DAMO·1w ago·also Independent researcher, RUC, Tencent AI

OFA-Diffusion Compression: Compressing Diffusion Model in One-Shot Manner

Stop retraining your diffusion models for every device: OFA-Diffusion lets you extract the right-sized model in a single training run.

Haoyang Jiang, Mingyang Yi, Xiuyu Li +4

Computer Vision Inference & Quantization Training Efficiency & Optimization

Apr 13, 2026

DAMO·1w ago·also CAS, NTU, ZJU

STS-Mixer: Spatio-Temporal-Spectral Mixer for 4D Point Cloud Video Understanding

Decomposing 4D point cloud videos into spectral frequency bands unlocks superior geometric understanding, boosting performance on action recognition and semantic segmentation.

Xueying Jiang, Gongjie Zhang, Xiaoqin Zhang +2

Architecture Design (Transformers, SSMs, MoE) Computer Vision

DAMO·1w ago·also RUC

OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language World Models

No single AI model dominates across all professional industries, revealing distinct occupational capability profiles and highlighting the need for specialized AI development.

Fei Huang, Jianhong Tu, Yang Su

Eval Frameworks & Benchmarks Tool Use & Agents World Models & Planning

1w ago·also DAMO

RTMC: Step-Level Credit Assignment via Rollout Trees

Ditching the critic doesn't mean sacrificing fine-grained credit assignment: RTMC leverages overlapping states in rollout trees to estimate per-step Q-values, outperforming critic-free baselines on SWE-bench.

Tao Wang, Suhang Zheng, Xiaoxiao Xu

Reasoning & Chain-of-Thought Tool Use & Agents World Models & Planning

DAMO·1w ago·also Microsoft Research, BIT, HKUST, PKU

E2E-REME: Towards End-to-End Microservices Auto-Remediation via Experience-Simulation Reinforcement Fine-Tuning

Forget prompt engineering: E2E-REME directly generates executable Ansible playbooks from diagnosis reports, outperforming large LLMs in microservice auto-remediation accuracy and efficiency.

Lingzhe Zhang, Minghua He, Zhaoyang Liu +1

Code Generation & Program Synthesis Distributed Systems & Hardware Tool Use & Agents

1w ago·also DAMO

Geoparsing: Diagram Parsing for Plane and Solid Geometry with a Unified Formal Language

Unlock geometric reasoning in MLLMs by parsing diagrams into a unified formal language that spans both 2D and 3D geometry.

Peijie Wang, Ming-Liang Zhang, Jun Cao +8

Computer Vision Multimodal Models Reasoning & Chain-of-Thought

Pengcheng Laboratory·1w ago·also DAMO, CAS, CUHK, Fudan

Judge Like Human Examiners: A Weighted Importance Multi-Point Evaluation Framework for Generative Tasks with Long-form Answers

Human-like evaluation of long-form generative AI is now possible, thanks to a new framework that breaks down reference answers into weighted, context-aware scoring points.

Guoxin Yu, Chulun Zhou, Lemao Liu +6

Eval Frameworks & Benchmarks Natural Language Processing

DAMO·1w ago·also ETH, Tsinghua AI, NJU, NTU +1

A Faster Path to Continual Learning

Continual learning just got a turbo boost: C-Flat Turbo cuts training time by up to 25% without sacrificing accuracy, thanks to a clever gradient-skipping trick.

Wei Li, Borui Kang, Ziwei Liu

Architecture Design (Transformers, SSMs, MoE) Training Efficiency & Optimization

DAMO·1w ago·also SCU, Shanghai AI Lab, XJU

MedP-CLIP: Medical CLIP with Region-Aware Prompt Integration

Unlock zero-shot medical image analysis with MedP-CLIP, a model that understands both the big picture and the critical details, outperforming baselines in tasks from recognition to segmentation.

Jiahui Peng, He Yao, Jingwen Li +9

Computer Vision Multimodal Models Scientific Discovery & Drug Design

DAMO·1w ago·also PKU

Triviality Corrected Endogenous Reward

Unsupervised RL for text generation doesn't have to collapse into gibberish: rewarding relative information gain between specialist and generalist policies unlocks meaningful content creation.

Xinda Wang, Zhengxu Hou, Yangshijie Zhang +6

Natural Language Processing Reasoning & Chain-of-Thought RLHF & Preference Learning

Apr 9, 2026

DAMO·2w ago·also Xiaomi EV

Beyond Dense Connectivity: Explicit Sparsity for Scalable Recommendation

Dense neural networks are choking on sparse recommendation data, but SSR's explicit sparsity unlocks continuous performance gains where dense models saturate.

Lei Shen, Bing Wang

Architecture Design (Transformers, SSMs, MoE) Recommendation & Information Retrieval Training Efficiency & Optimization

Apr 8, 2026

2w ago·also DAMO, Tsinghua AI, Fudan

ReAlign: Optimizing the Visual Document Retriever with Reasoning-Guided Fine-Grained Alignment

Forget global context – ReAlign leverages a stronger VLM to generate *local*, reasoning-guided descriptions that boost visual document retrieval by up to 2%.

Yifan Ji, Zhipeng Xu, Zhenghao Liu +4

Computer Vision Multimodal Models Recommendation & Information Retrieval

2w ago·also DAMO, ZJU

MAR-GRPO: Stabilized GRPO for AR-diffusion Hybrid Image Generation

RL fine-tuning of hybrid autoregressive-diffusion models can be made significantly more stable and effective by averaging gradients across multiple diffusion trajectories and filtering autoregressive tokens for consistency.

Xiaoxiao Ma, Tianfei Ren, Jie Huang +2

Architecture Design (Transformers, SSMs, MoE) Computer Vision Training Efficiency & Optimization

Apr 7, 2026

2w ago·also DAMO

Semantic Trimming and Auxiliary Multi-step Prediction for Generative Recommendation

Semantic Trimming and Auxiliary Multi-step Prediction (STAMP) slashes the computational cost of Generative Recommendation by up to 38% while simultaneously boosting performance.

Tianyu Zhan, Kairui Fu, Zheqi Lv +1

Natural Language Processing Recommendation & Information Retrieval Training Efficiency & Optimization

Apr 6, 2026

DAMO·2w ago·also Aarhus University, WHU

A Multi-Agent Framework for Automated Exploit Generation with Constraint-Guided Comprehension and Reflection

LLMs, orchestrated as a team of specialized agents, can autonomously discover and verify zero-day vulnerabilities in real-world software with significantly higher success rates than existing automated exploit generation tools.

Yilin Zhou, Wenyuan Xu

Code Generation & Program Synthesis Red-Teaming & Adversarial Robustness Tool Use & Agents

Apr 2, 2026

DAMO·3w ago

DeltaMem: Towards Agentic Memory Management via Reinforcement Learning

Forget multi-agent complexity: a single RL agent can outperform product-level baselines in persona-centric memory management for conversational AI.

Shen Huang, Chu Liu, Shouqing Yang +1

RLHF & Preference Learning Tool Use & Agents

Apr 1, 2026

Independent Researcher·3w ago·also DAMO, Tsinghua AI, Ant Group, Moonshot +1

TENT: A Declarative Slice Spraying Engine for Performant and Resilient Data Movement in Disaggregated LLM Serving

Ditch static data paths: TENT dynamically slices and sprays LLM data across heterogeneous interconnects, self-healing in under 50ms and boosting throughput by up to 36%.

Yineng Zhang, Yuhao Fu, Mingxing Zhang

Architecture Design (Transformers, SSMs, MoE) Distributed Systems & Hardware Inference & Quantization +1

DAMO·3w ago

MOON3.0: Reasoning-aware Multimodal Representation Learning for E-commerce Product Understanding

E-commerce product understanding gets a boost: MOON3.0 leverages reasoning-aware multimodal learning to outperform existing methods in zero-shot tasks by explicitly modeling fine-grained attributes.

Multimodal Models Reasoning & Chain-of-Thought Recommendation & Information Retrieval

Mar 31, 2026

DAMO·3w ago

ATP-Bench: Towards Agentic Tool Planning for MLLM Interleaved Generation

MLLMs struggle to plan coherent interleaved text-and-image generation, often missing opportunities for tool use, revealing a critical gap in their ability to unify factuality with creativity.

Yinuo Liu, Jiahao Zhang

Eval Frameworks & Benchmarks Multimodal Models Tool Use & Agents

Mar 30, 2026

Northwest Polytechnical University·3w ago·also DAMO, CAS, ZJU

Skillful Kilometer-Scale Regional Weather Forecasting via Global and Regional Coupling

Achieve kilometer-scale regional weather forecasts that significantly outperform operational NWP and AI baselines by intelligently coupling global and regional models.

Qilong Yuan, Lefei Shen, Bo Wu +1

Scientific Discovery & Drug Design

DAMO·3w ago·also Fudan

CirrusBench: Evaluating LLM-based Agents Beyond Correctness in Real-World Cloud Service Environments

LLMs may ace synthetic benchmarks, but they fumble the efficiency test in real-world cloud service scenarios, revealing a critical gap in their readiness for customer-facing applications.

Guangquan Hu, Chenghuang Shen, Xingyan Liu +4

Eval Frameworks & Benchmarks Tool Use & Agents

DAMO·3w ago

RCLRec: Reverse Curriculum Learning for Modeling Sparse Conversions in Generative Recommendation

Injecting carefully-selected, reverse-ordered behavioral curricula into generative recommendation models can significantly boost conversion rates, as demonstrated by a 2% lift in online advertising revenue.

Data Curation & Synthetic Data Recommendation & Information Retrieval Training Efficiency & Optimization

DAMO·3w ago

AIBench: Evaluating Visual-Logical Consistency in Academic Illustration Generation

VLMs struggle to simultaneously optimize for both logical accuracy and aesthetics when generating academic illustrations, a challenge that test-time scaling can significantly alleviate.

Quanhao Li, Hong-Tao Yu, Zhen Xing +1

Computer Vision Eval Frameworks & Benchmarks Multimodal Models

Mar 26, 2026

ETH·Mar 26, 2026·also DAMO

Trace2Skill: Distill Trajectory-Local Lessons into Transferable Agent Skills

Forget brittle, overfit skills – Trace2Skill distills diverse execution experiences into transferable agent skills that boost performance by up to 57.65% on unseen tasks, even when transferring skills learned by smaller models to larger ones.

Robotics & Embodied AI Tool Use & Agents Training Efficiency & Optimization

DAMO·Mar 26, 2026

Lingshu-Cell: A generative cellular world model for transcriptome modeling toward virtual cells

Forget hand-picked genes – Lingshu-Cell models the entire transcriptome to predict cellular responses to perturbations, opening the door to in silico biological discovery.

Scientific Discovery & Drug Design World Models & Planning

Mar 18, 2026

Tsinghua AI·Mar 18, 2026·also DAMO, ZJU

Learning Transferable Temporal Primitives for Video Reasoning via Synthetic Videos

Forget real-world video datasets: training VLMs on just 7.7K synthetic videos with temporal primitives beats 165K real-world examples, unlocking surprisingly effective transfer learning for video reasoning.

Songtao Jiang, Sibo Song, Chenyi Zhou +10

Computer Vision Data Curation & Synthetic Data Multimodal Models

Mar 17, 2026

Tsinghua AI·Mar 17, 2026·also DAMO, Fudan

HopChain: Multi-Hop Data Synthesis for Generalizable Vision-Language Reasoning

Multi-hop data synthesis using HopChain boosts VLM performance across a wide range of tasks, with gains of over 50 points in accuracy for ultra-long-context reasoning.

Shenzhi Wang, Shixuan Liu, Jing Zhou +8

Data Curation & Synthetic Data Multimodal Models Reasoning & Chain-of-Thought

Mar 17, 2026·also DAMO, CAS

Rationale Matters: Learning Transferable Rubrics via Proxy-Guided Critique for VLM Reward Models

Forget expensive LLM-as-judge checks: Proxy-GRM learns transferable rubrics for vision-language reward models with a lightweight proxy, achieving SOTA results with 4x less data.

Weijie Qiu, Dai Guan, Junxin Wang

Multimodal Models RLHF & Preference Learning

Mar 11, 2026

Tsinghua AI·Mar 11, 2026·also DAMO, NanKai University, NJU, Scale +1

CodePercept: Code-Grounded Visual STEM Perception for MLLMs

Forget scaling reasoning – this work shows that scaling visual perception using code-grounded data is the real key to unlocking MLLMs' STEM abilities.

Tongkun Guan, Zhibo Yang, Jianqiang Wan +13

Code Generation & Program Synthesis Multimodal Models Reasoning & Chain-of-Thought

Mar 10, 2026

DAMO·Mar 10, 2026

Logics-Parsing-Omni Technical Report

Transform unstructured audio-visual signals into machine-readable structured knowledge with the Logics-Parsing-Omni model, which enforces strict alignment between high-level semantics and low-level facts.

Computer Vision Multimodal Models Speech & Audio

Mar 9, 2026

Tsinghua AI·Mar 9, 2026·also DAMO, HIT

Revealing Behavioral Plasticity in Large Language Models: A Token-Conditional Perspective

LLMs can switch between reasoning and factual answering on the fly, without retraining, simply by conditioning on specific token prefixes.

Liyuan Mao, Le Yu, Jing Zhou +8

Natural Language Processing Reasoning & Chain-of-Thought RLHF & Preference Learning

Mar 8, 2026

NUS·Mar 8, 2026·also DAMO

Verifiable Reasoning for LLM-based Generative Recommendation

LLMs can generate better recommendations if they pause to verify their reasoning steps, rather than reasoning in one long chain.

Natural Language Processing Reasoning & Chain-of-Thought Recommendation & Information Retrieval

Mar 4, 2026

Tsinghua AI·Mar 4, 2026·also DAMO, ZJU

From Narrow to Panoramic Vision: Attention-Guided Cold-Start Reshapes Multimodal Reasoning

Multimodal models are often blind at birth: a new "Visual Attention Score" reveals they struggle to focus on visual inputs during cold-start, but a simple attention-guided fix can boost performance by 7%.

Chufan Shi, Yizhen Zhang, Ruizhe Chen +3

Computer Vision Interpretability & Mechanistic Interp Multimodal Models +1

Mar 4, 2026·also DAMO, Google Research, Meta AI

The Ghost in the Datacenter: Link Flapping, Topology Knowledge Failures, and the FITO Category Mistake

Datacenter networks are haunted by "ghosts"—topology knowledge failures due to link flaps that occur every 48 seconds at 2025 cluster scale—and existing mitigations are insufficient, but Open Atomic Ethernet offers a potential exorcism.

Paul Borrill

Distributed Systems & Hardware

Mar 2, 2026

DAMO·Mar 2, 2026

MAC: A Conversion Rate Prediction Benchmark Featuring Labels Under Multiple Attribution Mechanisms

Finally, a CVR prediction dataset with labels from multiple attribution mechanisms, revealing that multi-attribution learning consistently boosts performance, but only with careful architecture and objective selection.

Jinqi Wu

Data Curation & Synthetic Data Eval Frameworks & Benchmarks Recommendation & Information Retrieval

DAMO·Mar 2, 2026·also Cornell, Kuaishou

ClinConsensus: A Consensus-Based Benchmark for Evaluating Chinese Medical LLMs across Difficulty Levels

Despite achieving comparable overall scores, top-performing medical LLMs exhibit surprising differences in reasoning, evidence use, and longitudinal follow-up when evaluated on a new Chinese medical benchmark, revealing critical gaps in clinically actionable treatment planning.

Xiang Zheng, Han Li, Wenjie Luo +5

Data Curation & Synthetic Data Eval Frameworks & Benchmarks Natural Language Processing

Mar 1, 2026

Tsinghua AI·Mar 1, 2026·also DAMO, Fudan, Kuaishou

MoReL: A Generalizable Framework for Dexterous Hand Retargeting via Modular Residual Reinforcement Learning

Achieve dexterous hand retargeting that's both fast and generalizable by decomposing reinforcement learning policies into finger-specific modules coordinated by a residual network.

Zhenghan Wang, Yongkang Luo, Dashun Yan +5

Robotics & Embodied AI Training Efficiency & Optimization

Feb 28, 2026

Tsinghua AI·Feb 28, 2026·also DAMO, CAS, PolyU, RUC +1

Qwen3-Coder-Next Technical Report

An 80B model that runs like a 3B? Qwen3-Coder-Next shows you can get competitive coding agent performance with a fraction of the active parameters, thanks to smart training.

Ruisheng Cao, Mouxiang Chen, Jiawei Chen +17

Code Generation & Program Synthesis Inference & Quantization Tool Use & Agents

Feb 26, 2026

DAMO·Feb 26, 2026·also Tsinghua AI, USTC

UCM: Unifying Camera Control and Memory with Time-aware Positional Encoding Warping for World Models

Achieve both long-term scene consistency and precise camera control in world models with UCM, a novel framework sidestepping explicit 3D reconstruction.

Tianxing Xu, Zixuan Wang, Guangyuan Wang +5

Computer Vision Robotics & Embodied AI World Models & Planning

Feb 26, 2026·also DAMO, NUS, Tsinghua AI, Beihang +6

Obscure but Effective: Classical Chinese Jailbreak Prompt Optimization via Bio-Inspired Search

Classical Chinese, with its conciseness and obscurity, unlocks a surprisingly effective attack vector against LLM safety filters, and can be automatically exploited via bio-inspired optimization.

Xun Huang, Simeng Qin +10

Natural Language Processing Red-Teaming & Adversarial Robustness

Feb 26, 2026·also DAMO, Skylenage

SPM-Bench: Benchmarking Large Language Models for Scanning Probe Microscopy

LLMs still struggle with PhD-level scanning probe microscopy tasks, but SPM-Bench offers a new automated pipeline to generate challenging scientific benchmarks and quantify model "personalities" like "Conservative" or "Gambler."

Peiyao Xiao, Xiaogang Li +12

Eval Frameworks & Benchmarks Multimodal Models Scientific Discovery & Drug Design

DAMO·Feb 26, 2026·also Baidu, CAS, USTC

MobilityBench: A Benchmark for Evaluating Route-Planning Agents in Real-World Mobility Scenarios

LLMs can handle basic route planning, but fall apart when user preferences enter the mix, as shown by a new benchmark based on real-world queries.

Zhiheng Song, Jingshuai Zhang +7

Eval Frameworks & Benchmarks Robotics & Embodied AI Tool Use & Agents

DAMO·Feb 26, 2026·also Bairong Inc., School of Information Science and Technology

FuxiShuffle: An Adaptive and Resilient Shuffle Service for Distributed Data Processing on Alibaba Cloud

Alibaba's FuxiShuffle dynamically adapts to workload and resource fluctuations in ultra-large distributed data processing, slashing job completion times and resource consumption where prior systems falter.

Yuhao Lin, Zhipeng Tang, Jia Tong +10

Distributed Systems & Hardware Training Efficiency & Optimization

DAMO·Feb 26, 2026·also AliExpress, Xiaomi EV

SIGMA: A Semantic-Grounded Instruction-Driven Generative Multi-Task Recommender at AliExpress

Forget interaction-driven next-item prediction: SIGMA uses instruction-following and semantic grounding to create a generative recommender that adapts to evolving trends and diverse tasks on AliExpress.

Yang Yu, Lei Kou +9

Natural Language Processing Recommendation & Information Retrieval Tool Use & Agents

Feb 24, 2026

DAMO·Feb 24, 2026·also ByteDance, Taobao

HiSAC: Hierarchical Sparse Activation Compression for Ultra-long Sequence Modeling in Recommenders

Taobao's recommender system just got a 1.65% CTR boost by compressing ultra-long user behavior sequences with a hierarchical codebook and sparse attention, proving that personalized interest centers can be learned efficiently.

Kun Yuan, Junyu Bi +4

Architecture Design (Transformers, SSMs, MoE) Inference & Quantization Recommendation & Information Retrieval

DAMO·Feb 24, 2026·also ByteDance

Generative Pseudo-Labeling for Pre-Ranking with LLMs

LLMs can generate unbiased pseudo-labels for unexposed items in pre-ranking, boosting click-through rate by 3.07% in production while improving diversity.

Junyu Bi, Xinting Niu +4

Data Curation & Synthetic Data Natural Language Processing Recommendation & Information Retrieval

DAMO·Feb 24, 2026·also SJTU

PRECTR-V2: Unified Relevance-CTR Framework with Cross-User Preference Mining, Exposure Bias Correction, and LLM-Distilled Encoder Optimization

LLM knowledge distillation and cross-user preference mining can significantly boost search relevance and CTR prediction, even for cold-start users.

Shuzhi Cao, Ailong He, Shuguang Han +1

Natural Language Processing Recommendation & Information Retrieval Training Efficiency & Optimization

Feb 23, 2026

Feb 23, 2026·also DAMO, Tsinghua AI, Hunan, National Technology Innovation Center

FuzzySQL: Uncovering Hidden Vulnerabilities in DBMS Special Features with LLM-Driven Fuzzing

LLMs can uncover previously hidden vulnerabilities in database management systems by intelligently fuzzing obscure, system-level features that traditional fuzzers miss.

Yongxin Chen, Zhiyuan Jiang +11

Code Generation & Program Synthesis Natural Language Processing Red-Teaming & Adversarial Robustness

Feb 19, 2026

DAMO·Feb 19, 2026·also Tsinghua AI, Taobao

A Long-term Value Prediction Framework In Video Ranking

Taobao's new LTV ranking framework boosts long-term user engagement by learning nuanced video influence and creator-driven re-engagement, all while fitting within existing industrial constraints.

Huabin Chen, Xinao Wang, Huiping Chu +4

Recommendation & Information Retrieval

Feb 19, 2026·also DAMO, HIT, ZJU

Towards Cross-lingual Values Assessment: A Consensus-Pluralism Perspective

LLMs struggle to understand nuanced values across languages, with accuracy dropping below 77% and varying by over 20% between languages, as revealed by the new X-Value benchmark.

Yukun Chen, Jialong Tang, Yiming Li

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Natural Language Processing

Feb 18, 2026

Feb 18, 2026·also DAMO, Stevens

Why Thinking Hurts? Diagnosing and Rectifying the Reasoning Shift in Foundation Recommender Models

CoT reasoning can hurt recommender performance by drowning out important ID signals – unless you compress reasoning chains and use bias-subtracted contrastive decoding to realign the inference subspace.

Luankang Zhang, Yonghao Huang, Hang Lv +6

Inference & Quantization Reasoning & Chain-of-Thought Recommendation & Information Retrieval

Feb 16, 2026

DAMO·Feb 16, 2026

Structure-Aware Piano Accompaniment via Style Planning and Dataset-Aligned Pattern Retrieval

Achieve diverse and stylistically consistent long-form piano accompaniments by explicitly planning style at the measure level and retrieving suitable patterns from a corpus.

Wanyu Zang, Yang Yu

Architecture Design (Transformers, SSMs, MoE) Recommendation & Information Retrieval Speech & Audio

Feb 16, 2026·also DAMO

BindCLIP: A Unified Contrastive-Generative Representation Learning Framework for Virtual Screening

By unifying contrastive learning with pose-conditioned generative modeling, BindCLIP produces interaction-aware embeddings that substantially improve virtual screening, especially in challenging out-of-distribution scenarios.

Anjie Qiao, Yaliang Li, Jiahua Rao

Multimodal Models Recommendation & Information Retrieval Scientific Discovery & Drug Design

Tsinghua AI·Feb 16, 2026·also DAMO, Beihang, Fudan, NTU +1

Frontier AI Risk Management Framework in Practice: A Risk Analysis Technical Report v1.5

Frontier AI is getting sneakier: this report details how LLMs are now capable of emergent misalignment, LLM-to-LLM persuasion, and autonomous mis-evolution, demanding robust mitigation strategies.

Dongrui Liu, Yi Yu, Jie Zhang +28

Constitutional AI & AI Ethics Red-Teaming & Adversarial Robustness Tool Use & Agents

DAMO·Feb 16, 2026·also Tsinghua AI, School of Information Science and Technology

WebWorld: A Large-Scale World Model for Web Agent Training

Training web agents in a simulator can now match real-world performance: Qwen3-14B, fine-tuned with WebWorld-synthesized trajectories, rivals GPT-4o on WebArena.

Zikai Xiao, Jianhong Tu, Chuhang Zou +2

Data Curation & Synthetic Data Tool Use & Agents World Models & Planning

Feb 15, 2026

DAMO·Feb 15, 2026·also MIT CSAIL, Tsinghua AI, BJTU, Fudan +1

HLE-Verified: A Systematic Verification and Structured Revision of Humanity's Last Exam

LLM benchmark accuracy jumps 10% when evaluated on a cleaned-up version of Humanity's Last Exam, highlighting the significant impact of dataset noise on performance metrics.

Weiqi Zhai, Zhihai Wang +49

Data Curation & Synthetic Data Eval Frameworks & Benchmarks Natural Language Processing

Tsinghua AI·Feb 15, 2026·also DAMO, BUPT, Texas A&M

Mobile-Agent-v3.5: Multi-platform Fundamental GUI Agents

A new family of GUI agents, GUI-Owl-1.5, leapfrogs existing open-source models on 20+ GUI benchmarks, proving that multi-platform, real-time GUI automation is now within reach.

Haiyang Xu, Xi Zhang +32

Eval Frameworks & Benchmarks Open-Source Models & Weights Tool Use & Agents

DAMO·Feb 15, 2026·also CAS

Open Rubric System: Scaling Reinforcement Learning with Pairwise Adaptive Rubric

Ditch the black-box reward function: this new rubric-based RL framework uses LLMs to judge responses against interpretable criteria, offering a more robust and transparent approach to alignment.

Ruipeng Jia, Yunyi Yang, Yuxin Wu +2

Interpretability & Mechanistic Interp RLHF & Preference Learning Scalable Oversight & Alignment Theory

DAMO·Feb 15, 2026

DAIAN: Deep Adaptive Intent-Aware Network for CTR Prediction in Trigger-Induced Recommendation

Overcome "intent myopia" in trigger-based recommendations with DAIAN, a network that adaptively learns user intent from click correlations and hybrid ID/semantic similarity, boosting CTR in e-commerce.

Zhihao Lv, Longtao Zhang, Ailong He +3

Architecture Design (Transformers, SSMs, MoE) Natural Language Processing Recommendation & Information Retrieval

Feb 13, 2026

Feb 13, 2026·also CMU ML, DAMO, PKU, USTC +2

RynnBrain: Open Embodied Foundation Models

RynnBrain leapfrogs existing embodied foundation models, offering a unified, open-source spatiotemporal model that excels at physically grounded reasoning and planning across a wide range of benchmarks.

Ronghao Dang, Jiayan Guo +29

Multimodal Models Robotics & Embodied AI World Models & Planning

Feb 12, 2026

Feb 12, 2026·also DAMO, Chongqing Ant Consumer Finance Co., NTU

SIGHT: Reinforcement Learning with Self-Evidence and Information-Gain Diverse Branching for Search Agent

LLMs can overcome "tunnel vision" in multi-turn search scenarios by using information gain to guide dynamic prompting interventions, leading to more efficient and accurate reasoning.

Jinluan Yang, Yiquan Wu, Yi Liu +2

Reasoning & Chain-of-Thought Recommendation & Information Retrieval Tool Use & Agents

DAMO·Feb 12, 2026·also BAAI

IntTravel: A Real-World Dataset and Generative Framework for Integrated Multi-Task Travel Recommendation

Longfei Xu, Zheng Liu, Xiangxiang Chu

Data Curation & Synthetic Data Recommendation & Information Retrieval

Feb 1, 2026

Thai Nguyen University of Information·Feb 1, 2026·also DAMO, Google Research, Meta AI, Thai Nguyen University

Parameter-efficient fine-tuning of small language models for code generation: a comparative study of Gemma, Qwen 2.5 and Llama 3.2

Forget huge models: parameter-efficient fine-tuning turns tiny language models into code-generating powerhouses that outperform larger, untuned counterparts.

Van-Viet Nguyen, The-Vinh Nguyen, Huu-Khanh Nguyen +1

Jan 4, 2026

DAMO·Jan 4, 2026·also Fudan, Tongji

Logics-STEM: Empowering LLM Reasoning via Failure-Driven Post-Training and Document Knowledge Enhancement

Failure-driven post-training, combined with a meticulously curated 10M token STEM dataset, unlocks a 4.68% performance boost in LLM reasoning, proving that strategic data synthesis around model weaknesses is a powerful path to improvement.

Mingyu Xu, Cheng Fang, Keyue Jiang +16

Nov 27, 2025

Nov 27, 2025·also AI2, DAMO, Google Research, Meta AI +2

Evaluating the Robustness of Large Language Model Safety Guardrails Against Adversarial Attacks

LLM safety guardrails are far less robust than benchmarks suggest, with accuracy dropping by as much as 57% on novel adversarial attacks, and some even generating harmful content in a "helpful mode" jailbreak.

Richard J. Young

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

Oct 30, 2025

Tsinghua AI·Oct 30, 2025·also DAMO, CAS, Shenzhen University of Advanced Technology

ToolRM: Towards Agentic Tool-Use Reward Modeling

ToolRMs drastically improve tool-use accuracy in LLMs, outperforming existing models by up to 17.94%, while also reducing output token usage by over 66% through efficient inference-time scaling.

Renhao Li, Jianhong Tu, Yang Su +6

RLHF & Preference Learning Tool Use & Agents