Tsinghua AI
Tsinghua University's AI research group. Leading Chinese institution in NLP, knowledge graphs, and large language models.
ml.cs.tsinghua.edu.cn
Recent Papers
FunReason-MT is presented, a novel data synthesis framework for real-world multi-turn tool use that resolves the complexity barrier in multi-turn function-calling (FC) data by employing 1) Environment-API Graph Interactions to gather varied high-quality trajectories, 2) Advanced Tool-Query Synthesis to simplify hard query construction, and 3) Guided Iterative Chain for sophisticated CoT generation.
The paper introduces "analytical search" as a new search paradigm tailored for complex analytical information needs, addressing the limitations of relevance-based ranking and retrieval-augmented generation (RAG) in tasks requiring trend analysis, causal inference, and verifiable conclusions. It proposes a system framework that integrates query understanding, recall-oriented retrieval, reasoning-aware fusion, and adaptive verification to support structured, multi-step inference. The authors argue that analytical search offers improved control over reasoning, evidence usage, and verifiability, leading to more accountable and utility-driven results compared to existing search paradigms.
Introduces and formalizes the concept of "analytical search" as a distinct search paradigm designed to address complex analytical information needs by emphasizing evidence-governed, process-oriented workflows.
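The summary above describes a four-stage workflow (query understanding, recall-oriented retrieval, reasoning-aware fusion, adaptive verification). Below is a minimal sketch of how such a pipeline could be wired together; every class, function name, and heuristic here is an illustrative assumption, not the paper's actual interface.

```python
# Illustrative sketch of an "analytical search" pipeline; all names are hypothetical.
from dataclasses import dataclass, field


@dataclass
class AnalyticalResult:
    claim: str
    evidence: list = field(default_factory=list)
    verified: bool = False


def understand_query(query: str) -> list[str]:
    """Decompose an analytical question into sub-questions (stub)."""
    return [query]  # a real system would produce multiple facets


def recall_oriented_retrieval(sub_question: str, corpus: list[str]) -> list[str]:
    """Favor recall over precision: return every document that might matter."""
    tokens = sub_question.lower().split()
    return [doc for doc in corpus if any(tok in doc.lower() for tok in tokens)]


def reasoning_aware_fusion(sub_question: str, docs: list[str]) -> AnalyticalResult:
    """Fuse retrieved evidence into an intermediate claim (stub for an LLM call)."""
    return AnalyticalResult(claim=f"Claim for: {sub_question}", evidence=docs)


def adaptive_verification(result: AnalyticalResult) -> AnalyticalResult:
    """Accept a claim only if it is backed by at least one piece of evidence."""
    result.verified = len(result.evidence) > 0
    return result


def analytical_search(query: str, corpus: list[str]) -> list[AnalyticalResult]:
    results = []
    for sub_q in understand_query(query):
        docs = recall_oriented_retrieval(sub_q, corpus)
        results.append(adaptive_verification(reasoning_aware_fusion(sub_q, docs)))
    return results
```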
The paper introduces PatientHub, a unified framework to standardize the creation, composition, and deployment of simulated patients for training counselors and scaling therapeutic assessment using Large Language Models. PatientHub addresses the fragmentation in existing patient simulation approaches by providing standardized data formats, prompts, and evaluation metrics, thus improving reproducibility and enabling fair comparisons. The authors demonstrate PatientHub's utility through case studies, showcasing standardized cross-method evaluation, seamless integration of custom evaluation metrics, and the prototyping of new simulator variants.
Introduces PatientHub, a modular framework that unifies patient simulation by standardizing data formats, prompts, and evaluation metrics to facilitate reproducibility and fair comparison of different methods.
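To make the standardization idea above concrete, here is a minimal sketch of what a shared simulated-patient record and a pluggable metric registry could look like; the field names and registry design are assumptions for illustration, not PatientHub's actual schema or API.

```python
# Hypothetical standardized patient record plus a metric registry so that
# different simulators can be evaluated under one harness.
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class PatientProfile:
    patient_id: str
    presenting_problem: str
    history: list[str] = field(default_factory=list)
    system_prompt: str = ""  # standardized prompt fed to the simulator LLM


METRICS: dict[str, Callable[[list[str]], float]] = {}


def register_metric(name: str):
    """Register a custom evaluation metric for cross-method comparison."""
    def decorator(fn):
        METRICS[name] = fn
        return fn
    return decorator


@register_metric("turn_count")
def turn_count(dialogue: list[str]) -> float:
    return float(len(dialogue))
```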
The paper identifies limitations in current Vision-Language-Action (VLA) models stemming from inadequate visual representations learned through language-image contrastive learning or image-based self-supervised learning. It proposes JEPA-VLA, a method that integrates video predictive embeddings (specifically V-JEPA 2) into VLAs to improve environment understanding and policy priors. Experiments on benchmarks like LIBERO and real-robot tasks demonstrate that JEPA-VLA significantly improves performance by leveraging the ability of video predictive embeddings to encode task-relevant temporal dynamics.
Introduces JEPA-VLA, a novel approach that adaptively integrates video predictive embeddings into existing VLAs to enhance environment understanding and policy priors.
The paper introduces MiniCPM-SALA, a 9B-parameter hybrid architecture that combines sparse attention (InfLLM-V2) and linear attention (Lightning Attention) to improve long-context modeling efficiency. A layer selection algorithm interleaves the two attention mechanisms in a 1:3 ratio and pairs them with a hybrid positional encoding (HyPE) to preserve performance while improving efficiency. The paper also presents a cost-effective continual training framework that converts pre-trained Transformer models into hybrid models, cutting training costs by 75%; the resulting model achieves 3.5x faster inference at a 256K sequence length and supports context lengths up to 1M tokens on a single NVIDIA A6000D GPU.
Introduces a hybrid sparse and linear attention architecture, MiniCPM-SALA, that achieves efficient long-context modeling with minimal performance degradation compared to full-attention models.
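As a small illustration of the 1:3 sparse-to-linear mixing described above, the sketch below builds a layer schedule with a simple round-robin rule; the actual paper uses a dedicated layer selection algorithm, so this rule is purely an assumption.

```python
# Sketch: assemble a layer schedule mixing sparse and linear attention 1:3.
def build_layer_schedule(num_layers: int, sparse_every: int = 4) -> list[str]:
    """Every `sparse_every`-th layer uses sparse attention; the rest use linear attention."""
    return [
        "sparse_attention" if i % sparse_every == 0 else "linear_attention"
        for i in range(num_layers)
    ]


schedule = build_layer_schedule(32)
assert schedule.count("sparse_attention") * 3 == schedule.count("linear_attention")
print(schedule[:8])
```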
The paper introduces LAVES, a hierarchical LLM-based multi-agent system that generates high-quality instructional videos from educational problems by decomposing the generation workflow into specialized agents for problem-solving, visualization, and narration. LAVES addresses the limitations of end-to-end video generation models in scenarios that demand logical rigor and precise knowledge representation. By constructing a structured, executable video script that is compiled into synchronized visuals and narration, the system achieves a throughput of over one million videos per day at a 95% cost reduction relative to industry standards while maintaining a high acceptance rate.
Introduces a hierarchical LLM-based multi-agent system (LAVES) that decomposes educational video generation into specialized agents, enabling automated end-to-end production with high throughput and cost efficiency.
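The "structured executable video script" mentioned above can be pictured as a list of segments, each pairing a visual directive with narration and a duration; the schema below is an illustrative assumption, not LAVES's actual format.

```python
# Hypothetical script segment format and a compiler into a synchronized timeline.
from dataclasses import dataclass


@dataclass
class Segment:
    visual: str     # e.g. a plotting/animation command from the visualization agent
    narration: str  # text emitted by the narration agent
    seconds: float


def compile_script(segments: list[Segment]) -> list[tuple[float, str, str]]:
    """Turn segments into a (start_time, visual, narration) timeline so visuals
    and narration stay synchronized."""
    timeline, t = [], 0.0
    for seg in segments:
        timeline.append((t, seg.visual, seg.narration))
        t += seg.seconds
    return timeline
```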
The paper introduces NarraScore, a hierarchical framework for generating soundtracks for long-form videos by leveraging emotion as a compressed representation of narrative logic. It uses frozen Vision-Language Models (VLMs) to extract Valence-Arousal trajectories from video and employs a Dual-Branch Injection strategy, consisting of a Global Semantic Anchor and a Token-Level Affective Adapter, to control musical dynamics. Experiments show that NarraScore achieves state-of-the-art consistency and narrative alignment with minimal computational cost.
Introduces a hierarchical framework, NarraScore, that leverages VLMs and a dual-branch injection strategy to generate narrative-aligned soundtracks for long-form videos.
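The two control signals described above (a global semantic anchor and a token-level affective signal) can be sketched as simple pooling and smoothing over frame-wise valence-arousal estimates; the smoothing window and pooling choices below are assumptions, not NarraScore's actual operators.

```python
# Sketch: derive a global anchor and a per-frame trajectory from VA estimates.
def moving_average(xs: list[float], window: int = 5) -> list[float]:
    half = window // 2
    out = []
    for i in range(len(xs)):
        chunk = xs[max(0, i - half): i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def affective_controls(valence: list[float], arousal: list[float]):
    """Return (global_anchor, per_frame_trajectory) from frame-wise VA estimates."""
    anchor = (sum(valence) / len(valence), sum(arousal) / len(arousal))
    trajectory = list(zip(moving_average(valence), moving_average(arousal)))
    return anchor, trajectory
```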
The paper introduces EvoMDT, a self-evolving multi-agent system designed to improve structured clinical decision-making in multi-cancer multidisciplinary tumor boards (MDTs). EvoMDT uses a self-evolution loop to dynamically update prompts, consensus weights, and retrieval scope based on expert feedback and outcome signals, enhancing robustness and traceability. Evaluated on oncology QA benchmarks and real-world datasets, EvoMDT outperformed LLM baselines, achieving higher guideline concordance, semantic alignment with expert plans, and comparable decision quality to human MDTs with reduced response time.
Introduces a self-evolving multi-agent system, EvoMDT, that adaptively refines its decision-making process for cancer treatment recommendations based on expert feedback and outcome signals.
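One piece of the self-evolution loop described above is updating consensus weights from expert feedback; the multiplicative update below is a minimal sketch under that reading, not EvoMDT's actual rule.

```python
# Sketch: nudge consensus weights toward agents whose proposals matched expert feedback.
def evolve_weights(weights: dict[str, float],
                   agreed_with_expert: dict[str, bool],
                   lr: float = 0.1) -> dict[str, float]:
    updated = {
        agent: w * (1 + lr) if agreed_with_expert.get(agent, False) else w * (1 - lr)
        for agent, w in weights.items()
    }
    total = sum(updated.values())
    return {agent: w / total for agent, w in updated.items()}  # renormalize to sum to 1


weights = {"radiology": 0.34, "pathology": 0.33, "oncology": 0.33}
weights = evolve_weights(weights, {"radiology": True, "pathology": False, "oncology": True})
```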
The paper introduces GPT-5, a unified system comprising a fast, general-purpose model and a deeper reasoning model, managed by a real-time router trained on user feedback and performance metrics. GPT-5 demonstrates improved performance on benchmarks, faster response times, and enhanced utility for real-world queries, with significant reductions in hallucinations, improved instruction following, and minimized sycophancy. The system incorporates "safe-completions" for safety and is treated as High capability in the Biological and Chemical domain under OpenAI's Preparedness Framework, triggering associated safeguards.
Introduces a unified GPT-5 system with a real-time router that dynamically selects between a fast, general-purpose model and a deeper reasoning model based on query characteristics, optimizing for speed and accuracy.
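The routing idea above can be illustrated with a toy dispatcher that sends a query to either a fast model or a reasoning model; the keyword heuristic and threshold below are purely illustrative, since the actual router is trained on user feedback and measured performance rather than hand-written rules.

```python
# Toy sketch of query routing between a fast model and a deeper reasoning model.
def route(query: str, reasoning_keywords=("prove", "step by step", "debug", "analyze")) -> str:
    hard = len(query.split()) > 80 or any(k in query.lower() for k in reasoning_keywords)
    return "reasoning_model" if hard else "fast_model"


print(route("What's the capital of France?"))                                   # fast_model
print(route("Prove that the sum of two even numbers is even, step by step."))  # reasoning_model
```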
The International AI Safety Report 2025's Second Key Update analyzes the current state of AI risk management and technical mitigations employed by researchers, companies, and governments. It highlights advancements in training safer models and monitoring outputs while acknowledging uncertainties in the effectiveness of these measures and their variability across applications. The report aims to inform policymakers, researchers, and the public about progress and remaining gaps in AI safety.
Synthesizes recent developments in AI risk management and technical risk mitigation strategies, identifying both progress and persistent gaps in ensuring the safety of general-purpose AI systems.
The paper introduces CycleChemist, a dual-pronged machine learning framework for organic photovoltaic (OPV) material discovery that addresses the limitation of existing methods focusing on either donor or acceptor materials in isolation. The authors curate the Organic Photovoltaic Donor Acceptor Dataset (OPV2D), containing 2,000 experimentally characterized donor-acceptor pairs, and develop a hierarchical graph neural network (OPVC) that predicts OPV behavior with multi-task learning and explicit donor-acceptor interaction modeling. The framework also includes MatGPT, a generative transformer that produces synthetically accessible organic semiconductors, guided by reinforcement learning to optimize material properties.
Introduces CycleChemist, a novel dual machine learning framework that integrates predictive modeling with generative molecular design for the data-driven discovery of high-performance OPV materials.
The paper introduces a Dual Loop Data Cleaning (DLDC) method to automatically generate high-quality remote sensing image-text training data by leveraging contrastive multimodal quality evaluations. DLDC uses an external generation loop (EGL) based on a multimodal foundation model for layout description and an internal evaluation loop (IEL) based on contrastive learning metrics to assess image-text matching. Fine-tuning T2I models with the cleaned dataset yields significant improvements in image generation quality, evidenced by substantial reductions in FID, increases in CLIP and RemoteCLIP scores, and improved downstream segmentation performance.
Introduces a dual-loop data cleaning method (DLDC) that automatically generates high-quality remote sensing image-text training data, eliminating the need for manual annotation.
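The internal evaluation loop above amounts to keeping an image-text pair only when a contrastive matching score clears a threshold; the sketch below shows that filtering step generically, with the scoring function (e.g. a CLIP or RemoteCLIP cosine similarity) passed in and the threshold chosen arbitrarily.

```python
# Sketch of contrastive-score-based filtering of image-text training pairs.
from typing import Callable


def clean_pairs(pairs: list[tuple[str, str]],
                score_fn: Callable[[str, str], float],
                threshold: float = 0.25) -> list[tuple[str, str]]:
    """Return the (image_path, caption) pairs whose image-text match score passes the threshold."""
    return [(img, cap) for img, cap in pairs if score_fn(img, cap) >= threshold]
```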
The paper introduces FunReason-MT, a novel data synthesis framework designed to generate high-quality, multi-turn training data for function calling in large language models, addressing limitations in existing methods like random sampling and multi-agent role-playing. FunReason-MT employs Environment-API Graph Interactions, Advanced Tool-Query Synthesis, and Guided Iterative Chain to overcome challenges in targeted data synthesis, hard query construction, and multi-turn logical dependency. Experiments on BFCLv3 and BFCLv4 show that models trained on FunReason-MT data achieve state-of-the-art performance among comparable-sized models, demonstrating the framework's effectiveness in agentic learning.
Introduces FunReason-MT, a data synthesis framework that generates high-quality, multi-turn function calling data by integrating Environment-API Graph Interactions, Advanced Tool-Query Synthesis, and Guided Iterative Chain.
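Environment-API graph interaction, as described above, can be pictured as walking a directed graph whose edges mean "the output of API A can feed API B" to obtain a multi-turn trajectory; the graph and walk policy below are illustrative assumptions, not FunReason-MT's actual procedure.

```python
# Sketch: sample a multi-turn tool-use trajectory by walking an API dependency graph.
import random


def sample_trajectory(api_graph: dict[str, list[str]], start: str, max_turns: int = 4) -> list[str]:
    trajectory, current = [start], start
    for _ in range(max_turns - 1):
        successors = api_graph.get(current, [])
        if not successors:
            break
        current = random.choice(successors)
        trajectory.append(current)
    return trajectory


api_graph = {"search_flights": ["book_flight"], "book_flight": ["send_confirmation"]}
print(sample_trajectory(api_graph, "search_flights"))
```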
The paper introduces SAMR, a Spatial-Augmented Mixed Reality method, to improve Vision-Language Model (VLM) performance in 3D scene understanding within mixed reality environments. SAMR uses FastSAM-based segmentation to generate object-level meshes from HMD images, maps feature points to 3D coordinates via ray casting and triangular facet fitting, and integrates multimodal interactions (gestures, gaze, voice) for prompt annotation. Experiments across six application scenarios (object identification, relationship analysis, etc.) demonstrate SAMR's effectiveness in enhancing VLMs for 3D scene interpretation.
Introduces SAMR, a novel spatial-augmented mixed reality method, that enhances VLMs by incorporating spatial context and multimodal interaction for improved 3D scene understanding.
The paper introduces dVLA, a diffusion-based Vision-Language-Action model that unifies visual perception, language reasoning, and robotic control under a single diffusion objective. dVLA incorporates a multimodal chain-of-thought to improve cross-modal reasoning and generalization. The model achieves state-of-the-art performance on the LIBERO benchmark (96.4% success rate) and demonstrates robust real-world performance on a Franka robot, including a challenging bin-picking task.
Introduces dVLA, a diffusion-based VLA model that unifies perception, language, and action with a multimodal chain-of-thought, achieving strong performance and generalization in robotic tasks.
The paper introduces Structured-MoE STL Planner (S-MSP), a differentiable framework for end-to-end task and motion planning from multi-view camera observations and STL specifications. S-MSP integrates STL constraints directly into the training loop using a composite loss function that combines trajectory reconstruction and STL robustness. The core innovation is a structure-aware Mixture-of-Experts (MoE) model that enables horizon-aware specialization by projecting sub-tasks into temporally anchored embeddings, leading to improved STL satisfaction and trajectory feasibility in factory-logistics scenarios.
Introduces a differentiable end-to-end framework, S-MSP, that directly maps multi-view camera observations and STL specifications to feasible trajectories using a structure-aware Mixture-of-Experts model.
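The composite loss described above combines trajectory reconstruction with STL robustness; the sketch below shows that combination for a simple "always stay above a bound" specification, where robustness is the worst-case margin over the horizon. The penalty form and the weight lambda_stl are assumptions for illustration.

```python
# Sketch of a reconstruction + STL-robustness composite loss for G(x >= bound).
def always_above(trajectory: list[float], bound: float) -> float:
    """STL robustness of G(x >= bound): the worst-case margin over the horizon."""
    return min(x - bound for x in trajectory)


def composite_loss(pred: list[float], target: list[float],
                   bound: float, lambda_stl: float = 1.0) -> float:
    reconstruction = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    stl_penalty = max(0.0, -always_above(pred, bound))  # zero when the spec is satisfied
    return reconstruction + lambda_stl * stl_penalty
```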
The paper introduces Constructive Safety Alignment (CSA), a paradigm shift from refusal-based safety mechanisms in LLMs to a more human-centric approach that actively guides vulnerable users towards safe and helpful outcomes. CSA incorporates game-theoretic anticipation of user reactions, fine-grained risk boundary discovery, and interpretable reasoning control. The implementation, Oyster-I (Oy1), demonstrates state-of-the-art safety among open models while maintaining high general capabilities, exhibiting strong constructive engagement and robustness against jailbreaks.
This paper pioneers Constructive Safety Alignment (CSA), a novel paradigm that transforms LLM safety from reactive refusal to proactive guidance, specifically addressing the needs of vulnerable users.
This paper introduces an automated recommendation system for Sea-River-Inland Waterway Intermodal Transport (SRIIT) that optimizes dry bulk cargo transport by considering port capacity, path throughput, and time windows. A dual-mode scheduling algorithm (fair/priority-based) resolves resource competition using throughput-aware simulation and daily resource recovery. The system, validated on Yancheng's network, demonstrates reduced transport time, prioritized delivery during congestion, and customizable cost-time-carbon balancing.
Introduces a weight-driven optimization framework for intermodal transport that balances cost, time, and carbon emissions based on shipper preferences.
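The customizable cost-time-carbon balancing mentioned above can be illustrated as scoring each candidate route with a shipper-chosen weighted sum of normalized cost, transport time, and emissions; the normalization scales, weights, and example numbers below are assumptions, not values from the paper.

```python
# Sketch: weight-driven scoring of candidate intermodal routes.
def route_score(cost: float, time_h: float, carbon_kg: float,
                weights=(0.4, 0.4, 0.2),
                scales=(1000.0, 48.0, 500.0)) -> float:
    w_cost, w_time, w_carbon = weights
    s_cost, s_time, s_carbon = scales
    return w_cost * cost / s_cost + w_time * time_h / s_time + w_carbon * carbon_kg / s_carbon


candidates = {"river_then_sea": (820.0, 36.0, 310.0), "direct_inland": (990.0, 20.0, 420.0)}
best = min(candidates, key=lambda name: route_score(*candidates[name]))
print(best)
```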
This paper provides a theoretical analysis of the performance differences between RLHF and DPO, decomposing the gap into explicit (optimization) and implicit (finite sample) representation gaps. The analysis characterizes how the relative capacities of reward and policy model classes impact policy quality under model misspecification, revealing scenarios where RLHF, DPO, or online DPO can outperform each other. Furthermore, the paper demonstrates a statistical advantage for RLHF in settings with implicitly sparse ground-truth rewards, requiring fewer samples to learn an effective reward model.
Decomposes the performance gap between RLHF and DPO into explicit and implicit representation gaps, providing a nuanced understanding of their relative strengths and weaknesses under varying model misspecifications and sample complexities.
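For reference, the DPO objective whose implicit reward representation the paper analyzes has the standard textbook form shown below; this is a generic sketch (using PyTorch), not code from the paper.

```python
# Standard DPO loss: -log sigmoid(beta * ((log pi(y_w) - log pi_ref(y_w))
#                                        - (log pi(y_l) - log pi_ref(y_l))))
import torch
import torch.nn.functional as F


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta: float = 0.1):
    """All inputs are per-example sequence log-probabilities (1-D tensors)."""
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```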
InternVL3 is a new multimodal model trained from scratch using a native multimodal pre-training paradigm, jointly learning from multimodal and text data, thus avoiding the alignment issues of adapting text-only LLMs. The model incorporates variable visual position encoding (V2PE) for longer contexts and uses post-training techniques like SFT and MPO, along with test-time scaling. InternVL3-78B achieves state-of-the-art performance among open-source MLLMs, scoring 72.2 on MMMU and rivaling proprietary models while maintaining strong language proficiency.
Introduces a native multimodal pre-training paradigm that jointly learns multimodal and linguistic capabilities from scratch, eliminating the need to adapt text-only LLMs for multimodal tasks.
The paper introduces Adaptive Gradient-Masked Reinforcement (AGMR) Attack, a novel white-box adversarial attack method designed to effectively target deep reinforcement learning (DRL) agents in robotic control by addressing the limitations of existing supervised learning-based attacks. AGMR uses a gradient-based soft masking mechanism to dynamically identify and selectively perturb critical state dimensions, optimizing the adversarial policy for maximum impact on long-term rewards. Experiments demonstrate that AGMR significantly outperforms existing adversarial attack methods in reducing victim agent performance and improving robustness through adversarial training.
Introduces a novel white-box adversarial attack, AGMR, that selectively perturbs critical state dimensions in DRL agents using a gradient-based soft masking mechanism to maximize impact on long-term rewards.
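A gradient-based soft masking step of the kind described above can be sketched as follows: the gradient magnitude of the victim's action-value with respect to the state is turned into a soft mask, and the perturbation is concentrated on the dimensions the mask weights most. The temperature, budget, and function signature are illustrative assumptions, not AGMR's actual hyperparameters or interface.

```python
# Sketch of gradient-masked state perturbation (PyTorch).
import torch


def masked_perturbation(q_value_fn, state: torch.Tensor,
                        epsilon: float = 0.05, temperature: float = 0.5) -> torch.Tensor:
    state = state.clone().requires_grad_(True)
    q_value_fn(state).backward()                           # scalar value from the victim's critic
    saliency = state.grad.abs()
    mask = torch.softmax(saliency / temperature, dim=-1)   # soft mask over state dimensions
    delta = -epsilon * mask * state.grad.sign()            # push the value down, mostly on masked dims
    return (state + delta).detach()
```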
The paper introduces Confidence-Reward driven Preference Optimization (CRPO), a novel preference optimization method for machine translation that addresses the challenge of low-quality preference data in Direct Preference Optimization (DPO). CRPO enhances data selection by incorporating model confidence alongside reward scores, focusing on challenging sentence pairs where the model exhibits uncertainty or poor performance. Experiments demonstrate that CRPO outperforms existing preference optimization techniques, including RS-DPO, RSO, and MBR score, in both translation accuracy and data efficiency across LLMs and encoder-decoder models like NLLB.
Introduces a confidence-reward driven preference optimization (CRPO) method that improves the quality of preference data used in direct preference optimization by selecting challenging sentence pairs based on model uncertainty and reward scores.
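The data-selection idea above (keep pairs where the model is uncertain or performs poorly) can be sketched as filtering on the model's log-probability margin between the higher- and lower-reward translations; the dictionary keys and threshold below are assumptions for illustration, not CRPO's actual selection rule.

```python
# Sketch: keep preference pairs where the model's margin between the better
# and worse translation is small (uncertainty) or negative (poor performance).
def select_pairs(pairs: list[dict], margin_threshold: float = 0.5) -> list[dict]:
    """pairs: dicts with 'logp_better' and 'logp_worse', the model's sequence
    log-probabilities for the higher- and lower-reward translations."""
    return [p for p in pairs if p["logp_better"] - p["logp_worse"] < margin_threshold]
```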

