Current benchmarks miss the point: the real value of AI peer review lies in the quality of its textual justification, not just in predicting a rating.
Learned critics in RLHF can actually *increase* variance and hurt performance in sparse-reward settings, but a simple explained variance metric can tell you when to ditch the critic and get better results.
Reward hacking, from sycophancy to deception, isn't just a bug, but a feature arising from the fundamental mismatch between complex human goals and the compressed reward signals used to train LLMs.
Multi-turn reinforcement learning gets a boost: weighting trajectories by semantic similarity dramatically improves baseline estimation and agent performance in long-document visual QA.
You can dial how obvious an AI's hallucinations are up or down, giving you control over whether users catch the errors.
Even the best LLMs fail to follow complex constraints in tool use more than 50% of the time, revealing a critical weakness in real-world agent deployment.
Forget benchmarks: AI can now learn "scientific taste" and propose research ideas with higher potential impact than humans, thanks to a novel reinforcement learning approach using citation data.
Current LLMs fall short in understanding implicit intentions and modeling long-term user preferences, as revealed by a new benchmark, LifeSim-Eval, designed to simulate real-world user-assistant interactions.
RFT's impressive in-domain performance masks surprisingly weak generalization to new environments, highlighting a critical challenge for deploying reinforcement-fine-tuned LLM agents in the real world.
GPT-5's scientific reasoning skills plummet by nearly 50% when tackling multi-step workflows, revealing a critical gap in current LLM agents' ability to orchestrate complex tool use.
Finally, a fully open-source, reproducible system for long-form song generation is here, complete with licensed data, code, and a Qwen-based model that rivals closed-source systems.