Kai Yu

X-LANCE Lab, School of Computer Science, Shanghai Jiao Tong University, China

Papers on Lattice

Total citations

Topics

h-index

Publication activity (papers/week, last 8 weeks)

Research focus

Speech & Audio (4) · Natural Language Processing (3) · Architecture Design (Transformers, SSMs, MoE) (2) · Multimodal Models (2)

Frequent co-authors

Jing Peng (2) · Bohan Li (1) · Shiyue Lian (1) · Yiwei Guo (1)

Papers (5)

May 28, 2026


HoliTok: A Continuous Holistic Tokenization with Robust Dual Capabilities of Speech Generation and Understanding

Finally, a speech tokenizer that doesn't require extra optimization tricks to work robustly for both generation and understanding tasks in a unified architecture.

Bohan Li, Shiyue Lian, Yiwei Guo +5

Architecture Design (Transformers, SSMs, MoE) · Speech & Audio

May 6, 2026

Yuanzhi Wang +9 · May 6, 2026 · also SJTU

FaithfulFaces: Pose-Faithful Facial Identity Preservation for Text-to-Video Generation

Identity-preserving video generation just got a whole lot more faithful: FaithfulFaces maintains identity even under extreme pose variations and occlusions, a feat previous methods struggled with.

Yuanzhi Wang, Xuhua Ren, Jiaxiang Cheng +7

Computer Vision · Multimodal Models · Natural Language Processing

Apr 29, 2026

Apr 29, 2026 · also SJTU, Soul AI Lab

Dual-LoRA: Parameter-Efficient Adversarial Disentanglement for Cross-Lingual Speaker Verification

Adversarial training doesn't have to hurt speaker verification: by explicitly modeling language, you can disentangle speaker and language characteristics without sacrificing speaker discriminability.

Qituan Shangguan, Junhao Du, Kunyang Peng +3

Natural Language Processing · Red-Teaming & Adversarial Robustness · Speech & Audio

Apr 9, 2026

Jing Peng +10 · Apr 9, 2026 · also AI Lab, SJTU, XJTU

TASU2: Controllable CTC Simulation for Alignment and Low-Resource Adaptation of Speech LLMs

Forget expensive audio-text data collection: TASU2 lets you dial in the perfect amount of noise for training your speech LLM, all from text.

Jing Peng, Jing Peng, Chenghao Wang +8

Data Curation & Synthetic Data · Multimodal Models · Speech & Audio

Mar 11, 2026

Jing Peng +9 · Mar 11, 2026 · also SJTU

G-STAR: End-to-End Global Speaker-Tracking Attributed Recognition

G-STAR tackles long-form, multi-speaker ASR by giving Speech-LLMs time-aware speaker tracking, enabling robust identity linking across chunks.