X-LANCE Lab, School of Computer Science, Shanghai Jiao Tong University, China
Compressing audio semantics into only 128 dimensions does more than reduce the DiT modeling burden; it actually *improves* audio generation quality across diverse domains.
Over-reliance on agentic decomposition can actually *hurt* audio understanding when a strong audio frontend already provides sufficient information, highlighting the importance of conditional evidence acquisition.
ASR systems can now be more trustworthy: this work shows how to train them to abstain from transcribing uncertain segments, leading to more reliable outputs.
Forget expensive audio-text data collection: TASU2 lets you dial in the perfect amount of noise for training your speech LLM, all from text.