Soochow University
Unlike traditional recommender systems, generative recommendation models such as OneRec-V2 can achieve near-lossless FP8 quantization, unlocking significant latency and throughput gains.
Forget content, remember position: crafting pseudo-queries based on token position alone yields surprisingly effective KV cache compression for LLMs, rivaling methods that analyze input semantics.
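As a toy illustration of the position-only idea (the scoring profile below is an assumption for the sketch, not the paper's actual method), cached KV entries can be selected purely by where they sit in the sequence, e.g. keeping a few early "sink" tokens plus a recent window:

```python
import numpy as np

def position_only_keep_mask(num_cached, num_sink=4, recent_window=64):
    """Select cached KV entries by position alone: keep the first
    few 'sink' tokens plus the most recent window of tokens.
    A hypothetical position-only profile for illustration only."""
    mask = np.zeros(num_cached, dtype=bool)
    mask[:num_sink] = True          # early tokens often attract attention
    mask[-recent_window:] = True    # recent context stays in the cache
    return mask

mask = position_only_keep_mask(256, num_sink=4, recent_window=64)
print(mask.sum())  # 68 entries kept out of 256
```

Because the mask depends only on indices, it can be computed once per cache size with no inspection of token content.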
Achieve 11.8x faster reasoning with 80% KV cache compression by estimating token importance directly from FlashAttention's intermediate results, with no extra compute needed.
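A minimal sketch of importance-based KV eviction, assuming a per-token importance score is already available as a byproduct of the attention computation (the function name, shapes, and random scores here are illustrative, not the paper's API):

```python
import numpy as np

def evict_kv(keys, values, importance, keep_ratio=0.2):
    """Keep the top keep_ratio fraction of cached KV entries,
    ranked by a precomputed per-token importance score."""
    n = keys.shape[0]
    k = max(1, int(n * keep_ratio))
    # top-k indices by importance, re-sorted to preserve token order
    keep = np.sort(np.argsort(importance)[-k:])
    return keys[keep], values[keep]

rng = np.random.default_rng(0)
K = rng.standard_normal((100, 64))   # 100 cached tokens, head dim 64
V = rng.standard_normal((100, 64))
imp = rng.random(100)                # stand-in importance scores
K2, V2 = evict_kv(K, V, imp, keep_ratio=0.2)
print(K2.shape)  # (20, 64)
```

With keep_ratio=0.2 this drops 80% of the cache; the compression itself is a cheap top-k selection, since the importance scores are reused rather than recomputed.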