School of Computer Science, Wuhan University, Wuhan, China
LLM inference gets a 2x speed boost without training, thanks to a clever technique that merges retrieval with logit-based speculation.
Structural, numerical, and algebraic redundancy is analyzed across pruning, quantization, and low-rank decomposition techniques, enabling a criticality-aware compression framework that achieves near-lossless compression to 10% of the original size.