Search papers, labs, and topics across Lattice.
K [16]. Following DeiT [72], we develop three variants of BinaryAttention, namely -T (tiny), -S (small) and -B (base), by substituting all standard attention modules with BinaryAttention. We follow the experimental settings in DeiT [72], which are detailed in supplementary file. The models are fine-tuned with the self-distillation [34] strategy, where the full-precision counterparts serve as the teacher. We compare with quantization based methods PTQ, 脳\times and 1.
2
0
5
BinaryAttention proves you can more than halve the runtime of attention in vision and diffusion transformers without sacrificing accuracy, simply by using the sign of queries and keys.
Train a competitive 2B MoE LLM on 16 commodity GPUs connected via the internet, proving you don't need a datacenter to play the LLM game.