This paper explores the impact of FP8 and INT8 quantization on the VGG16 model's performance, memory footprint, and inference time compared to FP32, using post-training quantization and layer fusion to minimize quantization overhead. The authors optimize the quantized model for GPU architectures, leveraging parallelization to address the computational demands. Experiments on CIFAR-10 with an NVIDIA RTX 4090 show that FP8 quantization achieves a 40% speed-up and reduces memory usage by 32% compared to FP32, while INT8's performance is limited by hardware support.
FP8 quantization slashes VGG16's inference time by 40% and memory footprint by 32% on an RTX 4090, making it a sweet spot for efficient GPU deployment compared to INT8 and FP32.
Deep neural networks have achieved remarkable success in machine learning tasks such as computer vision and natural language processing, but their computational and memory requirements often limit their deployment. Quantization offers a promising solution by reducing model precision while preserving accuracy, and it can be combined with other common optimization strategies for further efficiency gains. This paper investigates the effects of quantization on the VGG16 model, comparing three precision levels (FP32, FP8, and INT8) in terms of accuracy, inference time, and memory usage. We propose an optimized approach, based on post-training quantization, that minimizes redundant quantization and dequantization steps through layer fusion. Moreover, we leverage GPU architectures and their parallelization capabilities to overcome the high computational complexity of deep neural networks. Experiments conducted on the CIFAR-10 dataset using an NVIDIA RTX 4090 demonstrate that, while INT8 quantization does not consistently outperform FP32 due to incomplete native hardware support, FP8 quantization achieves a better balance between speed and memory footprint with minimal accuracy degradation. Specifically, at batch size 4096, FP8 reduces inference time from 789 ms to 471 ms (≈40% speed-up) and cuts GPU memory usage to 68% of the FP32 baseline, whereas INT8 reaches an inference time of 475 ms while slightly increasing memory usage to 104% of that baseline. Our findings underscore the importance of hardware-aware quantization techniques and demonstrate the viability of reduced-precision models in real-time applications and resource-constrained environments.
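To make the quantize/dequantize round trip concrete, the following is a minimal sketch of symmetric per-tensor INT8 post-training quantization in NumPy. It is illustrative only: the function names and the symmetric scaling scheme are assumptions for exposition, not the paper's actual implementation, which additionally fuses layers to avoid repeating this round trip between adjacent operations.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Map float32 values to int8 with a symmetric per-tensor scale (illustrative)."""
    scale = float(np.max(np.abs(x))) / 127.0  # largest magnitude maps to +/-127
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float32 values from the int8 representation."""
    return q.astype(np.float32) * scale

# Stand-in for a layer's weight tensor.
rng = np.random.default_rng(0)
w = rng.standard_normal(1000).astype(np.float32)

q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)

# Rounding error is bounded by about half a quantization step (scale / 2).
print("max abs error:", float(np.max(np.abs(w - w_hat))))
```

Each quantize/dequantize pair adds overhead at layer boundaries, which is why fusing adjacent layers, so intermediate results stay in the low-precision domain, reduces inference time.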