This paper introduces DiV-INR, a video compression framework that leverages implicit neural representations (INRs) to condition pre-trained video diffusion models for extremely low bitrate compression. By replacing traditional keyframes with INRs trained to estimate latent features, the approach provides compact video representations that guide the diffusion process. Joint optimization of INR weights and parameter-efficient adapters for diffusion models results in significant improvements in perceptual metrics compared to HEVC, VVC, and other neural codecs at bitrates below 0.05 bpp.
DiV-INR achieves perceptually superior video compression at extremely low bitrates by conditioning diffusion models on implicit neural representations, outperforming VVC and prior neural codecs.
We present a perceptually-driven video compression framework integrating implicit neural representations (INRs) and pre-trained video diffusion models to address the extremely low bitrate regime (<0.05 bpp). Our approach exploits the complementary strengths of INRs, which provide a compact video representation, and diffusion models, which offer rich generative priors learned from large-scale datasets. The INR-based conditioning replaces traditional intra-coded keyframes with bit-efficient neural representations trained to estimate latent features and guide the diffusion process. Our joint optimization of INR weights and parameter-efficient adapters for diffusion models allows the model to learn reliable conditioning signals while encoding video-specific information with minimal parameter overhead. Our experiments on the UVG, MCL-JCV, and JVET Class-B benchmarks demonstrate substantial improvements in perceptual metrics (LPIPS, DISTS, and FID) at extremely low bitrates, with BD-LPIPS improvements of up to 0.214 and BD-FID improvements of up to 91.14 relative to HEVC, while also outperforming VVC and prior state-of-the-art neural and INR-only video codecs. Moreover, our analysis shows that INR-conditioned diffusion-based video compression first composes the scene layout and object identities before refining textural accuracy, exposing the semantic-to-visual hierarchy that enables perceptually faithful compression at extremely low bitrates.
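To give a sense of scale for the sub-0.05 bpp regime, the following is a minimal sketch of the bitrate arithmetic when a codec's bitstream is made of network weights. It rests on simplifying assumptions that are not taken from the paper: the bitstream contains only fixed-width quantized parameters (no entropy coding, no side information), and the resolution, frame count, and 8-bit quantization are illustrative values. The function names are hypothetical.

```python
from fractions import Fraction


def bits_per_pixel(total_bits, width, height, num_frames):
    """Average bitrate in bits per pixel (bpp) over the whole sequence."""
    return total_bits / (width * height * num_frames)


def max_params_for_budget(target_bpp, width, height, num_frames, bits_per_param):
    """Largest parameter count whose raw size fits within the bpp target.

    Assumes every parameter costs exactly `bits_per_param` bits, which
    ignores entropy coding and any non-weight side information.
    """
    # Parse the target exactly (0.05 -> 1/20) to avoid float rounding.
    total_bits = Fraction(str(target_bpp)) * width * height * num_frames
    return int(total_bits // bits_per_param)


# Illustrative clip: 1920x1080, 600 frames, 8-bit quantized weights.
WIDTH, HEIGHT, FRAMES = 1920, 1080, 600
budget = max_params_for_budget(0.05, WIDTH, HEIGHT, FRAMES, bits_per_param=8)
print(budget)                                             # 7776000
print(bits_per_pixel(budget * 8, WIDTH, HEIGHT, FRAMES))  # bpp at that budget
```

Under these assumptions, a 600-frame 1080p clip at 0.05 bpp leaves room for roughly 7.8 million 8-bit parameters in total, shared between the INR weights and any video-specific adapter weights that are transmitted. That is 103,680 bits, or about 13 KB, per frame on average.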