This paper introduces LiftFormer, a novel monocular depth estimation (MDE) architecture that leverages lifting theory to construct intermediate subspaces for bridging image color features and depth values. The core idea is to transform depth value prediction into a depth-oriented geometric representation (DGR) subspace feature representation, enhanced by an edge-aware representation (ER) subspace to improve prediction accuracy around edges. Experiments demonstrate state-of-the-art performance on standard MDE datasets, validating the effectiveness of the proposed lifting modules.
LiftFormer achieves state-of-the-art monocular depth estimation by mapping image features into depth-oriented geometric subspaces, mitigating the ill-posed nature of direct depth prediction.
Monocular depth estimation (MDE) has attracted increasing interest in recent years owing to its important role in 3D vision. MDE estimates a depth map from a monocular image or video to represent the 3D structure of a scene, which is a highly ill-posed problem. To address this problem, we propose LiftFormer, an architecture based on lifting theory that constructs an intermediate subspace bridging image color features and depth values, together with a second subspace that enhances depth prediction around edges. MDE is formulated by transforming the depth value prediction problem into feature representation in a depth-oriented geometric representation (DGR) subspace, thus bridging the learning from color values to geometric depth values. The DGR subspace is constructed, following frame theory, from linearly dependent vectors aligned with depth bins, providing a redundant and robust representation. Image spatial features are transformed into the DGR subspace, where they correspond directly to depth values. Moreover, because edges usually exhibit sharp changes in a depth map and tend to be predicted erroneously, an edge-aware representation (ER) subspace is constructed, into which depth features are transformed and then used to enhance local features around edges. Experimental results demonstrate that LiftFormer achieves state-of-the-art performance on widely used datasets, and an ablation study validates the effectiveness of both proposed lifting modules.
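The frame-based idea behind the DGR subspace can be illustrated with a minimal sketch: an overcomplete set of unit vectors (more vectors than feature dimensions, hence linearly dependent) acts as a redundant frame, and per-bin inner products with this frame give scores that are combined into a depth estimate. All function names, shapes, and the softmax-over-bins readout below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def make_dgr_frame(num_bins=16, feat_dim=8, seed=0):
    # Hypothetical construction: num_bins > feat_dim, so the unit vectors
    # are linearly dependent and form a redundant frame over feature space.
    rng = np.random.default_rng(seed)
    F = rng.normal(size=(num_bins, feat_dim))
    return F / np.linalg.norm(F, axis=1, keepdims=True)

def predict_depth(features, frame, bin_centers):
    # "Transform" image features into the DGR subspace: one score per
    # depth bin via inner products with the frame vectors.
    scores = features @ frame.T                      # (N, num_bins)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)    # softmax over bins
    # Depth as a convex combination of the bin centers.
    return weights @ bin_centers                     # (N,)

frame = make_dgr_frame()
bin_centers = np.linspace(0.5, 10.0, 16)             # assumed depth range (m)
features = np.random.default_rng(1).normal(size=(4, 8))
depths = predict_depth(features, frame, bin_centers)
```

Because the weights are non-negative and sum to one, every predicted depth lies inside the assumed bin range, which is one reason bin-based formulations are robust compared with unconstrained regression.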