D RoPEs applied per axis and then composed as above.

3.2 Unified Camera-Frame RoPE

Fine-tuning a pretrained DiT video diffusion model into a stereo world model requires injecting camera conditioning (including stereo cameras with varying baselines and dynamic camera motions) while minimizing disruption to the pretrained prior. A common approach concatenates Plücker ray encodings [75] onto the input feature channels. However, similar to early positional encoding methods [56], this approach relies on absolute coordinates, making it sensitive to the choice of reference frame. To mitigate this limitation, recent methods such as GTA [40] and PRoPE [34] model relative camera positions, yielding improved generalization. Specifically, PRoPE replaces $\mathbf{R}_{\Delta t,\Delta x,\Delta y}$ in Eq. (3) with $\mathbf{R}_{\Delta t,\Delta x,\Delta y}^{\Delta\texttt{cam}}$, where

\begin{align}
\mathbf{R}_{\Delta t,\Delta x,\Delta y}^{\Delta\texttt{cam}}(d) &= \mathbf{R}_{t_{1},x_{1},y_{1}}^{\texttt{cam}_{t_{1}}}(d)\,\bigl(\mathbf{R}_{t_{2},x_{2},y_{2}}^{\texttt{cam}_{t_{2}}}(d)\bigr)^{\top}, \tag{4} \\
\mathbf{R}_{t_{j},x_{j},y_{j}}^{\texttt{cam}_{t_{j}}}(d) &= \begin{bmatrix}\mathbf{I}_{d/8}\otimes\mathbf{P}_{j} & \mathbf{0}\\ \mathbf{0} & \mathbf{R}_{t_{j},x_{j},y_{j}}(d/2)\end{bmatrix}, \tag{5} \\
\mathbf{P}_{j} &= \begin{bmatrix}\boldsymbol{K}_{j} & \mathbf{0}\\ \mathbf{0} & 1\end{bmatrix}\boldsymbol{T}_{j},\quad \boldsymbol{K}_{j},\boldsymbol{T}_{j}=\texttt{cam}_{t_{j}}. \notag
\end{align}

Here $j\in\{1,2\}$, $\otimes$ is the Kronecker product, and $\mathbf{I}_{d/8}\in\mathbb{R}^{d/\ldots}$
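The block structure of Eqs. (4) and (5) can be made concrete with a small NumPy sketch. This is an illustration under stated assumptions, not the paper's or PRoPE's implementation: the 3D RoPE is assumed to give each of the three axes an equal share of channels composed block-diagonally (the composition in Eq. (3) is not shown in this excerpt), $\boldsymbol{K}_j$ is taken as a 3×3 intrinsics matrix and $\boldsymbol{T}_j$ as a 4×4 extrinsics matrix, and all function names and numeric values are hypothetical.

```python
import numpy as np

def rope_1d(pos, dim, base=10000.0):
    """dim x dim block-diagonal matrix of 2x2 rotations (standard 1D RoPE)."""
    freqs = base ** (-2.0 * np.arange(dim // 2) / dim)
    R = np.zeros((dim, dim))
    for i, theta in enumerate(pos * freqs):
        c, s = np.cos(theta), np.sin(theta)
        R[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[c, -s], [s, c]]
    return R

def block_diag(*mats):
    """Place square matrices along the diagonal of a larger zero matrix."""
    n = sum(m.shape[0] for m in mats)
    out = np.zeros((n, n))
    k = 0
    for m in mats:
        out[k:k + m.shape[0], k:k + m.shape[0]] = m
        k += m.shape[0]
    return out

def rope_3d(t, x, y, dim):
    """ASSUMPTION: each axis gets dim/3 channels, composed block-diagonally."""
    assert dim % 6 == 0, "each axis needs an even number of channels"
    return block_diag(*(rope_1d(p, dim // 3) for p in (t, x, y)))

def projection(K, T):
    """P = [[K, 0], [0, 1]] @ T, assuming K is 3x3 and T is 4x4."""
    K4 = np.eye(4)
    K4[:3, :3] = K
    return K4 @ T

def camera_rope(t, x, y, K, T, d):
    """Eq. (5): blockdiag(I_{d/8} kron P, R_{t,x,y}(d/2))."""
    cam_block = np.kron(np.eye(d // 8), projection(K, T))  # (d/2) x (d/2)
    return block_diag(cam_block, rope_3d(t, x, y, d // 2))

def relative_camera_rope(tok1, tok2, d):
    """Eq. (4) as written: R_1 @ R_2^T."""
    return camera_rope(*tok1, d) @ camera_rope(*tok2, d).T

# Hypothetical values: shared intrinsics, right camera shifted along x.
d = 48
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
T_left = np.eye(4)
T_right = np.eye(4)
T_right[0, 3] = -0.1  # illustrative stereo baseline
tok1 = (3, 5, 7, K, T_left)   # (t, x, y, K, T)
tok2 = (1, 2, 4, K, T_right)
R_rel = relative_camera_rope(tok1, tok2, d)
```

With these choices the lower-right $d/2 \times d/2$ block of `R_rel` depends only on the index differences $(\Delta t, \Delta x, \Delta y)$, while the upper-left block repeats $\mathbf{P}_1\mathbf{P}_2^{\top}$ along the diagonal $d/8$ times, which is how the camera term enters the attention alongside the usual positional term.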
Generate consistent stereo videos directly from RGB data, bypassing depth estimation and monocular-to-stereo conversion, with StereoWorld's novel camera-aware attention mechanisms.