$F(x)=\tfrac{1}{2}x^{\top}Ax+b^{\top}x+\lambda\|x\|_{1}$ decreases and $x_{k}$ converges to a minimizer.

Construction. We compute $y_{k}=x_{k}-\gamma(Ax_{k}+b)$ with the (U) head and implement soft-thresholding with a fixed two-layer width-$2n$ ReLU FFN with weights
\[
W_{1}=\begin{bmatrix}I_{n}\\ -I_{n}\end{bmatrix}\in\mathbb{R}^{2n\times n},\qquad W_{2}=\begin{bmatrix}I_{n}&-I_{n}\end{bmatrix}\in\mathbb{R}^{n\times 2n},
\]
and bias $-\theta[\mathbf{1}_{n};\mathbf{1}_{n}]$. Then
\[
\begin{split}
x_{k+1}&=W_{2}\,\mathrm{ReLU}\Big(W_{1}y_{k}-\theta\begin{bmatrix}\mathbf{1}_{n}\\ \mathbf{1}_{n}\end{bmatrix}\Big)\\
&=(y_{k}-\theta\mathbf{1}_{n})_{+}-(-y_{k}-\theta\mathbf{1}_{n})_{+}=\mathcal{S}_{\theta}(y_{k}),
\end{split}
\]
since $(u-\theta)_{+}-(-u-\theta)_{+}=\mathrm{sign}(u)\,(|u|-\theta)_{+}$ coordinatewise. With $\theta=\gamma\lambda$, this equals $\operatorname{prox}_{\gamma\lambda\|\cdot\|_{1}}(y_{k})$, i.e., the ISTA update. Full steps are provided in Appendix A.3. Figure 2(b) shows that the transformer construction matches ISTA’s convergence behavior across depth. ∎

Proposition 3.4. One (U) gradient step followed by the same fixed two-layer ReLU FFN wrapped in a scalar threshold loop performs the exact Euclidean projection onto the $\ell_{1}$-ball:
\[
x_{k+1}=\mathrm{Proj}_{\{\|x\|_{1}\leq B\}}\big(x_{k}-\gamma(Ax_{k}+b)\big).
\]
With $0<\gamma\leq 1/L$, this is projected gradient descent for (C), so $x_{k}$ converges to an optimal solution.

Construction. From the (U) head we form $y_{k}=x_{k}-\gamma(Ax_{k}+b)$. We then run a scalar threshold loop
\[
\theta_{t+1}=\theta_{t}+\eta\,\big[\|\mathcal{S}_{\theta_{t}}(y_{k})\|_{1}-B\big]_{+},\qquad\theta_{0}=0,\quad 0<\eta\leq\tfrac{1}{n},
\]
and set $x_{k+1}=\mathcal{S}_{\theta\dots}$
Transformers can directly solve quadratic programs and leverage covariance matrices for superior decision-making, outperforming traditional "predict-then-optimize" methods in portfolio construction.
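The ReLU construction of soft-thresholding and the scalar threshold loop described in the proofs above can be checked numerically. The following is a minimal sketch in plain Python, not the authors' implementation: the function names (`soft_threshold_ffn`, `soft_threshold_ref`, `threshold_loop`) and the `steps` argument are hypothetical, and because the source text is cut off before stating which threshold is finally applied, using the last loop iterate is an assumption.

```python
import math


def relu(v):
    """Coordinatewise ReLU."""
    return [max(0.0, t) for t in v]


def l1_norm(v):
    """Sum of absolute values."""
    return sum(abs(t) for t in v)


def soft_threshold_ffn(y, theta):
    """Two-layer ReLU FFN with W1 = [I; -I], bias -theta * 1, W2 = [I, -I]."""
    n = len(y)
    # First layer: stack (y - theta) on top of (-y - theta), then apply ReLU.
    hidden = relu([t - theta for t in y] + [-t - theta for t in y])
    # Second layer: subtract the bottom half from the top half.
    return [hidden[i] - hidden[n + i] for i in range(n)]


def soft_threshold_ref(y, theta):
    """Closed form sign(u) * max(|u| - theta, 0), applied coordinatewise."""
    return [math.copysign(max(abs(t) - theta, 0.0), t) for t in y]


def threshold_loop(y, B, eta, steps):
    """Iterate theta <- theta + eta * [||S_theta(y)||_1 - B]_+ from theta = 0.

    `steps` is a hypothetical iteration budget; returning the last iterate
    is an assumption, since the source is truncated at this point.
    """
    theta = 0.0
    for _ in range(steps):
        excess = l1_norm(soft_threshold_ffn(y, theta)) - B
        theta += eta * max(0.0, excess)
    return theta
```

As a usage example, take `y = [3.0, -1.0, 0.5]`, `B = 2.0`, and `eta = 1/3` (so that `eta <= 1/n` with `n = 3`). The threshold that makes the soft-thresholded vector have unit-ball radius 2 is exactly 1, giving the projection `[2, 0, 0]`, and `threshold_loop(y, 2.0, 1/3, 200)` approaches that value from below. When `y` already lies inside the ball, the excess is never positive and the loop leaves `theta` at 0.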