Kutay Tire

$F(x)=\tfrac{1}{2}x^{\top}Ax+b^{\top}x+\lambda\|x\|_{1}$ decreases and $x_{k}$ converges to a minimizer.

Construction. We compute $y_{k}=x_{k}-\gamma(Ax_{k}+b)$ with the (U) head and implement soft-thresholding with a fixed two-layer width-$2n$ ReLU FFN with weights
\[
W_{1}=\begin{bmatrix}I_{n}\\ -I_{n}\end{bmatrix}\in\mathbb{R}^{2n\times n},\qquad W_{2}=\begin{bmatrix}I_{n}&-I_{n}\end{bmatrix}\in\mathbb{R}^{n\times 2n},
\]
and bias $-\theta[\mathbf{1}_{n};\mathbf{1}_{n}]$. Then
\[
\begin{split}
x_{k+1}&=W_{2}\,\mathrm{ReLU}\Big(W_{1}y_{k}-\theta\begin{bmatrix}\mathbf{1}_{n}\\ \mathbf{1}_{n}\end{bmatrix}\Big)\\
&=(y_{k}-\theta\mathbf{1}_{n})_{+}-(-y_{k}-\theta\mathbf{1}_{n})_{+}=\mathcal{S}_{\theta}(y_{k}),
\end{split}
\]
since $(u-\theta)_{+}-(-u-\theta)_{+}=\mathrm{sign}(u)\,(|u|-\theta)_{+}$ coordinatewise. With $\theta=\gamma\lambda$, this equals $\operatorname{prox}_{\gamma\lambda\|\cdot\|_{1}}(y_{k})$, i.e., the ISTA update. Full steps are provided in Appendix A.3. Figure 2(b) shows that the transformer construction matches ISTA's convergence behavior across depth. ∎

Proposition 3.4. One (U) gradient step followed by the same fixed two-layer ReLU FFN wrapped in a scalar threshold loop performs the exact Euclidean projection onto the $\ell_{1}$-ball:
\[
x_{k+1}=\mathrm{Proj}_{\{\|x\|_{1}\leq B\}}\big(x_{k}-\gamma(Ax_{k}+b)\big).
\]
With $0<\gamma\leq 1/L$, this is projected gradient descent for (C), so $x_{k}$ converges to an optimal solution.

Construction. From the (U) head we form $y_{k}=x_{k}-\gamma(Ax_{k}+b)$. We then run a scalar threshold loop
\[
\theta_{t+1}=\theta_{t}+\eta\,\big[\|\mathcal{S}_{\theta_{t}}(y_{k})\|_{1}-B\big]_{+},\qquad\theta_{0}=0,\quad 0<\eta\leq\tfrac{1}{n},
\]
and set $x_{k+1}=\mathcal{S}_{\theta}$
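The two constructions in the excerpt can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names (`soft_threshold_ffn`, `ista_step`, `threshold_loop`) and the iteration count `num_iters` are assumptions introduced here, and the loop only returns the scalar threshold, since the excerpt is cut off before it states how $x_{k+1}$ is formed from it.

```python
import math


def matvec(M, v):
    """Dense matrix-vector product on nested lists."""
    return [sum(m * t for m, t in zip(row, v)) for row in M]


def relu(v):
    return [max(t, 0.0) for t in v]


def ffn_weights(n):
    """Build W1 = [I; -I] (2n x n) and W2 = [I, -I] (n x 2n)."""
    eye = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    neg = [[-a for a in row] for row in eye]
    W1 = eye + neg                                # stack rows
    W2 = [eye[i] + neg[i] for i in range(n)]      # concatenate columns
    return W1, W2


def soft_threshold_ffn(y, theta):
    """S_theta(y) = W2 ReLU(W1 y - theta * [1; 1])."""
    W1, W2 = ffn_weights(len(y))
    pre = [h - theta for h in matvec(W1, y)]      # bias is -theta on all 2n units
    return matvec(W2, relu(pre))


def soft_threshold_reference(y, theta):
    """Closed form sign(u) * (|u| - theta)_+ for comparison."""
    return [math.copysign(max(abs(u) - theta, 0.0), u) for u in y]


def ista_step(A, b, x, gamma, lam):
    """One ISTA update: gradient step, then the FFN with theta = gamma * lam."""
    grad = [g + bi for g, bi in zip(matvec(A, x), b)]
    y = [xi - gamma * gi for xi, gi in zip(x, grad)]
    return soft_threshold_ffn(y, gamma * lam)


def threshold_loop(y, B, eta, num_iters):
    """Scalar loop theta <- theta + eta * [||S_theta(y)||_1 - B]_+ from theta = 0.

    num_iters is a hypothetical stopping rule chosen for this sketch.
    """
    theta = 0.0
    for _ in range(num_iters):
        l1 = sum(abs(v) for v in soft_threshold_ffn(y, theta))
        theta += eta * max(l1 - B, 0.0)
    return theta
```

For example, with `y = [3.0, -1.0, 0.5]`, `B = 2.0`, and `eta = 1/3` (the upper end of the stated range for $n=3$), `threshold_loop` increases the threshold monotonically until the soft-thresholded vector has $\ell_{1}$-norm $B$; the step-size bound $\eta\leq 1/n$ is what keeps the update from overshooting that point.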


Research focus

Architecture Design (Transformers, SSMs, MoE) (1)
Training Efficiency & Optimization (1)

Frequent co-authors

Yufan Zhang (1)
Ege Onur Taga (1)
Samet Oymak (1)

Papers (1)

Kutay Tire +3 · Feb 16, 2026 · also UMich

Covariance-Aware Transformers for Quadratic Programming and Decision Making

Transformers can directly solve quadratic programs and leverage covariance matrices for superior decision-making, outperforming traditional "predict-then-optimize" methods in portfolio construction.

Kutay Tire, Yufan Zhang, Ege Onur Taga +1

Architecture Design (Transformers, SSMs, MoE)
Training Efficiency & Optimization
