LT2: Linear-Time Looped Transformers

1 Rice University · 2 Apple · 3 UC Santa Cruz · 4 Carnegie Mellon University
Figure 1. (Left) LT2 occupies a new region of the parameter-efficiency frontier: for the same parameter budget, LT2 models achieve better quality with far lower inference cost than standard Looped Transformers. (Right) After distillation from a pre-trained full-attention Looped Transformer, Ouro-hybrid-1.4B is competitive with industry-level 3B–4B models while inheriting LT2's linear-time inference.

Overview

The scaling problem of looped full attention. Attention FLOPs (left) and KV-cache memory (right) for a 1.3B model vs. sequence length. Because each loop re-runs full attention, both costs compound with the number of loops. LT2's linear/sparse mixers keep both curves flat regardless of loop count.

Looped Transformers (LT) are an elegant idea: instead of stacking many independently-parameterized layers, reuse the same block of weights $T$ times before producing the output token. This gives $T\times$ the effective depth at $1\times$ the parameter count — a compelling handle for parameter-efficient reasoning at inference time.

But there is a catch. Each loop re-runs full quadratic self-attention over the entire sequence. FLOPs grow as $\mathcal{O}(L^2)$ per loop iteration, and the KV-cache grows as $\mathcal{O}(T \cdot L)$ at inference. As you add more loops to get more reasoning depth, the attention cost compounds — exactly where you want to scale, the architecture becomes most expensive.
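The compounding can be made concrete with a back-of-the-envelope cost model. This is a rough sketch, not the accounting used for the figure: it counts only the two $L \times L$ matmuls of softmax attention at 2 FLOPs per multiply-add, ignores projections, FFN, and causal masking, and assumes every (loop, layer) pair keeps its own keys and values. The function names and default sizes are illustrative.

```python
def attention_flops(seq_len, d_model, n_layers, n_loops):
    """Approximate softmax-attention FLOPs for one full forward pass.

    Counts only QK^T and the attention-weighted sum over V
    (2 matmuls, 2 FLOPs per multiply-add); everything else is ignored.
    """
    per_layer = 2 * 2 * seq_len * seq_len * d_model
    return per_layer * n_layers * n_loops


def kv_cache_bytes(seq_len, d_model, n_layers, n_loops, bytes_per_value=2):
    """KV-cache size, assuming each (loop, layer) pair stores its own K and V."""
    per_layer = 2 * seq_len * d_model * bytes_per_value  # one K and one V tensor
    return per_layer * n_layers * n_loops


if __name__ == "__main__":
    # Illustrative sizes only; they are not taken from the paper.
    for loops in (1, 2, 4):
        flops = attention_flops(seq_len=4096, d_model=2048, n_layers=24, n_loops=loops)
        cache = kv_cache_bytes(seq_len=4096, d_model=2048, n_layers=24, n_loops=loops)
        print(f"T={loops}: {flops:.3e} attention FLOPs, {cache / 2**20:.0f} MiB KV cache")
```

Under these assumptions, doubling $T$ doubles both quantities, while doubling $L$ quadruples the FLOPs and doubles the cache.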

LT2 (Linear-Time Looped Transformers) asks: can we keep the looping, but cut the attention cost? We replace full softmax attention inside each loop with subquadratic token mixers — linear attention and sparse attention — and find that looping and efficient attention are not just compatible, but genuinely synergistic. The loop changes what the efficient mixer can do, not just how many times it runs.


Subquadratic Attention in Looped Transformers

Architecture formulation

A standard Transformer of depth $N$ stacks $N$ independently-parameterized blocks $\{\mathcal{F}_\ell\}_{\ell=1}^{N}$:

$$\mathcal{F}_\ell(\mathbf{h}) = \mathbf{h}' + \mathrm{FFN}_\ell(\mathbf{h}'), \qquad \mathbf{h}' = \mathbf{h} + \mathrm{MHA}_\ell(\mathbf{h}).$$

A Looped Transformer (LT) reuses these $N$ shared blocks for $T$ iterations:

$$\mathbf{h}^{(0)} = \mathrm{Emb}(\mathbf{x}), \quad \mathbf{h}^{(\tau)} = \bigl(\mathcal{F}_N \circ \cdots \circ \mathcal{F}_1\bigr)\!\bigl(\mathbf{h}^{(\tau-1)}\bigr), \quad \tau = 1, \ldots, T,$$

yielding effective depth $T \cdot N$ with only $N$ unique parameter sets. Each $\mathrm{MHA}_\ell$ call costs $\mathcal{O}(L^2)$ FLOPs and is executed $T$ times, and the KV-cache at inference is $\mathcal{O}(T \cdot L)$, so both total attention compute and cache memory scale linearly with $T$. LT2 replaces MHA with a subquadratic token mixer:

$$\mathbf{h}' = \mathbf{h} + \mathrm{LinearMixer}_\ell(\mathbf{h}),$$

keeping the looping, weight sharing, and a learned per-loop residual gate $\mathbf{h}^{(\tau)} = \widetilde{\mathbf{h}}^{(\tau)} + \boldsymbol{\rho}_\tau \odot \mathbf{h}^{(\tau-1)}$ unchanged. Beyond efficiency, looping amplifies the expressive power of subquadratic mixers in two distinct ways.
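The control flow of the loop and its residual gate can be sketched in a few lines. This is a toy illustration, not the released implementation: hidden states are plain lists of floats, `blocks` stands in for the shared stack $\mathcal{F}_1, \ldots, \mathcal{F}_N$, and `rho` holds one gate vector per loop iteration. All three names are hypothetical.

```python
def looped_forward(h0, blocks, rho):
    """Run the shared block stack T = len(rho) times with a per-loop residual gate.

    h0     : list of floats, a toy hidden state h^(0)
    blocks : list of callables F_1..F_N, reused unchanged in every loop
    rho    : list of T gate vectors, each the same length as h0
    """
    h = list(h0)
    for rho_tau in rho:
        h_tilde = h
        for block in blocks:  # F_N(...F_1(h)), same weights every iteration
            h_tilde = block(h_tilde)
        # h^(tau) = h_tilde^(tau) + rho_tau * h^(tau-1), elementwise
        h = [ht + r * hp for ht, r, hp in zip(h_tilde, rho_tau, h)]
    return h


if __name__ == "__main__":
    def halve(h):
        return [0.5 * x for x in h]

    print(looped_forward([1.0, 2.0], [halve], [[0.0, 0.0], [0.0, 0.0]]))
```

Swapping the mixer inside each block leaves this outer loop untouched, which is the sense in which LT2 keeps the looping structure unchanged.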

Linear attention: rank-$T$ memory update

Frontier linear-attention architectures (GDN, KDA, RWKV7) maintain a fixed-size recurrent state $\mathbf{S}_t \in \mathbb{R}^{d_k \times d_v}$ via a DPLR operator:

$$\mathbf{S}_t = \mathbf{A}_t\,\mathbf{S}_{t-1} + \beta_t\,\mathbf{k}_t\mathbf{v}_t^{\top}, \qquad \mathbf{A}_t = \mathrm{Diag}(\boldsymbol{\alpha}_t)\bigl(\mathbf{I} - \beta_t\,\mathbf{k}_t\mathbf{k}_t^{\top}\bigr).$$
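A single step of this recurrence can be written out directly. The sketch below is a didactic pure-Python version of the update above, assuming unit-norm keys; the `read` helper uses a $\mathbf{q}^{\top}\mathbf{S}$ readout, which is a convention chosen for this example rather than something specified here.

```python
def dplr_step(S, k, v, alpha, beta):
    """One update S_t = Diag(alpha) (I - beta k k^T) S_{t-1} + beta k v^T.

    S is a d_k x d_v list of lists; k and alpha have length d_k; v has length d_v.
    """
    d_k, d_v = len(k), len(v)
    # k^T S_{t-1}: what the memory currently returns for key k.
    k_S = [sum(k[i] * S[i][j] for i in range(d_k)) for j in range(d_v)]
    new_S = []
    for i in range(d_k):
        row = []
        for j in range(d_v):
            erased = S[i][j] - beta * k[i] * k_S[j]  # (I - beta k k^T) S
            row.append(alpha[i] * erased + beta * k[i] * v[j])
        new_S.append(row)
    return new_S


def read(S, q):
    """Query the memory as q^T S (readout convention assumed for this sketch)."""
    return [sum(q[i] * S[i][j] for i in range(len(q))) for j in range(len(S[0]))]


if __name__ == "__main__":
    S = [[0.0, 0.0], [0.0, 0.0]]
    ones = [1.0, 1.0]
    S = dplr_step(S, [0.0, 1.0], [7.0, 8.0], ones, 1.0)  # store under key e2
    S = dplr_step(S, [1.0, 0.0], [3.0, 4.0], ones, 1.0)  # store under key e1
    S = dplr_step(S, [1.0, 0.0], [5.0, 6.0], ones, 1.0)  # overwrite key e1
    print(read(S, [1.0, 0.0]), read(S, [0.0, 1.0]))
```

With $\boldsymbol{\alpha} = \mathbf{1}$ and $\beta = 1$, writing to a key first erases whatever was stored along that key and then stores the new value, while content along orthogonal keys is left alone.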

The matrix $\mathbf{A}_t$ is a diagonally scaled identity-minus-rank-1 matrix, so a single non-looped DPLR (diagonal-plus-low-rank) block can only modify recurrent memory along one key direction per token. When looped $T$ times, the cumulative state-transition operator across all iterations is:

$$\mathbf{A}_t^{\mathrm{eff}} = \prod_{\tau=1}^{T} \mathbf{A}_t^{(\tau)} = \prod_{\tau=1}^{T} \mathrm{Diag}\!\bigl(\boldsymbol{\alpha}_t^{(\tau)}\bigr)\!\left(\mathbf{I} - \beta_t^{(\tau)}\,\mathbf{k}_t^{(\tau)}\mathbf{k}_t^{(\tau)\top}\right).$$

When the per-loop keys $\{\mathbf{k}_t^{(\tau)}\}$ are orthogonal (which diverse intermediate representations approach in practice), the product erases $T$ distinct directions in memory — yielding a rank-$T$ perturbation and directly multiplying the state-tracking capacity without any added parameters.
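The rank claim can be checked numerically on a small example. The sketch below fixes $\boldsymbol{\alpha} = \mathbf{1}$ so that the perturbation is measured relative to the identity, multiplies the per-loop factors, and computes the rank of $\mathbf{I} - \mathbf{A}^{\mathrm{eff}}$ by Gaussian elimination. The helper names are hypothetical and the keys are hand-picked orthonormal vectors, not learned representations.

```python
def matmul(A, B):
    """Plain list-of-lists matrix product."""
    return [[sum(A[i][m] * B[m][j] for m in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]


def transition(k, beta):
    """I - beta k k^T (alpha is fixed to 1 in this sketch)."""
    d = len(k)
    return [[(1.0 if i == j else 0.0) - beta * k[i] * k[j] for j in range(d)]
            for i in range(d)]


def rank(M, tol=1e-9):
    """Matrix rank via Gaussian elimination with partial pivoting."""
    M = [row[:] for row in M]
    rows, cols = len(M), len(M[0])
    r = 0
    for c in range(cols):
        pivot = max(range(r, rows), key=lambda i: abs(M[i][c]))
        if abs(M[pivot][c]) < tol:
            continue
        M[r], M[pivot] = M[pivot], M[r]
        for i in range(r + 1, rows):
            f = M[i][c] / M[r][c]
            for j in range(c, cols):
                M[i][j] -= f * M[r][j]
        r += 1
        if r == rows:
            break
    return r


def perturbation_rank(keys, beta=1.0):
    """Rank of I - prod_tau (I - beta k_tau k_tau^T) for one key per loop."""
    d = len(keys[0])
    A_eff = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    for k in keys:
        A_eff = matmul(transition(k, beta), A_eff)
    diff = [[(1.0 if i == j else 0.0) - A_eff[i][j] for j in range(d)] for i in range(d)]
    return rank(diff)


if __name__ == "__main__":
    e = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    print(perturbation_rank(e[:1]), perturbation_rank(e[:3]))
```

Three mutually orthogonal keys give a rank-3 perturbation, whereas repeating the same key in every loop stays at rank 1, so the gain depends on the loops producing distinct keys.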

Sparse attention: $\mathcal{O}(Tw)$ receptive field

A sliding-window block with window $w$ restricts each query at position $t$ to attend only to tokens $\mathcal{I}_t^{(1)} = \{t - w + 1, \ldots, t\}$. After $T$ loop iterations, information propagates further each loop, and chaining this inductively gives:

$$\mathcal{I}_t^{(T)} \supseteq \bigl\{\max(1,\, t - Tw + 1),\, \ldots,\, t\bigr\}, \qquad \bigl|\mathcal{I}_t^{(T)}\bigr| = \mathcal{O}(Tw).$$

$T$ loops reach as far back as $T$ stacked independent window-$w$ layers but with $T\times$ fewer parameters. Looping turns compute into context: a fixed local window covers arbitrary sequence lengths once $T$ is large enough.
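The growth of the receptive field can be traced explicitly by chaining window hops. The sketch below is a minimal illustration with a hypothetical helper: it tracks which positions can influence position $t$ when each causal window layer sees the current position and the $w - 1$ positions before it. For a one-layer block the exact count is $T(w-1) + 1$ before clamping at the start of the sequence, which is $\mathcal{O}(Tw)$.

```python
def receptive_field(t, window, n_loops, n_layers=1):
    """Positions (1-indexed) that can influence position t after n_loops passes.

    Each causal sliding-window layer lets a position see itself and the
    window - 1 positions before it; stacking layers or loops chains those hops.
    """
    reachable = {t}
    for _ in range(n_loops * n_layers):
        reachable = {
            src
            for pos in reachable
            for src in range(max(1, pos - window + 1), pos + 1)
        }
    return reachable


if __name__ == "__main__":
    for loops in (1, 2, 4):
        field = receptive_field(t=100, window=8, n_loops=loops)
        print(loops, min(field), len(field))
```

Adding layers inside the shared block extends the reach in the same way as adding loops, but only the loops do so without adding parameters.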


Architecture

LT2 swaps the MHA sub-layer inside the shared block with a subquadratic token mixer. The looping structure, weight sharing, and learned per-loop residual gate remain identical. We study two base variants and two hybrid families.

| Variant | Mixer | Complexity | Core benefit from looping |
|---|---|---|---|
| LT2-linear | GDN (Gated Delta Net) | $\mathcal{O}(L)$ | Rank-$T$ recurrent state update; most stable optimizer behavior |
| LT2-sparse | DSA (Dynamic Sparse Attn) | $\mathcal{O}(L \log L)$ | Effective receptive field grows to $\mathcal{O}(Tw)$ across loops |
| LT2-hybrid (GDN+DSA) [Efficient] | GDN + Sparse Attn | $\mathcal{O}(L)$ | Linear branch compresses; sparse branch handles precise retrieval; no full attention |
| LT2-hybrid (Full+GDN) [Best Quality] | Small fraction Full Attn + GDN | Near-linear | GDN regularizes the loop; sparse full-attention layers handle hard retrieval cases |

We explore two hybridization strategies. In depth-level mixing, different attention types are interleaved across layers inside the shared block — the same hybrid stack runs every loop iteration. In loop-level mixing, the mixer type varies across iterations, e.g. full attention in loop 1 with progressively narrower windows in later loops. Ablations consistently favor depth-level mixing: distributing attention across depth matters more than scheduling it across time.
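The difference between the two strategies is easiest to see as a table of mixer assignments indexed by (loop, layer). The sketch below builds both kinds of schedule; the string labels (`"full"`, `"gdn"`, `"swa-512"`) and function names are placeholders invented for illustration, and the 1:4 interleave and the Full → SWA example follow the configurations listed in Table 4.

```python
def depth_level_schedule(n_layers, n_loops, gdn_per_full=4):
    """Same hybrid stack in every loop: every (gdn_per_full + 1)-th layer is full attention."""
    stack = ["full" if (i + 1) % (gdn_per_full + 1) == 0 else "gdn"
             for i in range(n_layers)]
    return [list(stack) for _ in range(n_loops)]


def loop_level_schedule(n_layers, mixers_per_loop):
    """One mixer type per loop iteration, applied to every layer in that loop."""
    return [[mixer] * n_layers for mixer in mixers_per_loop]


if __name__ == "__main__":
    for row in depth_level_schedule(n_layers=10, n_loops=2):
        print(row)
    for row in loop_level_schedule(2, ["full", "swa-512", "swa-256", "swa-128"]):
        print(row)
```

In the depth-level schedule every row is identical, so the hybrid stack is part of the shared block; in the loop-level schedule the rows differ, so the mixer changes with the iteration index.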


Zero-Shot Downstream Performance

We evaluate zero-shot downstream accuracy on eight benchmarks (ARC-E/C, HellaSwag, PIQA, Winogrande, OBQA, SciQ, BoolQ) at two scales, 0.6B and 1.3B parameters, each trained on 100B FineWeb-Edu tokens with $T=4$ loops. D-Gate = data-dependent gating; $\Delta$ = delta-rule (DPLR) linear variant.

Table 1. Zero-shot accuracy and perplexity across two scales. Cream rows are the best LT2 model without full attention. Bold = best per column within scale; underline = second best.

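| Model | D-Gate | Δ | PPL ↓ | ARC-E | ARC-C | HellaS. | PIQA | WG | OBQA | SciQ | BoolQ | Avg. ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.6B parameters / 100B tokens (8× Chinchilla ratio) | | | | | | | | | | | | |
| Transformer | | | 13.14 | 63.09 | 30.72 | 47.43 | 69.53 | 56.24 | 35.6 | 68.2 | 50.07 | 51.34 |
| Looped Transformer (ref) | | | 11.92 | 67.13 | 34.67 | 53.29 | 70.58 | 62.83 | 38.2 | 73.6 | 54.87 | 56.42 |
| LT2-linear attention | | | | | | | | | | | | |
| Looped RetNet | | | training diverged | | | | | | | | | |
| Looped HGRN2 | | | 14.59 | 59.82 | 27.93 | 43.17 | 67.34 | 52.13 | 33.4 | 65.2 | 48.53 | 49.69 |
| Looped Mamba2 | | | 12.78 | 64.53 | 31.82 | 49.87 | 69.74 | 58.63 | 35.6 | 68.8 | 51.83 | 53.86 |
| Looped DeltaNet | | | 14.16 | 60.47 | 28.53 | 44.22 | 67.87 | 53.24 | 33.8 | 65.5 | 49.13 | 50.12 |
| Looped GDN | | | 12.06 | 66.43 | 33.89 | 52.62 | 70.27 | 61.48 | 36.4 | 70.5 | 54.13 | 55.74 |
| Looped KDA | | | 12.13 | 66.12 | 33.63 | 52.37 | 70.13 | 61.22 | 36.2 | 70.2 | 53.92 | 55.49 |
| LT2-sparse attention | | | | | | | | | | | | |
| Looped Window | | | 12.87 | 64.23 | 31.53 | 48.83 | 69.82 | 57.34 | 35.8 | 68.5 | 51.23 | 52.17 |
| Looped NSA | | | 12.30 | 65.57 | 32.74 | 51.43 | 70.04 | 60.32 | 36.0 | 69.5 | 53.13 | 54.84 |
| Looped DSA | | | 12.08 | 66.37 | 33.82 | 52.53 | 70.23 | 61.42 | 36.4 | 70.4 | 54.07 | 55.67 |
| Hybrid LT2 | | | | | | | | | | | | |
| Looped Hybrid (Full+Window) | | | 12.24 | 65.32 | 32.13 | 51.23 | 69.86 | 58.42 | 36.0 | 69.2 | 53.13 | 54.43 |
| Looped Hybrid (Full+DSA) | | | 12.20 | 65.53 | 32.34 | 51.42 | 70.04 | 58.63 | 36.2 | 69.4 | 53.32 | 54.62 |
| Looped Hybrid (Full+GDN) | | | 11.43 | 69.82 | 37.34 | 55.83 | 72.62 | 64.61 | 38.9 | 73.3 | 57.74 | 58.65 |
| Looped Hybrid (GDN+DSA) | | | 11.85 | 67.43 | 34.53 | 53.42 | 70.63 | 62.92 | 37.0 | 71.2 | 55.13 | 56.53 |
| 1.3B parameters / 100B tokens (4× Chinchilla ratio) | | | | | | | | | | | | |
| Transformer | | | 10.65 | 67.52 | 33.84 | 52.47 | 71.03 | 61.48 | 36.6 | 71.3 | 54.02 | 56.04 |
| Looped Transformer (ref) | | | 9.87 | 70.83 | 37.54 | 57.06 | 72.43 | 65.83 | 38.6 | 74.1 | 57.83 | 59.27 |
| LT2-linear attention | | | | | | | | | | | | |
| Looped Mamba2 | | | 10.30 | 69.47 | 36.63 | 55.94 | 72.68 | 64.37 | 38.2 | 73.0 | 57.03 | 58.43 |
| Looped GDN | | | 9.75 | 71.28 | 38.33 | 57.73 | 73.37 | 66.26 | 39.1 | 74.3 | 58.78 | 59.92 |
| Looped KDA | | | 9.68 | 71.57 | 38.62 | 57.99 | 73.53 | 66.42 | 39.3 | 74.6 | 58.98 | 60.14 |
| LT2-sparse attention | | | | | | | | | | | | |
| Looped Window | | | 10.42 | 68.43 | 35.47 | 54.87 | 71.32 | 63.23 | 36.9 | 71.7 | 55.87 | 57.23 |
| Looped NSA | | | 10.17 | 69.02 | 35.97 | 55.08 | 71.52 | 64.03 | 37.2 | 72.2 | 56.53 | 57.72 |
| Looped DSA | | | 9.97 | 69.93 | 36.93 | 56.38 | 71.94 | 64.87 | 37.7 | 72.9 | 57.42 | 58.54 |
| Hybrid LT2 | | | | | | | | | | | | |
| Looped Hybrid (Full+Window) | | | 9.84 | 70.93 | 37.12 | 56.68 | 73.12 | 64.34 | 38.8 | 73.3 | 58.56 | 59.13 |
| Looped Hybrid (Full+DSA) | | | 9.80 | 71.13 | 37.28 | 56.84 | 73.24 | 64.52 | 38.9 | 73.4 | 58.73 | 59.28 |
| Looped Hybrid (Full+GDN) | | | 9.12 | 74.82 | 41.63 | 61.04 | 75.93 | 69.52 | 41.3 | 75.4 | 62.04 | 62.89 |
| Looped Hybrid (GDN+DSA) | | | 9.50 | 72.44 | 39.33 | 58.84 | 73.98 | 67.13 | 39.7 | 74.9 | 59.77 | 60.73 |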
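The Chinchilla ratios quoted for the two scales in Table 1 can be reproduced with simple arithmetic. The sketch below assumes the common rule of thumb of roughly 20 training tokens per parameter; that constant is an assumption, since the exact figure behind the quoted ratios is not stated.

```python
def chinchilla_multiple(n_params, n_tokens, tokens_per_param=20):
    """Training tokens as a multiple of a compute-optimal budget.

    tokens_per_param=20 is the usual rule of thumb, assumed here.
    """
    return n_tokens / (tokens_per_param * n_params)


if __name__ == "__main__":
    for params in (0.6e9, 1.3e9):
        print(params, round(chinchilla_multiple(params, 100e9), 1))
```

Under that assumption, 100B tokens is about 8× the compute-optimal budget at 0.6B parameters and about 4× at 1.3B.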

Long-Context Evaluation

Long-context evaluation at 1.3B parameters covers two task types: knowledge benchmarks at 2048 tokens (SWDE, SQuAD, FDA, TriviaQA, NQ, DROP) and needle-in-a-haystack (NIAH) at 1024, 2048, and 4096 tokens. Models are pre-trained at 2048 tokens, so every 4k NIAH column requires extrapolation beyond the training length.

Table 2. Long-context evaluation. Bold = best per column; underline = second best.

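| Model | SWDE | SQuAD | FDA | TQA | NQ | DROP | NIAH-Single-1 1k | NIAH-Single-1 2k | NIAH-Single-1 4k | NIAH-Single-2 1k | NIAH-Single-2 2k | NIAH-Single-2 4k | NIAH-Single-3 1k | NIAH-Single-3 2k | NIAH-Single-3 4k |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Non-looped baselines | | | | | | | | | | | | | | | |
| Transformer | 48.9 | 46.6 | 58.4 | 67.5 | 31.7 | 26.4 | 100 | 100 | 0.0 | 92.2 | 100 | 0.0 | 98.6 | 99.4 | 0.0 |
| GDN | 32.7 | 40.0 | 28.3 | 63.5 | 25.7 | 24.5 | 100 | 100 | 99.8 | 100 | 93.8 | 49.8 | 83.8 | 68.4 | 34.2 |
| Mamba-2 | 30.7 | 39.1 | 23.7 | 64.3 | 25.1 | 28.5 | 100 | 99.6 | 62.0 | 100 | 53.8 | 11.8 | 95.8 | 87.4 | 13.4 |
| Looped variants | | | | | | | | | | | | | | | |
| Looped Transformer | 52.8 | 49.4 | 61.7 | 68.2 | 33.6 | 28.1 | 100 | 100 | 0.0 | 94.6 | 100 | 0.0 | 99.2 | 99.8 | 0.0 |
| Looped GDN | 34.9 | 41.8 | 30.6 | 64.7 | 27.0 | 25.9 | 100 | 100 | 99.8 | 100 | 96.4 | 53.2 | 85.6 | 71.0 | 35.8 |
| Looped Mamba-2 | 33.9 | 40.5 | 25.8 | 65.1 | 26.8 | 29.7 | 100 | 100 | 65.7 | 100 | 57.1 | 13.5 | 96.2 | 88.1 | 16.2 |
| LT2 hybrid variants | | | | | | | | | | | | | | | |
| Looped Hybrid (GDN+DSA) | 51.6 | 48.0 | 60.4 | 66.9 | 33.0 | 28.4 | 100 | 100 | 93.5 | 100 | 100 | 77.6 | 100 | 99.6 | 60.3 |
| Looped Hybrid (Full+GDN) | 53.1 | 48.9 | 62.0 | 67.8 | 34.0 | 30.2 | 100 | 100 | 93.5 | 100 | 100 | 81.0 | 99.8 | 99.8 | 63.7 |

The six knowledge columns (SWDE through DROP) are evaluated at a 2048-token context.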

A striking pattern emerges: the standard Looped Transformer scores well on knowledge tasks but fails entirely at NIAH beyond its training context (0.0 at 4k on all three tasks). Both LT2 hybrid variants extrapolate, scoring between 60.3 and 93.5 at 4k, with Full+GDN slightly ahead on NIAH-Single-2 and NIAH-Single-3. In GDN+DSA, which contains no full attention, GDN's fixed-size state and DSA's dynamic sparse cache together avoid the hard cutoff of the full-attention KV cache.


Training Stability Under Looping

A practical concern when sharing weights across loop iterations is optimization stability. The same block runs $T$ times — any pathology can compound. We track gradient norms and loss curves throughout pre-training at 1.3B parameters, 100B tokens.

Attention sinks compound across loops

In standard Transformers, softmax attention concentrates mass on "sink" tokens — typically the first token. In a looped model this is worse: the sink learned in loop $\tau$ is re-injected into loop $\tau\!+\!1$ rather than reset, causing a compounding sawtooth pattern. We fix this with a per-head sigmoid gate after SDPA, applied inside the shared block so gate weights are reused every iteration. It suppresses the sawtooth almost entirely and yields consistent downstream improvements.
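A gate of this kind can be sketched for a single token as follows. This is one plausible form, not necessarily the exact parameterization used: the gate here is a scalar per head computed from the token's input to the attention sub-layer, and `gate_weights`, `gate_biases`, and the function names are hypothetical. The `first_token_mass` helper mirrors the "First-tok." diagnostic, the mean attention weight placed on the first position.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def gated_head_outputs(head_outputs, x, gate_weights, gate_biases):
    """Scale each head's SDPA output by a sigmoid gate computed from the input x.

    head_outputs : list over heads of output vectors for one token
    x            : the token's input vector to the attention sub-layer
    gate_weights : one weight vector per head, same length as x (assumed form)
    gate_biases  : one bias per head
    """
    gated = []
    for out, w, b in zip(head_outputs, gate_weights, gate_biases):
        g = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)  # scalar in (0, 1)
        gated.append([g * o for o in out])
    return gated


def first_token_mass(attn_rows):
    """Mean attention weight on position 0, averaged over query rows."""
    return sum(row[0] for row in attn_rows) / len(attn_rows)


if __name__ == "__main__":
    print(gated_head_outputs([[2.0, 4.0]], [1.0, 1.0], [[0.0, 0.0]], [0.0]))
    print(first_token_mass([[1.0, 0.0], [0.5, 0.5]]))
```

Because the gate parameters live inside the shared block, the same weights are applied in every loop iteration, so a head can learn to close its gate rather than park attention mass on a sink token.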

Attention sink across loops. Unrolled diagnostics for the Looped Transformer ($T=4$, 24 layers). Dashed lines mark loop boundaries. (a) First-token attention mass forms a sawtooth intensifying each loop. (b) Max FFN-residual activation follows the same pattern. (c) Residual-stream RMS norm grows across depth and loops. The SDPA output gate (blue) flattens (a) and (b) and mitigates the cross-loop growth in (c).

Linear variants. Looped GDN has the smoothest loss and lowest gradient norms — better than the full-attention loop. RetNet, lacking both data-dependent gating and the delta rule, diverges entirely.

Hybrid variants. Both hybrids match the Looped Transformer from the start and pull slightly ahead, while gradient norms remain consistently smaller and spike-free.

Mixers with data-dependent gating and a delta rule (GDN and the hybrids that include it) train more stably under looping than vanilla full attention. The gate lets the recurrence forget stale state; the delta rule bounds updates to memory. Variants missing either ingredient train more noisily; the variant missing both (RetNet) diverges.

Table 3. Effect of the SDPA output gate on softmax-containing variants. "First-tok." is mean attention mass on token 1 — lower means less sink behavior.

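| Model | Output gate | PPL ↓ | Avg. ↑ | First-tok. ↓ |
|---|---|---|---|---|
| Looped Transformer | no | 9.87 | 59.27 | 0.51 |
| | yes | 9.82 | 59.69 | 0.04 |
| | Δ | −0.05 | +0.42 | −0.47 |
| Looped Hybrid (Full+GDN) | no | 9.31 | 61.39 | 0.38 |
| | yes | 9.28 | 61.66 | 0.05 |
| | Δ | −0.03 | +0.27 | −0.33 |
| Looped Hybrid (GDN+DSA) | no | 9.72 | 59.23 | 0.29 |
| | yes | 9.70 | 59.41 | 0.06 |
| | Δ | −0.02 | +0.18 | −0.23 |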

Hybrid Design Ablations

We ablate three design dimensions of the hybrid architecture at 1.3B / $T=4$ / 100B tokens: how much full attention to mix in (ratio), where to place it (pattern), and along which axis (level).

Table 4. Hybrid LT2 ablations. Cream rows mark the best per group.

| Configuration | Full:GDN | Pattern / Schedule | PPL ↓ | Avg. ↑ |
|---|---|---|---|---|
| (1) Hybrid ratio: depth-interleaved | | | | |
| Looped Transformer (ref) | 1:0 | | 9.87 | 59.27 |
| Hybrid 1:1 | 1:1 | interleave | 9.41 | 60.92 |
| Hybrid 1:4 (default) | 1:4 | interleave | 9.31 | 61.39 |
| Hybrid 1:6 | 1:6 | interleave | 9.36 | 61.07 |
| Hybrid 1:12 | 1:12 | interleave | 9.74 | 59.51 |
| Looped GDN | 0:1 | | 10.02 | 58.42 |
| (2) Hybrid pattern: ratio fixed at 1:4, depth-level | | | | |
| Bookend | 1:4 | Full at top & bottom, GDN in middle | 9.27 | 61.52 |
| Interleave (default) | 1:4 | every 5th layer is Full | 9.31 | 61.39 |
| Front-loaded | 1:4 | all Full layers at bottom of stack | 9.45 | 60.61 |
| Back-loaded | 1:4 | all Full layers at top of stack | 9.53 | 60.43 |
| (3) Hybridization level: matched parameters | | | | |
| Random sample + majority vote (K=5) | 1:4 | resample 1/5 Full per step; vote at eval | 9.26 | 61.55 |
| Depth-level (default) | 1:4 | per-layer Full/GDN interleave | 9.31 | 61.39 |
| Loop-level coarse→fine | | Full → SWA-512 → SWA-256 → SWA-128 | 9.36 | 60.71 |
| Loop-level fine→coarse | | SWA-128 → SWA-256 → SWA-512 → Full | 9.42 | 61.10 |

A 1:4 (Full:GDN) ratio is the sweet spot. Raising the full-attention share to 1:1 costs more attention compute yet gives slightly worse perplexity (9.41 vs. 9.31), while thinning it to 1:12 gives up most of the gain and lands only marginally ahead of the Looped Transformer baseline (9.74 vs. 9.87 PPL). Pattern matters less than ratio: bookend slightly edges out interleave, suggesting that full attention is most useful at the input and output of the block. Depth-level mixing scores better than both loop-level schedules (9.31 vs. 9.36 and 9.42 PPL) and is simpler to implement.


You Don't Need to Train from Scratch

A pre-trained full-attention Looped Transformer (Ouro-1.4B) can be converted into an LT2-hybrid model through a three-stage distillation recipe. The recipe keeps the embeddings, FFN, and norm parameters, replaces the attention layers with GDN, and then restores a small fraction of full-attention layers.

The three-stage recipe

Stage 1 — Linear pre-alignment (100M tokens). With all attention replaced by GDN, align each GDN block to its teacher's attention output via MSE on the residual stream. This warm start avoids gradient instability from random initialization.

Stage 2 — Hybrid logit distillation (600M tokens). Restore the 6 most important full-attention layers (KL-guided selection), then distill on teacher logits. The per-loop KL weight schedule is a new design knob for the looped setting:

$$\mathcal{L}_{\mathrm{KD}} = \sum_{\tau=1}^{T} w_t^{(\tau)}\, \mathrm{KL}\!\left(\sigma_{\mathrm{top}\text{-}k}\!\bigl(z_\mathcal{T}^{(\tau)}/T_{\!\mathrm{kd}}\bigr) \,\Big\|\, \sigma_{\mathrm{top}\text{-}k}\!\bigl(z_\mathcal{S}^{(\tau)}/T_{\!\mathrm{kd}}\bigr)\right).$$

We progressively warm up per-loop supervision, run uniform weights across loops, then switch to final-output supervision only. Per-loop supervision gives a more stable gradient signal and especially improves multi-key retrieval.

Stage 3 — Long-context continuation (600M tokens at 32k length). Continue training on long reasoning sequences. Progressive length expansion is essential — jumping directly to 32k degrades long-context performance.
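The Stage 2 objective can be sketched for a single token position. Two details below are assumptions made for illustration rather than confirmed parts of the recipe: the top-$k$ support is taken from the teacher's logits at each loop and both distributions are renormalized over it, and `loop_weights` is a hypothetical three-phase schedule (ramp, uniform, final-loop only) with made-up phase lengths.

```python
import math


def topk_softmax(logits, indices, temperature):
    """Softmax restricted to `indices` after dividing logits by the temperature."""
    scaled = [logits[i] / temperature for i in indices]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def per_loop_kd_loss(teacher_logits, student_logits, weights, k=2, temperature=1.0):
    """sum over loops of w * KL(teacher_topk || student_topk) for one token.

    teacher_logits, student_logits: one logit vector per loop iteration.
    The top-k support comes from the teacher at each loop (an assumption).
    """
    total = 0.0
    for z_t, z_s, w in zip(teacher_logits, student_logits, weights):
        if w == 0.0:
            continue
        top = sorted(range(len(z_t)), key=lambda i: z_t[i], reverse=True)[:k]
        p = topk_softmax(z_t, top, temperature)
        q = topk_softmax(z_s, top, temperature)
        total += w * sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return total


def loop_weights(step, n_loops, warmup_steps, uniform_steps):
    """Hypothetical schedule: ramp intermediate loops up, hold uniform, then final loop only."""
    if step < warmup_steps:
        ramp = (step + 1) / warmup_steps
        return [ramp] * (n_loops - 1) + [1.0]
    if step < warmup_steps + uniform_steps:
        return [1.0] * n_loops
    return [0.0] * (n_loops - 1) + [1.0]


if __name__ == "__main__":
    teacher = [[2.0, 1.0, 0.0], [0.5, 3.0, 1.0]]
    student = [[0.0, 1.0, 2.0], [0.5, 3.0, 1.0]]
    print(per_loop_kd_loss(teacher, student, loop_weights(0, 2, 10, 10)))
```

Setting all but the last weight to zero recovers final-output-only supervision, which makes the per-loop schedule a strict generalization of the usual logit-distillation loss.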

Capability retention across tasks after distillation. Our method (Ouro-Hybrid, layer-selection) retains significantly more of the teacher's capability compared to a uniform interleave baseline and the previous state-of-the-art distillation recipe.

RULER subtask breakdown. Per-loop supervision provides a more stable gradient signal than final-output-only supervision, with the largest gains on multi-key retrieval — the task requiring the most complex state tracking across loops.

Limitations and Open Questions

Two directions remain unexplored. First, we study depth-level hybridization and simple loop-level schedules but not full loop-level hybridization — where different iterations use fundamentally distinct attention families. Our ablations suggest loop-level mixing underperforms depth-level in current settings, but smarter schedules may exist.

Second, we do not design explicit cross-loop state-carry mechanisms. Loop iterations currently communicate only through the residual stream. A principled recurrent state passed explicitly across loop boundaries could further improve long-context modeling and compute efficiency — especially for linear attention variants where the recurrent state is well-defined.


BibTeX

@misc{deng2026lt2lineartimeloopedtransformers,
  title         = {{LT2}: Linear-Time Looped Transformers},
  author        = {Chunyuan Deng and Yizhe Zhang and Rui-Jie Zhu and Yuanyuan Xu and Jiarui Liu and T. S. Eugene Ng and Hanjie Chen},
  year          = {2026},
  eprint        = {2605.20670},
  archivePrefix = {arXiv},
  primaryClass  = {cs.LG},
  url           = {https://arxiv.org/abs/2605.20670},
}