
Transformers Are (Naively) Looped Transformers, Horizontally

Weight sharing across positions makes the Transformer a time-recurrent loop — and TTT is its learned-optimizer cousin

Chunyuan Deng · May 2026


TL;DR

The standard Transformer is position-invariant: it applies the same weight matrices at every sequence position. This makes it a horizontally looped (time-recurrent) architecture — not just a deep one. The natural generalization is to assign a distinct weight matrix to every position. Test-Time Training (TTT) is exactly this generalization realized through gradient descent: the hidden state stores a fast-weight model that is updated position by position, giving per-position weights without storing a separate matrix for each step.

I. The Standard Transformer, Written Carefully

Fix a sequence of $T$ tokens. After embedding, we have

$$X \in \mathbb{R}^{T \times d}$$

where $T$ is the sequence length and $d$ is the model dimension. Denote the $t$-th row as $x_t \in \mathbb{R}^d$.

A single Transformer layer applies the same projection matrices $W_Q, W_K, W_V, W_O \in \mathbb{R}^{d \times d}$ to every position:

Self-Attention at position $t$
$$q_t = x_t W_Q, \quad k_t = x_t W_K, \quad v_t = x_t W_V \qquad q_t, k_t, v_t \in \mathbb{R}^d$$

$$a_t = \sum_{s \leq t} \frac{\exp(q_t \cdot k_s / \sqrt{d})}{\sum_{s' \leq t} \exp(q_t \cdot k_{s'} / \sqrt{d})}\, v_s \qquad a_t \in \mathbb{R}^d$$

$$o_t = a_t W_O \qquad o_t \in \mathbb{R}^d$$

The crucial observation: $W_Q, W_K, W_V, W_O$ carry no subscript $t$. Every position $t \in \{1, \ldots, T\}$ is processed by the identical linear maps. This is not a coincidence — it is a deliberate design choice called parameter sharing across time.
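To make the sharing concrete, here is a minimal NumPy sketch of the single-head causal attention written above, once as an explicit loop over positions and once in batched form. The function names, the random inputs, and the small sizes are illustrative assumptions, not part of any reference implementation.

```python
import numpy as np

def causal_attention_loop(X, W_Q, W_K, W_V, W_O):
    """Single-head causal self-attention as an explicit loop over positions."""
    T, d = X.shape
    out = np.zeros((T, d))
    for t in range(T):
        # The same four matrices are used at every position t.
        q_t = X[t] @ W_Q
        K = X[: t + 1] @ W_K              # keys for positions s <= t
        V = X[: t + 1] @ W_V              # values for positions s <= t
        scores = K @ q_t / np.sqrt(d)
        scores -= scores.max()            # stabilise the softmax
        weights = np.exp(scores)
        weights /= weights.sum()
        a_t = weights @ V
        out[t] = a_t @ W_O
    return out

def causal_attention_batched(X, W_Q, W_K, W_V, W_O):
    """The same computation for all positions at once, using a causal mask."""
    T, d = X.shape
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(d)
    mask = np.tril(np.ones((T, T), dtype=bool))   # position t sees s <= t
    scores = np.where(mask, scores, -np.inf)
    scores -= scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return (weights @ V) @ W_O

# Illustrative sizes and random data (assumptions for the demo).
rng = np.random.default_rng(0)
T, d = 6, 4
X = rng.normal(size=(T, d))
W_Q, W_K, W_V, W_O = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(4))

O_loop = causal_attention_loop(X, W_Q, W_K, W_V, W_O)
O_batch = causal_attention_batched(X, W_Q, W_K, W_V, W_O)
```

Both versions compute the same outputs. The loop form simply makes it visible that nothing inside the loop body depends on `t` except the data it reads.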

The Weight-Sharing Fact

In a standard Transformer with $L$ layers, the total number of weight parameters is $O(L \cdot d^2)$, independent of sequence length $T$. The same $d^2$ parameters are reused at every one of the $T$ positions within each layer.

II. Transformers as Horizontally Looped Architectures

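Think of the computation graph of a single Transformer layer. It has two axes: depth (the layer index $\ell$, the vertical axis) and time (the position index $t$, the horizontal axis).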

Because $W_Q, W_K, W_V, W_O$ are shared across the time axis, the Transformer is a loop unrolled in time:

Transformer as a Horizontal Loop

For each layer $\ell \in \{1,\ldots,L\}$ and position $t \in \{1,\ldots,T\}$:

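$$h_t^{(\ell)} = f_{\theta_\ell}\!\left(h_t^{(\ell-1)},\; \{h_s^{(\ell-1)}\}_{s \leq t}\right)$$

where $f_{\theta_\ell}$ is the same function (same $\theta_\ell = \{W_Q, W_K, W_V, W_O, W_{\text{FF}}\}$ of layer $\ell$) for all $t$. The parameters may differ from layer to layer, but never from position to position; below, $f_\theta$ denotes this map at a fixed layer.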

Along the time axis, this has exactly the structure of a loop. The Transformer is not recurrent in the traditional RNN sense (it does not pass a hidden state vector from step to step), but it is parameter-recurrent: the same weights are applied at every time step.

Concretely, consider the per-position update at a fixed layer:

$$h_t \leftarrow f_\theta(h_t,\; \text{context up to } t), \quad t = 1, 2, \ldots, T$$

This is a loop over $T$ steps, each executing $f_\theta$. The Transformer is therefore a naively looped Transformer — naive in the sense that the loop body $f_\theta$ never changes across iterations.

III. The General, Non-Shared Version

The natural generalization removes the weight-sharing constraint: assign a distinct set of parameters to each position.

Per-Position (Non-Shared) Transformer
$$q_t = x_t W_Q^{(t)}, \quad k_t = x_t W_K^{(t)}, \quad v_t = x_t W_V^{(t)} \qquad W_Q^{(t)}, W_K^{(t)}, W_V^{(t)} \in \mathbb{R}^{d \times d}$$

Here the superscript $(t)$ indicates that each position $t$ has its own projection matrices. The total parameter count is now $O(T \cdot d^2)$ per layer — linear in sequence length.

For $T = 1{,}000{,}000$ and $d = 4{,}096$, a single $d \times d$ matrix per position already amounts to $T \cdot d^2 \approx 1.7 \times 10^{13}$ parameters per layer, which is obviously infeasible to store explicitly. Yet this is conceptually the most general and most expressive model: position 1 should arguably use different processing logic than position 1,000,000.
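The figure can be checked with plain integer arithmetic. The sketch below counts one $d \times d$ matrix per position; separate query, key, and value matrices would triple the total. The helper names are illustrative.

```python
def params_shared(d):
    """Parameters in one d x d matrix reused at every position."""
    return d * d

def params_per_position(T, d):
    """Parameters when every one of the T positions owns its own d x d matrix."""
    return T * d * d

T, d = 1_000_000, 4_096
shared = params_shared(d)
per_position = params_per_position(T, d)

print(f"shared:       {shared:,}")
print(f"per-position: {per_position:.2e}")
```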

The standard Transformer is the special case $W^{(1)} = W^{(2)} = \cdots = W^{(T)} = W$: a single shared matrix used for all positions. Positional encodings are the only mechanism that breaks this symmetry, but they act on the inputs, not the weights.
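A small NumPy sketch of this special case, assuming the per-position matrices are stacked into one array of shape `(T, d, d)` (the name `project_per_position` and the sizes are hypothetical): tying all slices to the same matrix recovers the ordinary shared projection.

```python
import numpy as np

def project_per_position(X, W_stack):
    """Compute q_t = x_t W^{(t)} for every t; W_stack has shape (T, d, d)."""
    return np.einsum("ti,tij->tj", X, W_stack)

rng = np.random.default_rng(0)
T, d = 5, 3
X = rng.normal(size=(T, d))

# General model: a freely chosen matrix at every position.
W_stack = rng.normal(size=(T, d, d))
Q_general = project_per_position(X, W_stack)

# Standard Transformer: W^{(1)} = ... = W^{(T)} = W.
W = rng.normal(size=(d, d))
W_tied = np.broadcast_to(W, (T, d, d))   # a view; no extra storage
Q_tied = project_per_position(X, W_tied)
```

Here `Q_tied` equals `X @ W`, while `Q_general` needs `T * d * d` stored numbers to define.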

Standard Transformer

Weights: $W^{(t)} = W$ for all $t$

Parameters: $O(d^2)$ per layer

Position awareness: via positional encodings on inputs

Expressivity: same function applied everywhere

Per-Position (General) Model

Weights: $W^{(t)}$ distinct for each $t$

Parameters: $O(T \cdot d^2)$ per layer

Position awareness: baked into the weights themselves

Expressivity: different function at every position

The gap between these two extremes is enormous. Can we find a tractable middle ground that achieves position-dependent processing without storing $T$ full weight matrices?

IV. TTT: Horizontal Looping with Learned Gradient Steps

Test-Time Training (TTT) [Sun et al., 2024] closes this gap by replacing the static weight $W$ with a fast-weight model $W_t$ that is updated at each position via gradient descent. The key idea is that instead of storing a per-position weight matrix, we generate it on the fly by running a small optimization.

Setup: The Fast-Weight Model

At each position $t$, TTT maintains a weight matrix $W_t \in \mathbb{R}^{d \times d}$ (or a small neural network parameterized by $W_t$). The hidden state of the sequence model is $W_t$.

For each incoming token $x_t \in \mathbb{R}^d$, TTT constructs a self-supervised task:

TTT Self-Supervised Loss at Position $t$
$$\mathcal{L}_t(W) = \left\| W \cdot k_t - v_t \right\|^2$$

where $k_t = x_t W_K \in \mathbb{R}^d$ is the key and $v_t = x_t W_V \in \mathbb{R}^d$ is the target value, with $W_K, W_V \in \mathbb{R}^{d \times d}$ being outer (slow) weights shared across all positions. Inside the loss, $k_t$ and $v_t$ are treated as column vectors.

This is a simple least-squares regression objective: the fast-weight matrix is asked to map the current key to the current value.

The Recurrent Update Rule

The weight is updated by a gradient step:

TTT Update (one gradient step)
$$W_t = W_{t-1} - \eta \,\nabla_W \tfrac{1}{2}\mathcal{L}_t(W_{t-1})$$

$$= W_{t-1} - \eta \left(W_{t-1} k_t - v_t\right) k_t^\top$$

where $\eta > 0$ is a learned step size (scalar or per-parameter). The factor $\tfrac{1}{2}$ cancels the 2 that comes from differentiating the squared norm (equivalently, that constant is absorbed into $\eta$), so $\nabla_W \tfrac{1}{2}\mathcal{L}_t(W) = (Wk_t - v_t)k_t^\top \in \mathbb{R}^{d \times d}$ is the outer product of the residual and the key.

The output at position $t$ is then read out by querying the updated model:

$$z_t = W_t \cdot q_t \qquad z_t \in \mathbb{R}^d$$

where $q_t = x_t W_Q \in \mathbb{R}^d$ is the query, again using a shared slow weight $W_Q \in \mathbb{R}^{d \times d}$.
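The update and read-out above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: a linear fast-weight model, a fixed scalar step size `eta` in place of a learned one, $W_0 = 0$, and random slow weights. It leaves out any further components a full TTT layer may use, and the function name `ttt_linear_forward` is hypothetical.

```python
import numpy as np

def ttt_linear_forward(X, W_Q, W_K, W_V, W0, eta):
    """Run the one-step-per-token fast-weight update and read out z_t = W_t q_t."""
    T, d = X.shape
    W = W0.copy()
    Z = np.zeros((T, d))
    fast_weights = []                      # kept only for inspection
    for t in range(T):
        q_t, k_t, v_t = X[t] @ W_Q, X[t] @ W_K, X[t] @ W_V
        residual = W @ k_t - v_t           # W_{t-1} k_t - v_t
        W = W - eta * np.outer(residual, k_t)
        Z[t] = W @ q_t                     # query the updated model
        fast_weights.append(W.copy())
    return Z, fast_weights

# Illustrative sizes, data, and step size (assumptions for the demo).
rng = np.random.default_rng(0)
T, d, eta = 8, 4, 0.01
X = rng.normal(size=(T, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
W0 = np.zeros((d, d))

Z, Ws = ttt_linear_forward(X, W_Q, W_K, W_V, W0, eta)
```

Only the current `W` is needed to continue the loop; the list `Ws` exists here purely to show that a different matrix is in effect at every position.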

Why This is Per-Position Weights

After $t$ gradient steps, the fast weight $W_t$ encodes the accumulated gradient information from all positions $1, \ldots, t$:

$$W_t = W_0 - \eta \sum_{s=1}^{t} \left(W_{s-1} k_s - v_s\right) k_s^\top$$

Each $W_t$ is distinct — it depends on the entire history $(x_1, \ldots, x_t)$. This is precisely the per-position weight matrix $W^{(t)}$ from the general model, but expressed implicitly through gradient accumulation rather than stored explicitly.
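A short worked expansion makes the history dependence explicit. Rearranging the update rule gives

$$W_t = W_{t-1} - \eta \left(W_{t-1} k_t - v_t\right) k_t^\top = W_{t-1}\left(I - \eta\, k_t k_t^\top\right) + \eta\, v_t k_t^\top$$

Assuming, purely for illustration, that $W_0 = 0$, the first two steps are

$$W_1 = \eta\, v_1 k_1^\top, \qquad W_2 = \eta\, v_1 k_1^\top \left(I - \eta\, k_2 k_2^\top\right) + \eta\, v_2 k_2^\top$$

so $W_2$ already mixes information from both tokens. The same rearrangement shows what one step does to the residual on the current pair:

$$W_t k_t - v_t = \left(1 - \eta \|k_t\|^2\right)\left(W_{t-1} k_t - v_t\right)$$

which shrinks in norm whenever $0 < \eta \|k_t\|^2 < 2$.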

TTT = Horizontal Looping + Gradient Descent as the Loop Body

The standard Transformer loops the same static function $f_W$ over positions. TTT loops a function whose weights change at each step via gradient descent. The loop body is not $f_W$ but $f_{W_t}$, and $W_t$ evolves by a local optimization step at each position.

This gives TTT position-dependent weights without the $O(T \cdot d^2)$ storage cost. The trade-off: the per-position weight is determined by a fixed optimization trajectory (gradient descent from $W_0$), not a freely learned mapping.

V. The LocoProp Connection

The TTT update is more than vanilla gradient descent — it is closely related to LocoProp [Amid et al., 2022], a local propagation algorithm for training neural networks layer by layer.

LocoProp decomposes the global training objective into local per-layer targets. Each layer is trained to minimize:

LocoProp Local Objective
$$\mathcal{L}^{\text{loco}}_\ell(W_\ell) = \left\| W_\ell \cdot h_{\ell-1} - \hat{h}_\ell \right\|^2 + \lambda \left\| W_\ell - W_\ell^{\text{prev}} \right\|^2_F$$

where $h_{\ell-1} \in \mathbb{R}^d$ is the input to layer $\ell$, $\hat{h}_\ell \in \mathbb{R}^d$ is the local target (a detached signal from the next layer's gradient), $W_\ell^{\text{prev}}$ is the previous iterate, and $\|\cdot\|_F$ denotes the Frobenius norm.

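TTT's self-supervised loss $\mathcal{L}_t(W) = \|W k_t - v_t\|^2$ has the same form as the first (regression) term of this objective, with the key $k_t$ playing the role of the layer input and the value $v_t$ as the local target. The proximity term has no explicit counterpart in $\mathcal{L}_t$; its role is played implicitly by taking a single gradient step from $W_{t-1}$ instead of solving the regression exactly. In this sense, TTT is LocoProp applied horizontally across time rather than vertically across depth.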
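One standard way to relate a single gradient step to a proximity penalty, offered here as an analogy and not as a statement of LocoProp's exact update: for any step size $\alpha > 0$ (a generic symbol introduced for this sketch), the gradient step is the minimizer of the linearized loss plus a squared-distance penalty,

$$W_{t-1} - \alpha \nabla_W \mathcal{L}_t(W_{t-1}) = \arg\min_W \; \left\langle \nabla_W \mathcal{L}_t(W_{t-1}),\; W - W_{t-1} \right\rangle + \frac{1}{2\alpha} \left\| W - W_{t-1} \right\|_F^2$$

Setting the derivative of the right-hand side with respect to $W$ to zero gives

$$\nabla_W \mathcal{L}_t(W_{t-1}) + \frac{1}{\alpha}\left(W - W_{t-1}\right) = 0 \quad\Longrightarrow\quad W = W_{t-1} - \alpha \nabla_W \mathcal{L}_t(W_{t-1})$$

Comparing with the LocoProp local objective, $\frac{1}{2\alpha}$ plays the role of $\lambda$ and $W_{t-1}$ the role of $W_\ell^{\text{prev}}$, with the regression term linearized around the previous iterate instead of kept exact.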

LocoProp (Vertical)

Axis: depth (layer $\ell$)

Target: signal from adjacent layer

Update: local regression per layer

Purpose: credit assignment without full backprop

TTT (Horizontal)

Axis: time (position $t$)

Target: value $v_t$ at current position

Update: local regression per position

Purpose: per-position weight adaptation

VI. Connecting Everything: A Unified View

We can now arrange these architectures on a spectrum of how much weight sharing they enforce along the time axis:

Spectrum of Time-Axis Weight Sharing

Full sharing (Standard Transformer): $W^{(t)} = W$ for all $t \in \{1,\ldots,T\}$. One matrix, reused everywhere. Cheapest, but position-invariant.

Gradient-accumulated sharing (TTT): $W^{(t)} = W_0 - \eta\sum_{s \leq t} g_s$, where $g_s$ is the gradient at step $s$. Position-dependent, but constrained to a gradient descent trajectory. $O(d^2)$ hidden state.

No sharing (Per-Position Model): $W^{(t)}$ freely chosen for each $t$. Most expressive, but $O(T \cdot d^2)$ parameters — infeasible for long sequences.

The standard Transformer's position-invariance is its defining limitation at long context. When processing position $t = 1{,}000{,}000$, the model uses the exact same weight matrices as at position $t = 1$. The only differentiation comes from positional encodings on the inputs and from the attention pattern over context, not from the processing function itself.

TTT breaks this invariance in a principled, memory-efficient way. The fast weight $W_t$ at position $t$ encodes a compressed summary of all keys and values seen so far, updated via a gradient step that costs only $O(d^2)$ per position. That matches the cost of the per-position projections in a standard layer and, unlike attention over the context (which costs $O(t \cdot d)$ at position $t$), does not grow with $t$.
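To put rough numbers on the memory side, the sketch below compares the size of the fast-weight state with the keys and values that causal attention keeps for its context. It assumes a single layer, a single head with $d$-dimensional keys and values, and counts stored numbers, not bytes; the helper names are illustrative.

```python
def ttt_state_size(d):
    """Numbers stored in one d x d fast-weight matrix (independent of t)."""
    return d * d

def kv_cache_size(t, d):
    """Numbers stored for t cached keys and t cached values of dimension d."""
    return 2 * t * d

d = 4_096
for t in (1_000, 2_048, 1_000_000):
    ratio = kv_cache_size(t, d) / ttt_state_size(d)
    print(f"t = {t:>9,}: cached keys/values are {ratio:.2f}x the fast-weight state")
```

Under these assumptions the two are equal at $t = d/2$; beyond that point the cached keys and values keep growing linearly in $t$ while the fast-weight state stays fixed.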

VII. Summary

References