Weight sharing across positions makes the Transformer a time-recurrent loop — and TTT is its learned-optimizer cousin
Fix a sequence of $T$ tokens. After embedding, we have
$$X \in \mathbb{R}^{T \times d}$$
where $T$ is the sequence length and $d$ is the model dimension. Denote the $t$-th row as $x_t \in \mathbb{R}^d$.
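A single Transformer layer applies the same projection matrices $W_Q, W_K, W_V, W_O \in \mathbb{R}^{d \times d}$ to every position:
$$q_t = x_t W_Q, \qquad k_t = x_t W_K, \qquad v_t = x_t W_V$$
with $W_O$ applied in the same way to the attention output at each position.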
The crucial observation: $W_Q, W_K, W_V, W_O$ carry no subscript $t$. Every position $t \in \{1, \ldots, T\}$ is processed by the identical linear maps. This is not a coincidence — it is a deliberate design choice called parameter sharing across time.
In a standard Transformer with $L$ layers, the total number of weight parameters is $O(L \cdot d^2)$, independent of sequence length $T$. The same $d^2$ parameters are reused at every one of the $T$ positions within each layer.
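As a minimal sketch of this sharing (NumPy; the helper names `project_all_positions` and `layer_param_count` are hypothetical and only for illustration), the same three matrices are applied to every row of $X$, and the parameter count does not change when $T$ does:

```python
import numpy as np

def project_all_positions(X, W_Q, W_K, W_V):
    """Apply the same three projections to every row (position) of X."""
    return X @ W_Q, X @ W_K, X @ W_V

def layer_param_count(d):
    """Parameters in W_Q, W_K, W_V, W_O: four d x d matrices, with no T anywhere."""
    return 4 * d * d

rng = np.random.default_rng(0)
d = 8
W_Q, W_K, W_V = (rng.standard_normal((d, d)) for _ in range(3))

for T in (4, 16):
    X = rng.standard_normal((T, d))
    Q, K, V = project_all_positions(X, W_Q, W_K, W_V)
    # Row t of Q is x_t @ W_Q: the identical map at every position.
    assert np.allclose(Q[T - 1], X[T - 1] @ W_Q)
    print(T, Q.shape, layer_param_count(d))
```

Only the shape of the activations grows with $T$; the weights are untouched.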
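Think of the computation graph of a single Transformer layer stack. It has two axes: depth, indexed by the layer $\ell \in \{1,\ldots,L\}$, and time, indexed by the position $t \in \{1,\ldots,T\}$.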
Because $W_Q, W_K, W_V, W_O$ are shared across the time axis, the Transformer is a loop unrolled in time:
For each layer $\ell \in \{1,\ldots,L\}$ and position $t \in \{1,\ldots,T\}$:
$$h_t^{(\ell)} = f_\theta\!\left(h_t^{(\ell-1)},\; \{h_s^{(\ell-1)}\}_{s \leq t}\right)$$
where $f_\theta$ is the same function (same $\theta = \{W_Q, W_K, W_V, W_O, W_{\text{FF}}\}$) for all $t$.
This is a loop along the time axis. It is not recurrent in the traditional RNN sense: no hidden state vector is passed from step to step, and each position instead reads the earlier positions through attention. It is, however, parameter-recurrent: the same weights $\theta$ are applied at every time step.
Concretely, consider the per-position update at a fixed layer:
$$h_t \leftarrow f_\theta(h_t,\; \text{context up to } t), \quad t = 1, 2, \ldots, T$$
This is a loop over $T$ steps, each executing $f_\theta$. The Transformer is therefore a naively looped Transformer, naive in the sense that the loop body $f_\theta$ never changes across iterations.
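The loop view can be checked numerically. The sketch below is a simplification made for illustration, not a full Transformer layer: it assumes a single attention head with a residual connection, and omits layer normalization and the feed-forward block. It writes the layer once as an explicit loop over $t$ with a fixed loop body `f_theta`, and once in the usual masked, vectorized form.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def f_theta(t, H, theta):
    """Loop body: update position t from rows 0..t of H using shared weights."""
    W_Q, W_K, W_V, W_O = theta
    q = H[t] @ W_Q                       # query for position t
    K = H[: t + 1] @ W_K                 # keys for the context s <= t
    V = H[: t + 1] @ W_V                 # values for the context s <= t
    weights = softmax(K @ q / np.sqrt(q.shape[0]))
    return H[t] + (weights @ V) @ W_O    # residual connection (assumed)

def layer_as_loop(H, theta):
    out = np.empty_like(H)
    for t in range(H.shape[0]):          # the loop over time
        out[t] = f_theta(t, H, theta)    # same theta at every iteration
    return out

def layer_vectorized(H, theta):
    W_Q, W_K, W_V, W_O = theta
    T, d = H.shape
    scores = (H @ W_Q) @ (H @ W_K).T / np.sqrt(d)
    causal = np.tril(np.ones((T, T), dtype=bool))
    scores = np.where(causal, scores, -np.inf)
    scores = scores - scores.max(axis=1, keepdims=True)
    A = np.exp(scores)
    A = A / A.sum(axis=1, keepdims=True)
    return H + (A @ (H @ W_V)) @ W_O

rng = np.random.default_rng(0)
T, d = 6, 4
H = rng.standard_normal((T, d))
theta = tuple(rng.standard_normal((d, d)) for _ in range(4))
assert np.allclose(layer_as_loop(H, theta), layer_vectorized(H, theta))
```

The vectorized form is what is run in practice; the loop form only makes explicit that `theta` is never indexed by `t`.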
The natural generalization removes the weight-sharing constraint: assign a distinct set of parameters to each position.
$$q_t = x_t W_Q^{(t)}, \qquad k_t = x_t W_K^{(t)}, \qquad v_t = x_t W_V^{(t)}$$
Here the superscript $(t)$ indicates that each position $t$ has its own projection matrices. The total parameter count is now $O(T \cdot d^2)$ per layer — linear in sequence length.
For $T = 1{,}000{,}000$ and $d = 4{,}096$, this is $T \cdot d^2 \approx 1.7 \times 10^{13}$ parameters for each projection matrix in a layer (roughly four times that for all of $W_Q, W_K, W_V, W_O$), which is infeasible to store explicitly. Yet this is the conceptually most expressive model: position 1 should arguably use different processing logic than position 1,000,000.
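The arithmetic behind that figure is short enough to spell out. The variable names below are illustrative only:

```python
T = 1_000_000
d = 4_096

per_matrix_shared = d * d                  # one d x d matrix reused at all positions
per_matrix_unshared = T * d * d            # one d x d matrix for every position
all_four_unshared = 4 * per_matrix_unshared  # W_Q, W_K, W_V, W_O

print(f"shared, one matrix:        {per_matrix_shared:.3e}")
print(f"per-position, one matrix:  {per_matrix_unshared:.3e}")
print(f"per-position, four:        {all_four_unshared:.3e}")
```

The per-position count exceeds the shared count by exactly a factor of $T$.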
The standard Transformer is the special case $W^{(1)} = W^{(2)} = \cdots = W^{(T)} = W$: a single shared matrix used for all positions. Positional encodings break this symmetry, but they act on the inputs, not the weights.
Standard Transformer (full sharing):
Weights: $W^{(t)} = W$ for all $t$
Parameters: $O(d^2)$ per layer
Position awareness: via positional encodings on inputs
Expressivity: same function applied everywhere

Per-position model (no sharing):
Weights: $W^{(t)}$ distinct for each $t$
Parameters: $O(T \cdot d^2)$ per layer
Position awareness: baked into the weights themselves
Expressivity: different function at every position
The gap between these two extremes is enormous. Can we find a tractable middle ground that achieves position-dependent processing without storing $T$ full weight matrices?
Test-Time Training (TTT) [Sun et al., 2024] offers such a middle ground by replacing the static weight $W$ with a fast-weight model $W_t$ that is updated at each position via gradient descent. Instead of storing a per-position weight matrix, TTT generates it on the fly by running one step of a small optimization.
At each position $t$, TTT maintains a weight matrix $W_t \in \mathbb{R}^{d \times d}$ (or a small neural network parameterized by $W_t$). The hidden state of the sequence model is $W_t$.
For each incoming token $x_t \in \mathbb{R}^d$, TTT constructs a self-supervised task:
$$\mathcal{L}_t(W) = \tfrac{1}{2}\,\|W k_t - v_t\|^2$$
where $k_t = x_t W_K \in \mathbb{R}^d$ is the key and $v_t = x_t W_V \in \mathbb{R}^d$ is the target value, with $W_K, W_V \in \mathbb{R}^{d \times d}$ being outer (slow) weights shared across all positions.
This is a simple least-squares regression objective: the fast-weight matrix $W_t$ is asked to map the current key to the current value.
The weight is updated by a gradient step:
$$W_t = W_{t-1} - \eta\, \nabla_W \mathcal{L}_t(W_{t-1})$$
where $\eta > 0$ is a learned step size (scalar or per-parameter), and the gradient $\nabla_W \mathcal{L}_t(W) = (Wk_t - v_t)k_t^\top \in \mathbb{R}^{d \times d}$ is the outer product of the residual and the key.
The output at position $t$ is then read out by querying the updated model:
$$z_t = W_t \cdot q_t \qquad z_t \in \mathbb{R}^d$$
where $q_t = x_t W_Q \in \mathbb{R}^d$ is the query, again using a shared slow weight $W_Q \in \mathbb{R}^{d \times d}$.
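The update and readout described above can be sketched as a short loop. This follows only the formulas in this section, with a linear fast-weight model and a fixed scalar step size `eta` standing in for the learned one; it is not the implementation from Sun et al., and the function name `ttt_linear_forward` is hypothetical.

```python
import numpy as np

def ttt_linear_forward(X, W_Q, W_K, W_V, W0, eta):
    """Per-position gradient step and readout for a linear fast-weight model.

    Returns the outputs Z (T x d) and the list of fast weights W_1..W_T.
    """
    W = W0.copy()
    Z = np.empty_like(X)
    fast_weights = []
    for t in range(X.shape[0]):
        # Slow weights are shared across positions.
        q, k, v = X[t] @ W_Q, X[t] @ W_K, X[t] @ W_V
        residual = W @ k - v                  # W_{t-1} k_t - v_t
        W = W - eta * np.outer(residual, k)   # one gradient step on the loss
        Z[t] = W @ q                          # read out with the updated weight
        fast_weights.append(W.copy())
    return Z, fast_weights

rng = np.random.default_rng(0)
T, d = 5, 4
X = rng.standard_normal((T, d))
W_Q, W_K, W_V = (0.5 * rng.standard_normal((d, d)) for _ in range(3))
Z, Ws = ttt_linear_forward(X, W_Q, W_K, W_V, np.zeros((d, d)), eta=0.1)
print(Z.shape, len(Ws))
```

Only one $d \times d$ matrix is live at any time; the list `fast_weights` is kept here purely for inspection.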
After $t$ gradient steps, the fast weight $W_t$ encodes the accumulated gradient information from all positions $1, \ldots, t$:
$$W_t = W_0 - \eta \sum_{s=1}^{t} \left(W_{s-1} k_s - v_s\right) k_s^\top$$
Each $W_t$ is distinct: it depends on the entire history $(x_1, \ldots, x_t)$. It plays the role of the per-position weight matrix $W^{(t)}$ from the general model, but is expressed implicitly through gradient accumulation rather than stored explicitly.
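A quick numerical check of this identity, as a sketch with randomly drawn keys and values (the array names are illustrative): running the updates one at a time and summing the stored gradients give the same $W_T$.

```python
import numpy as np

rng = np.random.default_rng(1)
T, d, eta = 6, 3, 0.05
K = rng.standard_normal((T, d))    # keys k_1..k_T, one per row
V = rng.standard_normal((T, d))    # values v_1..v_T, one per row
W0 = rng.standard_normal((d, d))

# Sequential form: W_t = W_{t-1} - eta * (W_{t-1} k_t - v_t) k_t^T
Ws = [W0]
for t in range(T):
    W_prev = Ws[-1]
    Ws.append(W_prev - eta * np.outer(W_prev @ K[t] - V[t], K[t]))

# Unrolled form: W_T = W_0 - eta * sum_s (W_{s-1} k_s - v_s) k_s^T
# Ws[s] holds W_{s-1} for the (1-indexed) key k_s stored in row K[s].
grads = [np.outer(Ws[s] @ K[s] - V[s], K[s]) for s in range(T)]
W_T_unrolled = W0 - eta * sum(grads)

assert np.allclose(Ws[-1], W_T_unrolled)
```

Note that the sum is not a closed form: each term contains $W_{s-1}$, so the weights still have to be computed in order.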
The standard Transformer loops the same static function $f_W$ over positions. TTT loops a function whose weights change at each step via gradient descent. The loop body is not $f_W$ but $f_{W_t}$, and $W_t$ evolves by a local optimization step at each position.
This gives TTT the expressivity of per-position weights without the $O(T \cdot d^2)$ storage cost. The trade-off: the per-position weight is determined by a fixed optimization trajectory (gradient descent from $W_0$), not a freely learned mapping.
The TTT update is more than vanilla gradient descent — it is closely related to LocoProp [Amid et al., 2022], a local propagation algorithm for training neural networks layer by layer.
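LocoProp decomposes the global training objective into local per-layer targets. Each layer is trained to minimize a local regression loss plus a proximity term:
$$\mathcal{L}_\ell(W) = \tfrac{1}{2}\left\|W h_{\ell-1} - \hat{h}_\ell\right\|^2 + \tfrac{\lambda}{2}\left\|W - W_\ell^{\text{prev}}\right\|_F^2$$
where $h_{\ell-1} \in \mathbb{R}^d$ is the input to layer $\ell$, $\hat{h}_\ell \in \mathbb{R}^d$ is the local target (a detached signal from the next layer's gradient), $W_\ell^{\text{prev}}$ is the previous iterate, $\lambda > 0$ weights the proximity term, and $\|\cdot\|_F$ denotes the Frobenius norm.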
TTT's self-supervised loss $\mathcal{L}_t(W) = \tfrac{1}{2}\|W k_t - v_t\|^2$ has the same form as LocoProp's local regression: the key $k_t$ plays the role of the layer input $h_{\ell-1}$ and the value $v_t$ the role of the local target $\hat{h}_\ell$. In this sense TTT is LocoProp applied horizontally across time rather than vertically across depth.
LocoProp:
Axis: depth (layer $\ell$)
Target: signal from adjacent layer
Update: local regression per layer
Purpose: credit assignment without full backprop

TTT:
Axis: time (position $t$)
Target: value $v_t$ at current position
Update: local regression per position
Purpose: per-position weight adaptation
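One way to make the correspondence with the proximity term concrete is the following identity. It is a standard fact about gradient descent, stated here as a supporting observation rather than as something taken from either paper. A single gradient step is the exact minimizer of the linearized loss plus a Frobenius proximity penalty:

$$W_t = \arg\min_{W} \left\{ \left\langle \nabla_W \mathcal{L}_t(W_{t-1}),\; W - W_{t-1} \right\rangle_F + \frac{1}{2\eta}\,\|W - W_{t-1}\|_F^2 \right\}$$

Setting the derivative of the bracketed expression with respect to $W$ to zero gives $\nabla_W \mathcal{L}_t(W_{t-1}) + \tfrac{1}{\eta}(W - W_{t-1}) = 0$, that is, $W_t = W_{t-1} - \eta\, \nabla_W \mathcal{L}_t(W_{t-1})$. Under this reading, $W_{t-1}$ takes the role of the previous iterate $W_\ell^{\text{prev}}$, and $1/\eta$ weights the proximity term.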
We can now arrange these architectures on a spectrum of how much weight sharing they enforce along the time axis:
Full sharing (Standard Transformer): $W^{(t)} = W$ for all $t \in \{1,\ldots,T\}$. One matrix, reused everywhere. Cheapest, but position-invariant.
Gradient-accumulated sharing (TTT): $W^{(t)} = W_0 - \eta\sum_{s \leq t} g_s$, where $g_s$ is the gradient at step $s$. Position-dependent, but constrained to a gradient descent trajectory. $O(d^2)$ hidden state.
No sharing (Per-Position Model): $W^{(t)}$ freely chosen for each $t$. Most expressive, but $O(T \cdot d^2)$ parameters — infeasible for long sequences.
The standard Transformer's position-invariance is its defining limitation at long context. When processing position $t = 1{,}000{,}000$, the model uses the exact same weight matrices as at position $t = 1$. The only differentiation comes from the attention pattern over context, not from the processing function itself.
TTT breaks this invariance in a principled, memory-efficient way. The fast weight $W_t$ at position $t$ encodes a compressed summary of all keys and values seen so far, updated via a gradient step that costs $O(d^2)$ per position regardless of $t$. Causal attention at position $t$, by contrast, costs $O(t \cdot d)$ to attend over its context, on top of the $O(d^2)$ projections.
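To put rough numbers on the memory side of this comparison, the sketch below counts stored floats per layer under two assumptions that are not from the text above: a single attention head, and an attention layer that keeps every past key and value. The helper names are hypothetical.

```python
def kv_cache_floats(T, d):
    """Keys and values kept for T positions: two T x d arrays (assumed)."""
    return 2 * T * d

def fast_weight_floats(d):
    """A single d x d fast-weight matrix, independent of T."""
    return d * d

d = 4_096
for T in (1_000, 2_048, 1_000_000):
    print(T, kv_cache_floats(T, d), fast_weight_floats(d))
```

Under these assumptions the two counts are equal at $T = d/2$; beyond that point the stored keys and values keep growing linearly in $T$ while the fast-weight state stays fixed.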