← Back to Home

Blog Archive

Blog

2026

Transformers Are (Naively) Looped Transformers, Horizontally

The standard Transformer applies the same weights at every sequence position — making it a horizontally looped, time-recurrent architecture. The natural generalization assigns per-position weight matrices. TTT realizes this via gradient descent, and turns out to be LocoProp applied along the time axis.

The ABC Research Mindset

A simple debiasing trick for judging new research ideas. By reordering the timeline of proposals, we can strip away the novelty premium and compare ideas on their actual merits. Applied to latent reasoning and transformer architecture debates.

Rethinking Key-Value Relationships in Linear Attention

We tested how DeltaNet handles diverse key-value relationships from identity to exponentials, scaling up to 32,000 tokens. Under standard normalized conditions, DeltaNet stays remarkably stable whether values come from k² or e^k. Information theory explains the bounds: DeltaNet retains 99.6% of information for identity but only 1.2% for exponentials, revealing fundamental compression limits.