May 2026
Transformers Are (Naively) Looped Transformers, Horizontally
The standard Transformer applies the same weights at every sequence position, which makes it a horizontally looped, time-recurrent architecture. The natural generalization assigns per-position weight matrices. Test-time training (TTT) realizes this via gradient descent and turns out to be LocoProp applied along the time axis.
February 2026
The ABC Research Mindset
A simple debiasing trick for judging new research ideas. By reordering the timeline of proposals, we can strip away the novelty premium and compare ideas on their actual merits. We apply it to debates over latent reasoning and Transformer architectures.
February 2026
Rethinking Key-Value Relationships in Linear Attention
We tested how DeltaNet handles diverse key-value relationships from identity to exponentials, scaling up to 32,000 tokens. Under standard normalized conditions, DeltaNet stays remarkably stable whether values come from k² or e^k. Information theory explains the bounds: DeltaNet retains 99.6% of information for identity but only 1.2% for exponentials, revealing fundamental compression limits.