Transformer: general memory computation system

10-03-2026

An RNN is built on recurrence: its memory is a single hidden state carried forward step by step, so what it can remember is bounded by temporal depth. That means one vector has to compress the entire history of the sequence. The information bottleneck is easy to see here: in a long sentence, the residue of earlier information fades as the sequence goes on.

A long-range dependency is a relation between two specific tokens, and it should not have to depend on everything in between. For example, if token 2 in a sentence affects token 50, then in an RNN the gradient has to pass through 48 steps, even though it doesn't need to. The relation between tokens 2 and 50 is independent of those 48 intermediate tokens: only the information between 2 and 50 is needed, and an RNN has no way to express this independence. So the issue is not just the vanishing gradient; the deeper problem is that every new token overwrites part of the hidden state.
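The 48-step path can be made concrete with a small sketch. It assumes a linear recurrence h_t = W h_{t-1} + x_t (no nonlinearity), a hidden size of 16, and a random W rescaled so its largest singular value is 0.9; none of these choices come from the text above. Under those assumptions the sensitivity of h_50 to h_2 is the matrix power W^48, and its norm cannot exceed 0.9^48.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                                   # hypothetical hidden size

# Random recurrent matrix, rescaled so its largest singular value is 0.9.
W = rng.standard_normal((d, d))
W *= 0.9 / np.linalg.norm(W, 2)

steps = 50 - 2                           # token 2 reaches token 50 through 48 steps

# For the linear recurrence h_t = W h_{t-1} + x_t, d h_50 / d h_2 = W^48.
J = np.linalg.matrix_power(W, steps)

influence = np.linalg.norm(J, 2)         # how strongly h_2 can still move h_50
bound = 0.9 ** steps                     # upper bound on that influence

print(f"influence of token 2 on token 50: {influence:.2e} (bound {bound:.2e})")
```

The bound is roughly 0.006, so whatever token 2 contributed has almost no leverage left by step 50, and the same product appears in the backward pass. With a tanh nonlinearity the factor per step would only get smaller, since its derivative is at most 1.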

This is what the attention mechanism got right. Why should I move sequentially through a long computation chain if I can look directly at the tokens I want? The dependency path length between any two tokens drops from n to 1. The transformer discarded memory evolution and treated the problem as one of communication. This is a big shift: instead of temporal memory, we get a relational communication system. It removes the information bottleneck because each token keeps its own vector instead of being squeezed into a shared one. In an RNN, token 50 can access token 2 only through one compressed memory; here token 50 can attend to token 2 directly. This also keeps unrelated information from contaminating the compressed state. Why should the entire vector change when the relation is between one token and another, not between all tokens?

One question I had was, "Why does changing the path length from n to 1 matter?" It matters because learning depends on two signals that travel through the network: information flow during the forward pass and gradient flow during the backward pass. In an RNN both signals travel through the entire sequence, and every intermediate step can distort them, like noise in a game of telephone. Attention gives a direct connection between the two tokens, so information and gradient skip that chain of distortions. This works because relations in language are relational rather than strictly sequential. Instead of forcing the dependency graph into a sequential chain, let the model discover the graph directly.

Let's take the sentence "The trophy didn't fit into the suitcase because it was too big." What does "it" refer to here? (The trophy.) Now change one word: "The trophy didn't fit into the suitcase because it was too small." What does "it" refer to now? (The suitcase.)

You see, the meaning is only found when all the information is accessible, so let the model see the entire context. It shows again that this is not a temporal depth problem; it's a communication problem. The example also shows that the relations between words (it → trophy, too big → trophy, fit → suitcase) do not form the linear chain that an RNN assumes.

Two things matter: storage (each token keeps its own vector) and communication (attention lets tokens retrieve relevant information from others). This is distributed memory plus a retrieval mechanism, and with it the whole approach shifts from time-based reasoning to relation-based reasoning.

A deeper interpretation of attention is that it is a content-addressable memory: each token asks, "Which other tokens are relevant for computing my representation?" and then retrieves information from them. Why does this independent mapping generalize well? Because it learns relations, not positions, which leads to learning instead of memorization.

Before transformers, the classical memory in neural networks was state-based: the entire past is stored in one evolving vector. This has two problems, the compression bottleneck and the retrieval problem: you cannot easily pull a specific piece of past information back out, so the model must remember everything implicitly. A transformer uses a completely different idea. Instead of one memory slot it has many (each token representation is a memory vector), and this is what I mean by distributed memory: information is spread across many vectors instead of compressed into one. Distributed memory makes retrieval easier. In the earlier example, "it" refers to a noun. In a distributed memory system this is query ("it") → retrieve ("trophy"), while in an RNN it is query ("it") → decode from a compressed hidden state, which is harder. So the transformer has these two functions: storage and retrieval.

It is called content-addressable memory because you retrieve information by similarity of content, not by location. In a normal computer you access memory by address (e.g., memory[42]), but in a content-addressable memory you ask for something like retrieve("animal concept") and the system returns the closest stored representation.
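The difference between the two kinds of lookup fits in a few lines. This is only a toy sketch: the three stored vectors, their labels, the "animal-like" query, and the use of cosine similarity are all made up for illustration.

```python
import numpy as np

# Hypothetical toy memory: each row is one stored representation.
labels = ["cat", "car", "apple"]
memory = np.array([
    [0.9, 0.1, 0.0],   # cat
    [0.0, 0.9, 0.1],   # car
    [0.1, 0.0, 0.9],   # apple
])

def lookup_by_address(index):
    """Ordinary memory: you must already know where the item lives."""
    return labels[index]

def lookup_by_content(query):
    """Content-addressable memory: return the item most similar to the query."""
    sims = memory @ query / (np.linalg.norm(memory, axis=1) * np.linalg.norm(query))
    return labels[int(np.argmax(sims))]

animal_query = np.array([1.0, 0.2, 0.0])   # made-up "animal concept" vector

print(lookup_by_address(1))                # car
print(lookup_by_content(animal_query))     # cat
```

The address lookup needs the position in advance, while the content lookup only needs a description of what is wanted. Attention uses the second kind, except that it blends the matches instead of picking a single winner.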

Once you understand this, the attention mechanism feels inevitable. A query token is compared with all stored tokens, the comparison is similarity(query, key), and the tokens with the highest similarity contribute most to what is retrieved. So attention is basically a search in memory using semantic similarity. The query Q asks, "What information am I looking for?" The key K describes what type of information a token contains, and the value V is the information that gets retrieved. It's query → match keys → retrieve values, which makes it a soft memory lookup table.
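The query → match keys → retrieve values steps can be sketched as standard scaled dot-product attention. The softmax and the division by the square root of the key dimension are the usual formulation rather than something spelled out above, and the random projection matrices W_q, W_k, W_v stand in for learned weights.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention over a single sequence."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # match: similarity of every query with every key
    weights = softmax(scores, axis=-1)     # each row becomes a distribution over tokens
    return weights @ V, weights            # retrieve: weighted mix of the values

rng = np.random.default_rng(0)
n, d = 5, 8                                # hypothetical: 5 tokens, 8-dim vectors
X = rng.standard_normal((n, d))            # one vector per token (the distributed memory)

# Random stand-ins for the learned projection matrices.
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

out, weights = attention(X @ W_q, X @ W_k, X @ W_v)
print(out.shape, weights.shape)            # (5, 8) (5, 5)
```

Row i of weights says how much token i reads from every other token, so the lookup is soft: every value contributes, weighted by how well its key matches the query.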

This resembles human cognition, where memory is often described as associative retrieval. You are in some state and instantly recall associated things. You don't look up an address; your brain matches on content similarity, and this is what transformers do. This beats sequential memory, where retrieving something means decoding it out of a single summary. The transformer keeps its memory as a set of vectors, so it can search across all of them at once, which is why retrieval becomes parallel instead of sequential.

We can think of a transformer layer as two steps: step 1 is information retrieval (attention gathers relevant context) and step 2 is information processing (feedforward layers transform the gathered information). The full model is "retrieve context, process representation" repeated many times. With this, another part of the architecture becomes inevitable. Why multiple heads in the attention mechanism? Because with several heads the model can learn multiple relational graphs simultaneously.
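The two steps and the multiple heads can be sketched as one simplified layer. This is a minimal illustration, not a full transformer block: layer normalization, masking, and biases are left out, the residual connections are the standard choice rather than something described above, and all weights and sizes are random or hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    n, d = X.shape
    d_head = d // n_heads

    def split(M):
        # (n, d) -> (n_heads, n, d_head): each head works in its own subspace.
        return M.reshape(n, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(X @ W_q), split(X @ W_k), split(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (n_heads, n, n)
    weights = softmax(scores)                 # one relation graph per head
    heads = weights @ V                       # (n_heads, n, d_head)
    merged = heads.transpose(1, 0, 2).reshape(n, d)
    return merged @ W_o, weights

def feed_forward(X, W1, W2):
    # Position-wise MLP with ReLU: applied to each token independently.
    return np.maximum(X @ W1, 0.0) @ W2

def layer(X, p, n_heads):
    attended, weights = multi_head_attention(
        X, p["W_q"], p["W_k"], p["W_v"], p["W_o"], n_heads)
    X = X + attended                              # step 1: retrieve context
    X = X + feed_forward(X, p["W1"], p["W2"])     # step 2: process representation
    return X, weights

def make_params(rng, d, d_ff):
    # Random stand-ins for learned weights.
    shapes = {"W_q": (d, d), "W_k": (d, d), "W_v": (d, d), "W_o": (d, d),
              "W1": (d, d_ff), "W2": (d_ff, d)}
    return {name: 0.1 * rng.standard_normal(shape) for name, shape in shapes.items()}

rng = np.random.default_rng(0)
n, d, d_ff, n_heads, n_layers = 6, 8, 32, 2, 3    # hypothetical sizes
X = rng.standard_normal((n, d))
for _ in range(n_layers):                         # retrieve + process, repeated
    X, weights = layer(X, make_params(rng, d, d_ff), n_heads)

print(X.shape, weights.shape)                     # (6, 8) (2, 6, 6)
```

The weights array has one n × n matrix per head, which is the sense in which each head can hold its own relational graph over the same tokens.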

So a transformer can be seen as a differentiable database: a query system + associative memory + parallel retrieval. For me the interesting part is how this relates to the structure of human attention. Humans seem to process a scene as many representations in parallel, with attention selecting the relevant pieces, which suggests the architecture implicitly discovered a similar computational principle. So the architecture becomes distributed memory + content-based retrieval + local computation + iterative refinement.

The question that comes up next is how large language models, trained only to predict the next token, begin to generalize well and perform multi-step reasoning. How can a system built only from memory retrieval and feed-forward transformations reason? The idea is still just "retrieve context and transform representation," because reasoning is iterative refinement. For example, take the question: John has 3 apples and buys 2 more. How many apples does he have? To answer it, the model needs to recognize the numbers, recognize the operation, and apply the addition (3 + 2 = 5). See the pattern? This is well suited to transformers because attention brings together distant pieces of information and each layer refines the result. So reasoning becomes progressive representation transformation. That's why I don't think it is only statistical computation; there is something deeper here, because the model has to learn something about the structure behind the text to produce the next token well. And that's why scaling helped: large models can store enormous amounts of latent structure, so the model begins to represent deeper patterns in the data and reason over them. This is why the model generalizes well.


That's why "attention is all you need."