Byte movement per unit learning

After writing kernels and building an understanding of the infrastructure behind scaling distributed model training, one realization became clear to me: at the level of first principles, scaling deep learning is the problem of maximizing useful learning per unit of information movement. The core question is: how little information can you move, how rarely can you synchronize it, and still make the model learn as if everything were globally connected?

At small scale, model training feels like mathematical computation. When we scale it to a data center, it starts to feel more like physics, where compute locality, bandwidth, latency, topology, synchronization, memory hierarchy, fault tolerance, and energy all matter.

In distributed training, information flows across GPU memory, GPU interconnects, nodes, racks, the data-center fabric, and storage. The bottleneck changes with scale. Sometimes it is compute. Sometimes it is memory bandwidth, network topology, pipeline bubbles, stragglers, or checkpointing. The more global dependencies you have, the more you pay for information movement.
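To make the shifting bottleneck concrete, here is a minimal roofline-style sketch. It assumes a kernel is limited by whichever takes longer, arithmetic or data movement. The hardware figures (`PEAK_FLOPS`, `BANDWIDTH`) and the helper names are hypothetical placeholders, not measurements of any real device.

```python
# Hypothetical accelerator figures, chosen only for illustration.
PEAK_FLOPS = 1e15   # FLOP/s
BANDWIDTH = 3e12    # bytes/s of memory bandwidth


def matmul_cost(m, k, n, bytes_per_elem=2):
    """FLOPs and minimum bytes moved for an (m x k) @ (k x n) matmul."""
    flops = 2 * m * k * n                                # one multiply + one add per term
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)  # read A and B, write C once
    return flops, bytes_moved


def bound_by(flops, bytes_moved, peak_flops=PEAK_FLOPS, bandwidth=BANDWIDTH):
    """Name the limiting resource under a simple roofline model."""
    compute_time = flops / peak_flops
    memory_time = bytes_moved / bandwidth
    return "compute" if compute_time >= memory_time else "memory"


# A large square matmul reuses each loaded element many times.
large = bound_by(*matmul_cost(8192, 8192, 8192))
# A single-row matmul (for example, one token at decode time) reuses almost nothing.
skinny = bound_by(*matmul_cost(1, 8192, 8192))
print(large, skinny)
```

Under these assumed numbers the same operation lands on different sides of the roofline depending only on its shape, which is the sense in which the bottleneck moves.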

From this, we can see that scaling well means maximizing useful computation while minimizing communication. A good architecture lets devices do a lot of local computation before requiring global communication. Every kind of parallelism, including data, pipeline, tensor, expert, and context parallelism, answers the same three questions: where should information live, when should it move, and how much must move?
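The idea of doing more local work per synchronization can be put into rough numbers for data parallelism. The sketch below assumes a ring all-reduce, in which each device sends about 2(N-1)/N times the payload, and the common rule of thumb of roughly 6 FLOPs per parameter per token for a dense transformer's forward and backward pass. The function names and the `sync_every` parameter (micro-steps between synchronizations, as in gradient accumulation) are hypothetical.

```python
def ring_allreduce_bytes(payload_bytes, n_devices):
    """Bytes each device sends in a ring all-reduce of `payload_bytes`."""
    if n_devices < 2:
        return 0.0  # nothing to exchange on a single device
    return 2 * (n_devices - 1) / n_devices * payload_bytes


def flops_per_byte_sent(n_params, tokens_per_step, n_devices,
                        sync_every=1, bytes_per_grad=2):
    """Local FLOPs performed per byte of gradient traffic sent by one device.

    Assumes ~6 FLOPs per parameter per token (a rule of thumb, not exact)
    and one gradient all-reduce every `sync_every` local micro-steps.
    """
    local_flops = 6 * n_params * tokens_per_step * sync_every
    sent = ring_allreduce_bytes(n_params * bytes_per_grad, n_devices)
    return local_flops / sent


# Hypothetical setup: 1B parameters, 8192 tokens per device per micro-step, 8 devices.
every_step = flops_per_byte_sent(1e9, 8192, 8, sync_every=1)
every_4th = flops_per_byte_sent(1e9, 8192, 8, sync_every=4)
print(f"{every_step:.0f} FLOPs/byte vs {every_4th:.0f} FLOPs/byte")
```

Synchronizing a quarter as often quadruples the local computation bought per byte sent, while the bytes per synchronization stay the same. Whether the model still learns as well with less frequent synchronization is the separate, harder question.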

The table below makes this concrete. Dense attention moves token-token interactions, so its communication cost grows quadratically with sequence length. Linear attention and MLA (multi-head latent attention, as used in DeepSeek models) instead move a recurrent state summary or a compressed KV latent. Dense attention buys more expressivity at the price of heavy communication; conversely, linear attention and MLA do more local computation with far less communication.

             
Computational circuit          Information moved         Compute  Communication      Bottleneck
------------------------------------------------------------------------------------------------------------
Dense attention                token-token interactions  medium   high               context communication
Linear attention               recurrent state summary   high     lower              state compression
MLA                            compressed KV latent      high     lower              latent quality
MoE                            expert routing            high     sparse, irregular  load balance
Tensor parallel MLP/attention  partial matmul outputs    high     frequent           all-reduce / all-gather
Data parallel                  gradients                 high     periodic           gradient sync
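The first three rows can be compared with back-of-envelope arithmetic on how much state each mechanism has to keep (and therefore move) per layer. This is a sketch under simplifying assumptions: all dimensions are hypothetical, the MLA-style estimate counts only a single compressed latent per token and ignores any extra positional key, and the linear-attention estimate counts one head_dim x head_dim state matrix per head and ignores normalizer terms.

```python
def dense_kv_bytes(seq_len, n_heads, head_dim, bytes_per_elem=2):
    """Keys and values cached for every token: grows with sequence length."""
    return 2 * seq_len * n_heads * head_dim * bytes_per_elem


def compressed_latent_bytes(seq_len, latent_dim, bytes_per_elem=2):
    """One compressed latent vector per token: still grows with sequence
    length, but with a much smaller per-token footprint."""
    return seq_len * latent_dim * bytes_per_elem


def linear_state_bytes(n_heads, head_dim, bytes_per_elem=2):
    """One fixed-size recurrent state per head: independent of sequence length."""
    return n_heads * head_dim * head_dim * bytes_per_elem


# Hypothetical layer: 32 heads of dimension 128, latent size 512, 32768-token context.
SEQ_LEN, N_HEADS, HEAD_DIM, LATENT_DIM = 32768, 32, 128, 512

dense = dense_kv_bytes(SEQ_LEN, N_HEADS, HEAD_DIM)        # 512 MiB per layer
latent = compressed_latent_bytes(SEQ_LEN, LATENT_DIM)     # 32 MiB per layer
linear = linear_state_bytes(N_HEADS, HEAD_DIM)            # 1 MiB per layer

MIB = 1024 * 1024
for name, size in [("dense KV", dense), ("compressed latent", latent), ("linear state", linear)]:
    print(f"{name:>18}: {size / MIB:.0f} MiB")
```

The absolute numbers depend entirely on the assumed dimensions. The point is the shape of the scaling: the dense cache and the latent cache both grow with context length, with very different constants, while the recurrent state does not grow at all. What the smaller representations give up is listed in the table's bottleneck column: state compression and latent quality.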

The main design principle is to shape the computation so less information needs to move in the first place. On one GPU, computation is what matters most, and communication is mostly internal memory movement. At scale, communication becomes an important part of the algorithm itself.