Distributed Computing

Gemini-authored.

The Distributed Mind: From Chaos to Consensus

The transition from single-node programming to distributed systems is not merely a change in scale; it is a change in physics. In a single process, time is absolute, memory is shared, and truth is singular. In a distributed system, time is relative, memory is partitioned, and truth is merely an agreement between unreliable observers.

This document is a reconstruction of the mental models required to build Prime, a distributed training system that survives the chaos of the WAN. We will not start with code. We will start with the fundamental limitations of the universe—causality, relativity, and the impossibility of perfect consensus—and build our way up to the architecture of DiLoCo.


Part I: The Fragility of Time

1.1 The Illusion of Now

In a centralized system, we rely implicitly on the wall clock. If we write a log entry at 12:00:01 and another at 12:00:02, we assume the first happened before the second. In a distributed system, this assumption is dangerous.

Physical clocks are imperfect. They consist of oscillating crystals that drift due to temperature, voltage, and age. Even with NTP (Network Time Protocol), clocks across a cluster can skew by milliseconds or seconds. 1 In a high-frequency trading system or a high-throughput parameter server, a 5ms skew is an eternity.

If Node A sends a message at physical time \(T=100\) and Node B receives it at physical time \(T=90\) (because B’s clock is slow), the system perceives an effect preceding its cause. This violation of causality breaks state machines, confuses debuggers, and ruins consistency guarantees. Therefore, we must abandon physical time as our source of truth.

1.2 Partial Orders and Happens-Before

If we cannot use time, we must use causality. We define the relationship between events not by when they happened, but by what caused what. Leslie Lamport formalized this in 1978 with the “Happens-Before” relation, denoted as \(\rightarrow\). 2

The rules are simple but absolute:

  1. Program Order: If event \(a\) and event \(b\) occur in the same process, and \(a\) comes before \(b\), then \(a \rightarrow b\).
  2. Message Passing: If event \(a\) is the sending of a message and event \(b\) is the receipt of that message, then \(a \rightarrow b\).
  3. Transitivity: If \(a \rightarrow b\) and \(b \rightarrow c\), then \(a \rightarrow c\).

If two events \(x\) and \(y\) exist such that neither \(x \rightarrow y\) nor \(y \rightarrow x\), they are concurrent (\(x \parallel y\)). This does not mean they happened at the exact same physical instant. It means they are causally independent; the outcome of one could not have influenced the other.

This structure creates a Partial Order. Unlike a Total Order (where every pair of elements is comparable), a Partial Order allows for islands of events that have no relative ordering. Understanding this geometric structure is crucial for debugging race conditions in parameter updates: if two gradients are computed concurrently, the order in which they are applied is arbitrary unless we impose a mechanism to order them.

1.3 Logical Clocks

To make this partial order computable, we introduce Logical Clocks. These are mechanisms to map events to integers (or vectors) such that the numerical order respects the causal order.

Lamport Clocks

A Lamport Clock is a simple counter maintained by each process.

  • State: A single integer \(C\), initialized to 0.
  • Local Rule: Before executing an event, increment \(C \leftarrow C + 1\).
  • Send Rule: When sending a message \(m\), attach \(C\), transmitting \((m, C)\).
  • Receive Rule: On receiving \((m, C_{msg})\), update the local clock: \(C \leftarrow \max(C, C_{msg}) + 1\).

This algorithm guarantees that if \(a \rightarrow b\), then \(C(a) < C(b)\). This is useful for consistent logging and tie-breaking. However, it has a fatal flaw: the converse is not true. If \(C(a) < C(b)\), we cannot conclude that \(a \rightarrow b\). They might be concurrent events where \(a\) just happened to have a smaller counter. 3

Vector Clocks

To capture the full geometry of causality, we need Vector Clocks. Instead of a single integer, every process maintains a vector \(V\) of length \(N\) (the number of processes).

  • State: \(V_i[j] = 0\) for all \(j\). Process \(i\) tracks its knowledge of every other process’s progress.
  • Local Rule: Process \(i\) increments its own component: \(V_i[i] \leftarrow V_i[i] + 1\).
  • Send Rule: Send the full vector \(V_i\) with the message.
  • Receive Rule: On receiving \(V_{msg}\), update \(V_i[j] \leftarrow \max(V_i[j], V_{msg}[j])\) for all \(j\), then increment \(V_i[i]\).

Vector clocks provide the Strong Consistency Condition: \(V(a) < V(b) \iff a \rightarrow b\). If vectors are incomparable (e.g., \([2, 0]\) vs \([0, 2]\)), the events are concurrent. 4

Let us now go into this in more detail.

In many ways, distributed systems are the physics of computer science. Just as Newtonian mechanics breaks down near the speed of light, our intuitive understanding of “order” and “simultaneity” disintegrates when we span a network. If you are building a system like Prime or DiLoCo—where hundreds of GPUs must agree on the state of a neural network—you are not just writing Python scripts; you are legislating a universe. You define what “now” means. You define what “happened” means. And, as we will see, you must accept that some things are fundamentally unknowable.

To build the massive, elastic clusters required for modern AI, we must first unlearn the concept of Time.

1.1 The Hallucination of Physical Clocks

In a single-threaded program running on a single CPU core, time is absolute. If you execute instruction \(A\) followed by instruction \(B\), \(A\) happens before \(B\). The CPU enforces this. The operating system enforces this. The physics of the silicon enforces this. We call this Program Order. 5

When we move to a distributed system—say, a cluster of H100s training a transformer—we lose this guarantee. We have multiple distinct processes, each with its own quartz crystal oscillator vibrating at a specific frequency. These crystals are physical objects; they are subject to heat, manufacturing variances, and age. They drift. Two identical servers powered on at the same moment will define “one second” slightly differently. Over days, they drift apart by milliseconds or seconds. 6

This leads to a terrifying realization: we cannot rely on physical timestamps to order events across machines.

Consider a scenario in Prime:

  1. Node A updates a gradient at physical time \(t=100\).
  2. Node A sends a message to Node B.
  3. Node B receives the message.
  4. Node B updates its weights.

If Node B’s clock is running 500ms behind Node A, Node B might assign timestamp \(t=90\) to its update. If we blindly trust these timestamps, it looks like Node B updated its weights before Node A calculated the gradient. The effect precedes the cause. Causality is violated.

In distributed systems, we care about Causality, not physical time. We don’t care when something happened; we care what caused it to happen. If an event \(A\) caused event \(B\), then \(A\) must happen before \(B\) in our model, regardless of what the wall clocks say.

1.2 Partial Orders and the Happens-Before Relation

To formalize this, we turn to Leslie Lamport’s 1978 paper, “Time, Clocks, and the Ordering of Events in a Distributed System.” This is perhaps the most important paper in the field. Lamport realized that “happening before” is not about time; it’s about information flow.

We define a relation, denoted by the arrow \(\rightarrow\), called happens-before. It satisfies three axioms:

  1. Program Order: If event \(a\) and event \(b\) occur in the same process, and \(a\) comes before \(b\) in that process’s execution trace, then \(a \rightarrow b\).
  2. Message Passing: If event \(a\) is the sending of a message by one process and event \(b\) is the receipt of that same message by another process, then \(a \rightarrow b\). 7
  3. Transitivity: If \(a \rightarrow b\) and \(b \rightarrow c\), then \(a \rightarrow c\).

If \(a \rightarrow b\), we say \(a\) causally precedes \(b\). This seems trivial, but it has a profound implication. What if we have two events, \(x\) and \(y\), such that we cannot reach \(y\) from \(x\) via any chain of messages and local steps, nor can we reach \(x\) from \(y\)?

If neither \(a \rightarrow b\) nor \(b \rightarrow a\) holds, then \(a\) and \(b\) are concurrent. We write this as \(a \parallel b\).

Concurrency, in the distributed systems sense, does not mean “simultaneous” (occurring at the same physical instant). It means “causally independent.” It means that \(a\) could not have influenced \(b\), and \(b\) could not have influenced \(a\). They are oblivious to each other. In the relativistic universe of a datacenter, they exist in each other’s “Elsewhere”—outside their respective light cones. 8

Visualizing with Space-Time Diagrams

To build intuition, we draw space-time diagrams. Time flows downward.

  • Vertical lines represent processes (timelines).
  • Dots on the lines represent events.
  • Diagonal arrows between lines represent messages.

Process P1      Process P2      Process P3
    |               |               |
    a               |               |
    | \             |               |
    |  \ message    |               |
    |   \           |               |
    b    ----->     c               |
    |               |  \            |
    |               |   \           |
    |               d    ----->     e
    |               |               |

In this diagram:

  • \(a \rightarrow b\) (Program order on P1).
  • \(b \rightarrow c\) (Message from P1 to P2).
  • \(c \rightarrow d\) (Program order on P2).
  • \(d \rightarrow e\) (Message from P2 to P3).

By transitivity, \(a \rightarrow e\). Event \(a\) “happened before” event \(e\). Information from \(a\) could have flowed to \(e\).

Now consider an event \(z\) on Process P3 that happens physically before \(e\) arrives. Is \(a \rightarrow z\)? No. Is \(z \rightarrow a\)? No. They are concurrent (\(a \parallel z\)). P3 has no way of knowing \(a\) happened until the chain of messages arrives. Until that moment, the state of P1 is effectively in a quantum superposition from P3’s perspective.
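This reasoning can be checked mechanically. Below is a minimal Python sketch (illustrative, not part of the Prime codebase) that treats the program-order and message edges of the diagram as a graph and computes the transitive closure of happens-before; any pair of events left unrelated by the closure is concurrent:

```python
from itertools import product

def happens_before(edges):
    """Transitive closure of program-order and message edges.
    Returns the set of pairs (x, y) such that x -> y."""
    hb = set(edges)
    changed = True
    while changed:
        changed = False
        for (x, y), (u, v) in product(list(hb), list(hb)):
            if y == u and (x, v) not in hb:   # x -> y and y -> v
                hb.add((x, v))
                changed = True
    return hb

# Edges from the diagram, plus a local event z on P3 before e arrives.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("z", "e")]
hb = happens_before(edges)

assert ("a", "e") in hb                               # a -> e (transitivity)
assert ("a", "z") not in hb and ("z", "a") not in hb  # a || z: concurrent
```

Note that \(z \rightarrow e\) holds (program order on P3), yet \(z\) and \(a\) remain incomparable: exactly the “Elsewhere” situation described above.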

1.3 Logical Clocks: Capturing Causality with Integers

We have a definition of order (\(\rightarrow\)), but we need a mechanism to track it. We can’t query the global “happens-before” graph in real-time. We need a way to stamp events locally such that the stamps respect the causal order.

1.3.1 Lamport Clocks

The simplest mechanism is the Lamport Clock. Every process \(P_i\) maintains a single integer counter, \(C_i\).

The Algorithm:

  1. Initialize \(C_i = 0\).
  2. Local Rule: Before executing any internal event, increment \(C_i \leftarrow C_i + 1\).
  3. Send Rule: When sending a message \(m\), attach the current timestamp \(t = C_i\). Send \((m, t)\).
  4. Receive Rule: On receiving \((m, t)\), update the local clock: \(C_i \leftarrow \max(C_i, t) + 1\).

Properties: The Lamport Clock guarantees the Clock Consistency Condition: \[ \text{if } a \rightarrow b, \text{ then } C(a) < C(b). \]

This is powerful. It allows us to totally order events in the system. If we break ties using Process IDs (e.g., if \(C(a) = C(b)\), use \(ID(a) < ID(b)\)), we get a total ordering that is consistent with causality. This is crucial for things like a distributed mutex or a consistent log. 9

The Limitation: Note the direction of the implication. \(a \rightarrow b \implies C(a) < C(b)\). The converse is not true. If we see \(C(a) < C(b)\), we cannot conclude that \(a \rightarrow b\). They might be concurrent events where \(a\) just happened to have a smaller counter.

In Prime/DiLoCo, this ambiguity matters. When we look at a checkpoint from a peer and its “step count” is lower than ours, did it happen in our past (causality), or is it a divergent branch of training that happens to have a lower index (concurrency)? A simple integer step counter is essentially a Lamport Clock. It tells us strictly about ancestry only if we assume a linear chain. In a branching/merging topology, it is ambiguous.
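As an illustration, here is a minimal Lamport clock in Python (a sketch following the rules above, not Prime’s implementation), demonstrating both the guarantee and its one-way nature:

```python
class LamportClock:
    """Single-integer logical clock (the rules of section 1.3.1)."""
    def __init__(self):
        self.c = 0

    def tick(self):            # Local Rule: increment before the event
        self.c += 1
        return self.c

    def send(self):            # Send Rule: the send is itself an event
        self.c += 1
        return self.c          # timestamp attached to the message

    def receive(self, t_msg):  # Receive Rule: max(local, msg) + 1
        self.c = max(self.c, t_msg) + 1
        return self.c

p1, p2 = LamportClock(), LamportClock()
a = p1.tick()        # event a on P1: C = 1
t = p1.send()        # P1 sends a message stamped 2
b = p2.receive(t)    # event b on P2: C = max(0, 2) + 1 = 3
assert a < b         # a -> b implies C(a) < C(b)

# The converse fails: an unrelated process can carry a smaller stamp.
p3 = LamportClock()
x = p3.tick()        # x is concurrent with everything above
assert x < t         # C(x) < C(t), yet x does NOT happen-before t
```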

1.3.2 Vector Clocks

To solve the ambiguity of Lamport clocks, we need Vector Clocks. Instead of a single integer, each process maintains a vector of integers \(V\), where \(V[j]\) represents the process’s knowledge of the logical time at Process \(j\).

The Algorithm:

  1. Initialize \(V_i = [0, 0, \dots, 0]\) at every process \(i\).
  2. Local Rule: On an event at process \(i\), increment its own component: \(V_i[i] \leftarrow V_i[i] + 1\).
  3. Send Rule: Send the full vector \(V_i\) with the message.
  4. Receive Rule: On receiving vector \(V_{msg}\), update the local vector: \[ \forall k: V_i[k] \leftarrow \max(V_i[k], V_{msg}[k]) \] Then increment self: \(V_i[i] \leftarrow V_i[i] + 1\).

The Isomorphism: Vector clocks provide an exact characterization of causality: \[ V(a) < V(b) \iff a \rightarrow b \]

We define “less than” for vectors as: \(V_A < V_B\) if every component of \(V_A\) is less than or equal to the corresponding component of \(V_B\), and at least one component is strictly smaller. If neither \(V_A < V_B\) nor \(V_B < V_A\), then \(a \parallel b\).

Why don’t we use them everywhere? The size. The vector grows linearly with \(N\), the number of processes. In a DiLoCo training run with 512 nodes, attaching a 512-integer vector (2 KB) to every small metadata packet is wasteful. Furthermore, managing dynamic membership (nodes joining and leaving) requires resizing the vectors, which is algorithmically complex. 10
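A minimal vector-clock sketch (illustrative helper names, not a library API) makes the strong consistency condition concrete:

```python
def vc_less(a, b):
    """V(a) < V(b): every component <=, at least one strictly <."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

class VectorClock:
    def __init__(self, i, n):
        self.i, self.v = i, [0] * n   # this process's index and vector

    def tick(self):                    # Local Rule
        self.v[self.i] += 1
        return list(self.v)

    def receive(self, v_msg):          # Receive Rule: merge, then tick
        self.v = [max(x, y) for x, y in zip(self.v, v_msg)]
        self.v[self.i] += 1
        return list(self.v)

p0, p1 = VectorClock(0, 2), VectorClock(1, 2)
a = p0.tick()          # [1, 0]
x = p1.tick()          # [0, 1] -- causally independent of a
msg = p0.tick()        # [2, 0], the send event (sends carry the vector)
b = p1.receive(msg)    # merge [2, 0] with [0, 1], then tick -> [2, 2]

assert vc_less(a, b)                            # a -> b
assert not vc_less(a, x) and not vc_less(x, a)  # a || x: incomparable
```

Unlike the Lamport example, the incomparable vectors \([1, 0]\) and \([0, 1]\) positively identify concurrency rather than leaving it ambiguous.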

1.4 The FLP Impossibility Result

We have established how to order events if they happen. But can we get a group of computers to agree on something?

In 1985, Fischer, Lynch, and Paterson published a result that effectively said: “No.”

The FLP Impossibility Result states:

> In an asynchronous system (unbounded message delays, varying process speeds) with reliable message delivery, if even one process can crash, there is no deterministic algorithm that can solve the Consensus problem (guaranteeing both safety and termination).

This result is mind-bending. It doesn’t say consensus is “hard”; it says it is impossible.

The Intuition (The Adversary): Imagine a system trying to decide between value “0” and value “1”. The system starts in a “bivalent” state (it could eventually decide 0, or it could decide 1). To reach a decision, the system must move to a “univalent” state (where only one outcome is possible). FLP proves that an adversary, who controls the scheduling of message delivery, can always keep the system in a bivalent state. Just as a process is about to receive the message that would tip the vote to “1”, the adversary delays that message. The process might then suspect the sender is dead and change its vote to “0” to keep progress going. The adversary then delivers the delayed message, pushing the system back toward “1”.

The adversary can maintain this teeter-totter forever. 11

Escaping FLP in the Real World: If consensus is impossible, how do Prime, Kubernetes, or Google Spanner work? They “cheat” by weakening the assumptions of FLP.

  1. Partial Synchrony: We assume that eventually the network behaves well. We use timeouts. A timeout (e.g., in Raft) is a mechanism that sacrifices Liveness for Safety. If the network is truly partitioned, Raft stops processing writes (liveness fails). But it never corrupts data (safety holds). When the network heals (synchrony returns), it proceeds.
  2. Randomization: FLP applies to deterministic algorithms. If we introduce randomness (e.g., Raft’s randomized election timeouts), we can break the symmetry that the adversary uses to maintain the bivalent state. The probability of an infinite deadlock drops to zero exponentially fast.

1.5 Synthesis: The Mindset for Phase 2

We conclude Phase 1 with a mental shift. When we design the control plane for Prime/DiLoCo:

  1. We will not trust time.time(). We will use Epochs (terms) and Step Counts (logical clocks).
  2. We will understand that any “leader” we elect is only a leader assuming the network is stable.
  3. We will realize that partitions are inevitable. When a node goes silent, we will be forced to make a choice: do we wait forever (sacrificing availability for consistency) or do we proceed without them (sacrificing consistency for availability)?

In DiLoCo, the answer is nuanced:

  • Inner Loop: Purely local. Availability is 100%.
  • Outer Loop: Requires consistency (All-Reduce). If the network is bad, we stall. We choose Consistency over Availability for the weight update.
  • Checkpointing: Uses “last writer wins” or causal merging.

Armed with the theory of Order and the grim reality of FLP, we are ready to construct the solution to Consensus. In Phase 2, we will build Paxos and Raft from scratch, seeing them not as magic libraries, but as direct responses to the problems raised here.


Part II: The Impossibility of Agreement

Once we understand time, we encounter the next hurdle: Consensus. How do we get a distributed set of nodes to agree on a single value (e.g., “Commit transaction X” or “Node Y is the leader”)?

2.1 The Two Generals Problem

The fundamental constraint of distributed systems is that we communicate over an unreliable channel. The Two Generals Problem proves that strict consensus is impossible if the communication link can drop messages.

Imagine two generals, Alice and Bob, camped on opposite hills of a valley occupied by the enemy. They must attack simultaneously to win; if they attack alone, they die. Alice sends a messenger: “Attack at dawn?” The messenger might be captured. If Bob gets the message, he sends a confirmation: “Yes, dawn.” But Bob doesn’t know if Alice got his confirmation. If Alice doesn’t receive it, she won’t attack. So Alice needs to acknowledge the acknowledgement. This infinite regression means they can never reach 100% common knowledge. 12

2.2 The FLP Result

In 1985, Fischer, Lynch, and Paterson extended this to a more subtle nightmare. They proved that in an asynchronous system (where message delay is unbounded) with reliable channels, if even one process can crash, consensus is impossible.

This is the FLP Impossibility Result. It implies that no algorithm can guarantee both Safety (agreement) and Liveness (termination) in a purely asynchronous model.

The intuition is based on “bivalence.” At any point, the system is in a state where it could eventually decide 0 or 1. If a process is about to send a decisive message that would tip the system to “1”, an adversary can delay that message. The system remains “bivalent.” The adversary can play this game forever, keeping the system in a livelock. 13


Part III: The Parliament of Protocols

Since we cannot have perfect consensus, we build algorithms that guarantee Safety absolutely, but rely on timing for Liveness. These are the Consensus Protocols.

3.1 Paxos

Paxos is the grandfather of consensus. It operates using a specific invariant: Quorum Intersection. If you need a majority (\(N/2 + 1\)) to agree to something, any two majorities must share at least one member. This overlap acts as the system’s memory.

Paxos defines three roles:

  1. Proposers: Issue proposals with a proposal number \(N\).
  2. Acceptors: Vote on proposals. They store the highest proposal number they’ve seen.
  3. Learners: Listen for the chosen value.

The protocol has two phases (like a handshake):

  • Phase 1 (Prepare): The Proposer chooses \(N\) and sends “Prepare(\(N\))” to a quorum. Acceptors respond only if \(N\) is greater than any proposal number they have ever promised; they reply with their last accepted value, if any.
  • Phase 2 (Accept): If the Proposer hears from a majority, it picks a value (preferring any value returned by the Acceptors) and sends “Accept(\(N, Val\)).” If a majority of Acceptors accept, the value is chosen.

The complexity of Paxos lies in handling the failure modes. What if the Proposer crashes after Phase 1? What if a higher \(N\) comes in during Phase 2? The logic ensures that once a value is chosen, no future proposal can choose a different value. 14
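The quorum-intersection invariant that underpins all of this can be verified exhaustively for small clusters. A quick brute-force check in Python (illustrative only):

```python
from itertools import combinations

def check_quorum_intersection(n):
    """Exhaustively verify that any two majority quorums of an
    n-node cluster share at least one member."""
    majority = n // 2 + 1
    quorums = list(combinations(range(n), majority))
    return all(set(q1) & set(q2) for q1, q2 in combinations(quorums, 2))

# Holds for every cluster size (here checked up to 7 nodes):
# 2 * (n // 2 + 1) > n, so two majorities must overlap.
assert all(check_quorum_intersection(n) for n in range(2, 8))
```

This overlap is why an Acceptor that promised proposal \(N\) is guaranteed to be consulted by any later Proposer: the later quorum cannot avoid it.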

3.2 Raft

Raft was designed to be understandable. It solves the exact same problem as Multi-Paxos (replicating a log) but decomposes it into three sub-problems:

  1. Leader Election: Time is divided into Terms. Nodes are Followers, Candidates, or Leaders. If a Follower hears nothing (timeout), it increments the Term, votes for itself, and requests votes. Majority vote wins.
  2. Log Replication: The Leader accepts client commands, appends them to its log, and sends AppendEntries RPCs to followers.
  3. Safety:
    • Election Safety: At most one leader per term.
    • Log Matching: If two logs contain an entry with the same index and term, then the logs are identical in all preceding entries.

The Log Matching Property is the subtle magic of Raft. It creates a recursive proof of consistency. When a follower acknowledges an entry at index 100, the leader knows for a fact that the follower also has entries 1–99 matching the leader’s log. This allows for simple “catch-up” logic: if logs diverge, the leader simply finds the last common index and overwrites the follower’s divergent suffix. 15
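The catch-up logic can be sketched in a few lines of Python. This is only an illustration of the consistency check and back-up loop, not a full Raft implementation; the function and variable names are hypothetical:

```python
def append_entries(follower_log, prev_index, prev_term, entries):
    # Follower-side consistency check: reject unless our log contains
    # an entry at prev_index with a matching term (prev_index 0 = log start).
    if prev_index > 0:
        if (len(follower_log) < prev_index
                or follower_log[prev_index - 1][0] != prev_term):
            return False
    # Delete any conflicting suffix, then append the leader's entries.
    del follower_log[prev_index:]
    follower_log.extend(entries)
    return True

# Entries are (term, command). The follower kept an entry from a
# deposed term-2 leader; the current leader is in term 3.
leader_log   = [(1, "a"), (1, "b"), (3, "c")]
follower_log = [(1, "a"), (2, "x")]

next_index = len(leader_log) + 1      # leader starts optimistic (1-based)
while True:
    prev_index = next_index - 1
    prev_term = leader_log[prev_index - 1][0] if prev_index > 0 else 0
    if append_entries(follower_log, prev_index, prev_term,
                      leader_log[next_index - 1:]):
        break
    next_index -= 1                   # back up until the logs agree

assert follower_log == leader_log     # divergent suffix overwritten
```

The leader probes backwards one index at a time until the Log Matching check succeeds, then replicates its suffix; the follower’s term-2 entry is discarded.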


Part IV: Consistency Models

When we build a datastore (or a distributed memory for training), we must define what “reading” means. If I write \(X=5\) on Node A, when can Node B read \(5\)?

4.1 Linearizability (Strong Consistency)

Linearizability provides the illusion of a single, global copy of data.

  • Definition: Every operation takes effect atomically at some point between its invocation and response.
  • Constraint: Real-time ordering is preserved. If Op A finishes before Op B starts, B must see A.

Achieving this requires synchronization. To serve a linearizable read, a node must consult the leader or a quorum to ensure it isn’t holding stale data (e.g., if it was partitioned away). This adds latency (RTT) to every read. 16

4.2 Sequential Consistency

Weaker than linearizability. It requires that operations appear to take effect in some total order that is consistent with the order seen on each individual process.

  • Difference: It respects program order but not real-time order. If I write \(X\) and tell you on the phone “check X,” you might not see it yet, but once you see it, you will never see an older value.

4.3 Causal Consistency

This is the sweet spot for many high-performance systems. It guarantees that operations that are causally related are seen in the same order by everyone. Concurrent operations can be seen in different orders.

  • Scenario: A user posts a photo (Event A). The user comments on the photo (Event B). \(A \rightarrow B\). Causal consistency ensures no one sees the comment before the photo. However, if two distinct users like the photo concurrently, different replicas might see the likes arrive in different orders.

4.4 Session Guarantees

For end-users, we often implement specific “Session Guarantees” rather than system-wide consistency:

  • Read Your Writes: If I write X, my subsequent reads see X (or newer).
  • Monotonic Reads: If I see version \(V\), I will never see a version older than \(V\) in the future.
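A monotonic-reads guard can live entirely on the client. As a sketch (the `Replica` stub and class names are hypothetical, not a real API):

```python
class Replica:
    """Stub replica returning a (version, value) pair (hypothetical)."""
    def __init__(self, version, value):
        self.version, self.value = version, value
    def get(self):
        return self.version, self.value

class MonotonicReadsClient:
    """Client-side session guarantee: remember the highest version seen
    and refuse reads that would travel backwards in time."""
    def __init__(self):
        self.last_seen = 0
    def read(self, replica):
        version, value = replica.get()
        if version < self.last_seen:
            raise RuntimeError("stale replica: monotonic reads violated")
        self.last_seen = version
        return value

client = MonotonicReadsClient()
assert client.read(Replica(2, "new")) == "new"
try:
    client.read(Replica(1, "old"))    # older than what we already saw
    raise AssertionError("should have been rejected")
except RuntimeError:
    pass                              # guard correctly refused the read
```

In practice the client would retry against a fresher replica rather than fail, but the version check is the essence of the guarantee.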

In Prime, we utilize a mix. The parameter updates in the outer loop strive for sequential consistency (everyone applies updates in the same order), but the gradient accumulation in the inner loop is purely local, and the checkpoint recovery uses a causal check (step count) to prevent rolling back training progress. 17


Part V: Architecture of Prime (DiLoCo)

Now we map these theoretical foundations to the concrete design of Prime/DiLoCo.

5.1 The Two-Loop Strategy

DiLoCo (Distributed Low-Communication) decouples the training into two time scales to circumvent the bandwidth limits of the WAN.

  1. Inner Loop (Local): Nodes run standard SGD on local data for \(k\) steps (e.g., 500 steps). This generates a “pseudo-gradient”—the difference between the starting weights and the current weights. This happens entirely in standard data-parallel groups (using NCCL within a region).
  2. Outer Loop (Global): The pseudo-gradients are averaged across all regions. This is the Consensus step. We are agreeing on the new global state.

By reducing the frequency of global synchronization by a factor of \(k\), we reduce the bandwidth requirement by \(k\). This allows us to train across continents where latency is high (100ms+) and bandwidth is low/expensive. 18
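The two-loop structure can be demonstrated on a toy scalar model. This sketch is illustrative only (a quadratic loss per node, not the Prime training loop; all names are hypothetical):

```python
def local_sgd(w, target, lr=0.1, k=50):
    """k inner-loop SGD steps on the toy loss (w - target)^2."""
    for _ in range(k):
        w -= lr * 2 * (w - target)     # gradient of (w - target)^2
    return w

def outer_step(global_w, targets, outer_lr=1.0, k=50):
    """One outer step: every node runs k local steps from the same
    global weights, then the pseudo-gradients are averaged and applied."""
    pseudo_grads = []
    for t in targets:                  # one "node" per target
        w_end = local_sgd(global_w, t, k=k)
        pseudo_grads.append(global_w - w_end)   # pseudo-gradient
    avg = sum(pseudo_grads) / len(pseudo_grads) # the All-Reduce step
    return global_w - outer_lr * avg            # outer "SGD" update

w = 0.0
targets = [1.0, 3.0]   # each node's local data pulls toward its target
for _ in range(10):
    w = outer_step(w, targets)
assert abs(w - 2.0) < 1e-3   # converges to the consensus of the nodes
```

Only one value per node crosses the “network” per outer step, yet the global weights converge to the average the nodes would have agreed on with step-by-step synchronization.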

5.2 Elasticity and Membership (The “Raft” of Prime)

A critical requirement for Prime is Elasticity. Nodes on spot instances effectively “crash” randomly. We need a membership service.

We implement a dynamic view system.

  • Store: A persistent KV store (like etcd or a TCPStore) holds the “Ground Truth.”
  • Heartbeats: Every node writes a timestamp to the store every \(T\) seconds.
  • Leader: One node is elected (via lowest rank or a lock in the store) to monitor heartbeats.
  • Re-Init: If the leader detects a timeout (FLP suspicion), it declares a new “View.” It posts a new rank_map to the store.
  • Barrier: All surviving nodes pause, read the new map, rebuild their NCCL/Gloo communicators, and resume.

This is a specialized implementation of Virtual Synchrony. The system moves through a sequence of views. Within a view, membership is static. 19
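A minimal sketch of the leader-side monitor (a hypothetical class, with timestamps passed in explicitly for determinism rather than read from a wall clock):

```python
class MembershipMonitor:
    """Leader-side failure suspicion via heartbeats (illustrative).
    A timeout is only a *suspicion* of death, never proof (FLP)."""
    def __init__(self, timeout):
        self.timeout = timeout
        self.heartbeats = {}      # node -> last heartbeat timestamp
        self.view_id = 0

    def heartbeat(self, node, now):
        self.heartbeats[node] = now

    def check(self, now):
        """Drop silent nodes; declare a new view if membership changed."""
        alive = {n: t for n, t in self.heartbeats.items()
                 if now - t <= self.timeout}
        if set(alive) != set(self.heartbeats):
            self.view_id += 1     # the cluster moves to a new view
        self.heartbeats = alive
        return sorted(alive), self.view_id

mon = MembershipMonitor(timeout=5.0)
mon.heartbeat("A", now=0.0)
mon.heartbeat("B", now=0.0)
mon.heartbeat("A", now=4.0)       # B goes silent
members, view = mon.check(now=6.0)
assert members == ["A"] and view == 1
```

The view id is the logical clock of membership: survivors rebuild their communicators only after agreeing on the same view.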

5.3 The Causal Checkpoint

When a new node joins (or a node recovers), it needs the current model state. It cannot just start from zero.

  • Mechanism: The joiner sends an RPC: GetCheckpoint(view_id).
  • Invariant: The joiner must not participate in the Outer Loop until it has a model state consistent with the current outer_step.
  • Implementation: We attach a monotonic counter (Logical Clock) to every checkpoint. The joiner downloads the checkpoint, verifies ckpt.step == cluster.step, and only then enters the compute mesh.
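The invariant reduces to a simple guard. A hedged sketch, with a hypothetical `fetch_checkpoint` callback standing in for the checkpoint RPC:

```python
def join_cluster(local_step, cluster_step, fetch_checkpoint):
    """Guard for the causal-checkpoint invariant: only enter the mesh
    with state matching the cluster's current outer step (sketch)."""
    if local_step != cluster_step:
        local_step = fetch_checkpoint()   # stands in for the RPC call
        if local_step != cluster_step:
            raise RuntimeError("checkpoint is stale; cannot join view")
    return local_step                     # safe to join the compute mesh

assert join_cluster(7, 7, fetch_checkpoint=lambda: 7) == 7   # already current
assert join_cluster(3, 7, fetch_checkpoint=lambda: 7) == 7   # catches up
try:
    join_cluster(3, 7, fetch_checkpoint=lambda: 5)           # stale peer
    raise AssertionError("should have been rejected")
except RuntimeError:
    pass
```

The step counter plays the role of a Lamport clock here: a joiner with a smaller step is known to be in the cluster’s causal past and must not contribute pseudo-gradients.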

5.4 Bandwidth-Aware Topologies

The “All-Reduce” operation (summing gradients) usually happens on a Ring or a Tree. In a WAN, not all links are equal. A generic Ring might route traffic A -> B -> C. If the A->B link is slow, the whole cluster suffers (head-of-line blocking).

Prime constructs a weighted graph of the network (using active bandwidth measurements). It then solves a Traveling Salesman Problem (TSP) variant to find the Hamiltonian path that maximizes total throughput. This is a static optimization performed at every “View Change.”

Furthermore, we use Quantization as a form of “approximate communication.” We cast 32-bit floats to 8-bit integers for the network traversal. This loses precision.

  • Systems View: We are relaxing the “correctness” of the message content to gain “liveness” (throughput).
  • Math View: We use an “Error Feedback” mechanism (keeping the quantization error locally) to ensure that the error cancels out over time and doesn’t diverge.
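Error feedback can be demonstrated with a coarse rounding grid standing in for int8 casting (a toy sketch, not Prime’s quantizer):

```python
def quantize_with_feedback(grad, residual, scale=0.1):
    """Round to a coarse grid (stand-in for fp32 -> int8 casting) and
    keep the rounding error locally for the next round (error feedback)."""
    adjusted = grad + residual            # re-inject last round's error
    q = round(adjusted / scale) * scale   # the lossy value sent over the WAN
    return q, adjusted - q                # new local residual

residual, sent_total, true_total = 0.0, 0.0, 0.0
for g in [0.03, 0.04, 0.02, 0.05, 0.01] * 4:
    q, residual = quantize_with_feedback(g, residual)
    sent_total += q
    true_total += g

# The cumulative transmitted signal never drifts more than half a
# quantization step from the true signal: the error cancels out.
assert abs(sent_total - true_total) <= 0.05 + 1e-9
```

Without the residual term, each round’s rounding error would be lost and the transmitted sum could drift arbitrarily far from the true sum.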


Part VI: Implementing the Roadmap

To build Prime from scratch, you do not start with PyTorch. You start with the primitives discussed in Stages 1-4.

Phase 1: The Simulator

Write a Python script that simulates generic processes with message queues.

  • Implement Lamport Clocks.
  • Implement a randomized delay function (chaos).
  • Exercise: Create a trace where Process A thinks \(X\) happened before \(Y\), but Process B thinks \(Y\) happened before \(X\). Prove they are concurrent.

Phase 2: The Consensus Engine

Implement a minimal Raft in Python.

  • Don’t optimize. Use a while True loop and time.sleep.
  • Invariant Check: Write a separate thread that constantly reads the logs of all nodes and asserts LogMatching (if logs match at index \(i\), they match at \(i-1\)).
  • Chaos: Kill the leader. Watch the election. Kill the network. Watch the stall.

Phase 3: The Transactional Store

Build a Key-Value store on top of your Raft log.

  • Implement Put(k, v) and Get(k).
  • Implement Linearizable Reads: A read must go to the leader and confirm it is still the leader (heartbeat check) before returning.

Phase 4: The Tensor Layer

Now, replace “strings” in your KV store with “Tensors.”

  • Replace Put with AllReduce (conceptually, a write that combines values).
  • Implement the Inner/Outer Loop logic.
  • Add the Elasticity: When a node dies, trigger the “View Change” logic from Phase 2 to exclude it, re-mesh, and continue.

Phase 5: The Prime System

Finally, swap your Python queues for TCP sockets (Gloo/NCCL), your in-memory logs for disk checkpoints, and your simple tensors for LLM weights.

You have now built DiLoCo.


Conclusion

Distributed systems is a discipline of pessimism. We assume the network is hostile, clocks are liars, and nodes are ephemeral. By accepting these hard truths and building upon the rigorous mathematical foundations of partial orders, consensus, and consistency models, we can construct systems like Prime that appear coherent and reliable.

The code in this repository is the artifact of this thinking process. As you explore src/, look for the shadows of Lamport, the echoes of Paxos, and the safeguards against FLP. They are the invisible struts holding the system up. 20


  1. Strictly speaking, modern superscalar CPUs with out-of-order execution do not execute instructions sequentially. They speculate, reorder, and parallelize. However, the CPU’s commit logic maintains the illusion of sequential execution for the programmer. In distributed systems, this illusion is shattered; there is no global CPU to maintain the facade.↩︎

  2. The Network Time Protocol (NTP) attempts to synchronize these clocks by exchanging messages with reference servers. However, NTP itself is subject to network latency variance (jitter). In a busy datacenter where queue depths fluctuate, NTP can typically guarantee synchronization only within tens of milliseconds. For a database or a high-frequency trading bot, 10ms is an eternity.↩︎

  3. This axiom is the bridge that connects the disjoint timelines of separate processes. Without message passing, every process would exist in its own isolated universe, and no inter-process ordering could be established.↩︎

  4. In Special Relativity, the “light cone” defines the region of spacetime that a particle can influence (future) or be influenced by (past). Events outside the light cone are causally disconnected. In distributed systems, the speed of light is replaced by the latency of the network.↩︎

  5. Strictly speaking, modern superscalar CPUs with out-of-order execution do not execute instructions sequentially. They speculate, reorder, and parallelize. However, the CPU’s commit logic maintains the illusion of sequential execution for the programmer. In distributed systems, this illusion is shattered; there is no global CPU to maintain the facade.↩︎

  6. The Network Time Protocol (NTP) attempts to synchronize these clocks by exchanging messages with reference servers. However, NTP itself is subject to network latency variance (jitter). In a busy datacenter where queue depths fluctuate, NTP can typically guarantee synchronization only within tens of milliseconds. For a database or a high-frequency trading bot, 10ms is an eternity.↩︎

  7. This axiom is the bridge that connects the disjoint timelines of separate processes. Without message passing, every process would exist in its own isolated universe, and no inter-process ordering could be established.↩︎

  8. In Special Relativity, the “light cone” defines the region of spacetime that a particle can influence (future) or be influenced by (past). Events outside the light cone are causally disconnected. In distributed systems, the speed of light is replaced by the latency of the network.↩︎

  9. In engineering terms, this means we can never be certain that a peer has received our last message. We must design systems that rely on quorums and timeouts rather than certainty.↩︎

  10. While FLP is mathematically terrifying, it is practically manageable. Real networks are not purely asynchronous; they are “partially synchronous.” We use timeouts. If a leader doesn’t respond in 500ms, we assume it’s dead. We might be wrong (violating the async model), but this assumption allows us to progress. We trade theoretical impossibility for practical probability.↩︎

  11. Paxos is notoriously difficult to implement correctly. The paper “Paxos Made Simple” is deceptively titled. Most “Paxos” implementations in the wild (Chubby, Zookeeper) are actually variations like Multi-Paxos or Zab, which optimize for a steady stream of decisions (logs) rather than single values.↩︎

  12. In engineering terms, this means we can never be certain that a peer has received our last message. We must design systems that rely on quorums and timeouts rather than certainty.↩︎

  13. While FLP is mathematically terrifying, it is practically manageable. Real networks are not purely asynchronous; they are “partially synchronous.” We use timeouts. If a leader doesn’t respond in 500ms, we assume it’s dead. We might be wrong (violating the async model), but this assumption allows us to progress. We trade theoretical impossibility for practical probability.↩︎

  14. Paxos is notoriously difficult to implement correctly. The paper “Paxos Made Simple” is deceptively titled. Most “Paxos” implementations in the wild (Chubby, Zookeeper) are actually variations like Multi-Paxos or Zab, which optimize for a steady stream of decisions (logs) rather than single values.↩︎

  15. The beauty of Raft is its strong leadership. In Paxos, any node can propose. In Raft, data only flows from Leader to Follower. This simplifies the logic but puts a burden on the leader. In Prime/DiLoCo, we use a variation of this for the elastic mesh: a “coordinator” manages the group membership log.↩︎

  16. This is the CAP Theorem in action. If you want Linearizability (Consistency), you must sacrifice Availability during a Partition (CP system). If you want every read to succeed immediately (Availability), you might serve stale data (AP system).↩︎

  17. Using strict linearizability for parameter servers is overkill and performance suicide. We permit “staleness” (Bounded Staleness) where workers might compute gradients on weights that are a few milliseconds old. The noise inherent in SGD (Stochastic Gradient Descent) makes the algorithm tolerant to this lack of strict consistency.↩︎

  18. In distributed systems terms, the Inner Loop allows replicas to diverge (concurrent writes). The Outer Loop is a Merge operation that reconciles this divergence. The math of DiLoCo proves that averaging these divergent paths results in a valid training trajectory, effectively treating the local steps as a “batch.”↩︎

  19. Handling the transition between views is the hardest part. What happens to the gradients computed during the view change? In Prime, we discard the partial inner-loop progress of the failed view to ensure safety (Log Matching), or we checkpoint aggressively to save it.↩︎

  20. Recommended reading to continue this journey: “Designing Data-Intensive Applications” by Martin Kleppmann, and the original Raft paper “In Search of an Understandable Consensus Algorithm” by Ongaro and Ousterhout.↩︎