An introduction to Conflict-Free Replicated Data Types

Part 6: Time and Causality

Reading time: 14 minutes.
Series navigation 📜
  1. Preliminaries
  2. Algebras & contracts
  3. Lattices
  4. Combinators
  5. Tombstones
  6. Time
  7. Registers and Deletion
  8. Outlook

This is an interactive tutorial series about Conflict-Free Replicated Data Types, or CRDTs for short. Their purpose is to allow seamless replication of data on different nodes in a distributed system. Merging is by construction always possible, without any conflicts. This series assumes no knowledge about CRDTs, but be prepared to learn a thing or two about algebras. All code samples on this page are interactive and executed in your browser. Understanding the code is necessary for understanding the concepts, so you should be familiar with JavaScript. If you notice any bugs on this page, please let me know!


Let’s talk about distributed systems.

In the previous episodes, I’ve shown you “event logs” like these:

  1. Alice and Bob start with the set containing just 1.
  2. Internet connection fails.
  3. Alice adds a 2.
  4. Bob adds a 3.
  5. Internet connection is restored.

In a distributed system, Alice and Bob would use different nodes in a network. Even though the event log pretends to be linearly ordered, that is not how such a network actually behaves. For example, since the connection failed in Step 2, neither Alice nor Bob knows which of them added their element first. As far as they are concerned, they performed these steps independently of each other.

But why is that important? Well, sometimes we want to impose an order on events. But the theory of distributed systems teaches us that this is very hard.

In his seminal paper “Time, clocks, and the ordering of events in a distributed system”, Leslie Lamport outlines the fundamental notions of time and causality in networks. In this post, I will try to summarize the parts that are required for understanding CRDTs, liberally quoting from the paper.

In the introduction, Lamport writes:

The concept of time is fundamental to our way of thinking. It is derived from the more basic concept of the order in which events occur. […] The concept of the temporal ordering of events pervades our thinking about systems. […] However, we will see that this concept must be carefully reexamined when considering events in a distributed system.

The analogy Lamport uses is that of clocks. When ordering events, we humans look at the time that these events happened – using a wristwatch or some other clock – and then compare the time. For example, an event that happened on 1970-01-01 00:00:00 UTC is said to have happened before another event that happened on 2038-01-19 03:14:08 UTC, because 1970-01-01 00:00:00 UTC < 2038-01-19 03:14:08 UTC. Timezone shenanigans aside, this is very convenient for human life because we can always tell the ordering of two events.

For a distributed system, you are going to need a big clock (e.g. the Ghibli Clock in Tokyo)

In a distributed system, you don’t get that luxury. Lamport defines such a system as “a collection of distinct processes which are spatially separated” – e.g., running on different physical nodes – “which communicate with one another by exchanging messages”. We generally assume that exchanging messages takes time, for example by sending them over a network cable.

The key insight (if you take away only one thing, it should be this): In a distributed system, events occurring across processes can only be ordered through messages. What does that mean? If process A performs some action and process B performs some action, neither knows anything about their relative order. Only if process A sends a message M to process B do we know that sending the message must have happened before receiving it.

When I first read about this, it seemed “too obvious”, but it is actually a profound insight. The consequence is that events in a distributed system are only partially ordered, and we call that ordering causality.

Lamport proceeds to define this relation which he calls → (read: “happened before”):

  1. If a and b are events in the same process, and a comes before b, then a → b.
  2. If a is the sending of a message by one process and b is the receipt of the same message by another process, then a → b.
  3. If a → b and b → c, then a → c.

Two distinct events a and b are said to be concurrent if neither a → b nor b → a.

This should sound oddly familiar to you, my dear reader. In fact, the third rule is transitivity, and the entire thing defines → to be a partial ordering. Note though that there is a small difference in the way Lamport defines this and the way I defined this in an earlier episode: → is not reflexive, so it would be more accurate to compare → to < on numbers (instead of ≤).

The second important consequence of this insight is that if two events are concurrent, then they cannot causally affect each other. If you reconsider Alice’s and Bob’s event log, that means that Steps 3 and 4 are independent by construction. It even turns out that according to this model, it doesn’t really matter whether the network connection gets lost or not; all that matters is whether or not they (successfully) exchange messages.

To make this concrete, consider the following example of a distributed system with three nodes:

[Figure: a diagram of three nodes A, B, and C exchanging messages; see the text below for a description.]

Let’s unpack this since there’s a lot going on there.

We have three nodes, called A, B, and C. Each event in this diagram has a number assigned to it in a little box, which we can ignore for now. More importantly, each node assigns a label to an event as it occurs within that node. For example, node C sends a message to node B. The sending event is assigned the label C₁, the receiving event the label B₁. We can say that C₁ has caused the following events:

  1. B₁, because a message must be sent before it can be received
  2. every later event in node B (B₂ and B₃), because they come after B₁ in the same process
  3. by transitivity, everything those events cause in turn (A₁ and C₂, as described below)

The diagram proceeds with B sending a message to A. The sending is now assigned B₂ and the receipt A₁. Observe how the “local” label in A has a smaller number than in B, yet B₂ → A₁ holds. By transitivity, C₁ → A₁ also holds.

Independently of A, B sends another message to C, labelled with B₃ and C₂. We say that B₃ and A₁ are concurrent. Locally, we can always say that Xᵢ → Xⱼ for any node X if i < j, but for different nodes, we need to look at their communication.

There’s more happening in the diagram, but by now, I think you get the idea.
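
To make the relation tangible, here is a minimal sketch in JavaScript. It encodes the diagram’s local-order and message edges by hand; the edges array and the happenedBefore/concurrent helpers are made up for this illustration, not part of any library:

```javascript
// Edges of the happened-before relation from the diagram:
// local order within a node (rule 1) and message sends (rule 2).
const edges = [
  ["C1", "B1"], // C sends a message, B receives it
  ["B1", "B2"], // local order within node B
  ["B2", "A1"], // B sends a message, A receives it
  ["B2", "B3"], // local order within node B
  ["B3", "C2"], // B sends another message, C receives it
];

// Rule 3 (transitivity): follow edges recursively.
function happenedBefore(a, b) {
  return edges.some(([x, y]) => x === a && (y === b || happenedBefore(y, b)));
}

// Two distinct events are concurrent if neither happened before the other.
function concurrent(a, b) {
  return a !== b && !happenedBefore(a, b) && !happenedBefore(b, a);
}

console.log(happenedBefore("C1", "A1")); // true, by transitivity
console.log(concurrent("B3", "A1"));     // true, as claimed above
```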

Before we move on, let’s take a quick look at the numbers in the little boxes. You’ll notice that each node has its own number; they all start with zero; and they vaguely increase from left to right, giving an impression of global time. This is called a Lamport clock. It’s not quite global time, but it’s a nifty mechanism to synchronize clocks across different machines. The idea is as follows:

  1. Every node keeps a local counter, starting at zero.
  2. Whenever a node performs an event (including sending a message), it increments its counter; outgoing messages carry the sender’s current counter value.
  3. Whenever a node receives a message, it sets its counter to the maximum of its own value and the value in the message, plus one.

Try tracing this concept through the diagram above. For example: when C sends its first message, it increments its counter and includes the Lamport time 1 in the message. When B receives that message, B’s local time is still 0. B notices that the received time is 1, takes the maximum of both values, and adds 1, bringing the local clock in B to 2.
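
If you prefer code over prose, here’s a minimal Lamport clock in JavaScript, following the three rules above; the class and method names are my own invention:

```javascript
class LamportClock {
  constructor() {
    this.time = 0; // every node starts at zero
  }

  // A local event (including sending a message) increments the clock.
  tick() {
    this.time += 1;
    return this.time;
  }

  // Outgoing messages carry the sender's current (incremented) time.
  send() {
    return this.tick();
  }

  // On receipt, take the maximum of both clocks, plus one.
  receive(messageTime) {
    this.time = Math.max(this.time, messageTime) + 1;
    return this.time;
  }
}

const c = new LamportClock();
const message = c.send();        // C's clock goes 0 to 1; the message carries 1
const b = new LamportClock();
console.log(b.receive(message)); // max(0, 1) + 1 = 2, matching the diagram
```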

Keep in mind that the Lamport time is not a global ordering of events! Despite the fact that B₃ and A₁ both happen at time 4, they’re concurrent. We do know though that whenever an event a happens before b (i.e. a → b), then the Lamport clock at a must be smaller than at b.¹ This is ensured by incrementing the clock when sending and receiving a message.

Harrison Ford saying 'What do you want' in Blade Runner 2049

Are you still here? Good. Because now I can tell you what all of this has to do with CRDTs. So, let’s talk about replicants… I mean, replicas. We’re going to take a closer look at the Shapiro paper that I referenced in Episode 4. Recall the definition of state-based CRDTs: they

  1. have a join-semilattice
  2. only support monotonic operations

So far, so good. We’ve covered that in previous episodes.

But what I actually haven’t told you is one of the central guarantees that this definition gives us. Shapiro et al. call it – in academic modesty – “Proposition 2.1” (it should’ve been a Theorem):

Any two object replicas of a CvRDT eventually converge, assuming the system transmits payload infinitely often between pairs of replicas over eventually-reliable point-to-point channels.

Let’s unpack this.

The term CvRDT stands for Convergent Replicated Data Type and is synonymous with state-based CRDT (the latter term now being preferred). The statement now claims something about convergence of object replicas that exchange messages. This deserves some further explanation.

Our basic assumption here is that we have a distributed network of nodes. Each node stores the same CRDT, but may be at a different state. At some point, Alice knows that the set contains the elements {1, 2} and Bob knows that the set contains {2, 3}. Whenever they feel like it (or when they have carrier pigeons available), they can send their state to each other.

For example, Alice tells Bob that her set has the elements {1, 2}, which will cause Bob to update his set to {1, 2, 3}. Bob may do the same in reverse. Payload refers to the internal state of the CRDT; here, that is the set of elements. This is the data that is exchanged between nodes. For a 2P-Set, that would be two sets (or a map, according to the generic representation).

Convergence now means that if Alice and Bob keep sending each other updates, and these updates are delivered at some point, they end up in the same state. It may take a while, but it only takes finitely many carrier pigeons to converge.
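
As a concrete sketch, here is how that exchange could look with a grow-only set in JavaScript. The GSet class and its method names are made up for this illustration; a 2P-Set would carry two such sets as its payload:

```javascript
class GSet {
  constructor(elements = []) {
    this.payload = new Set(elements); // the "payload" from Proposition 2.1
  }

  // The only update operation: adding is monotonic.
  add(element) {
    this.payload.add(element);
  }

  // Join is set union: always defined, commutative, associative, idempotent.
  merge(other) {
    for (const element of other.payload) {
      this.payload.add(element);
    }
  }

  toArray() {
    return [...this.payload].sort((x, y) => x - y);
  }
}

const alice = new GSet([1, 2]);
const bob = new GSet([2, 3]);
bob.merge(alice);             // Alice sends her payload to Bob...
alice.merge(bob);             // ...and Bob sends his back
console.log(alice.toArray()); // [1, 2, 3]
console.log(bob.toArray());   // [1, 2, 3] (converged)
```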

If you’re happy with what you’ve read so far and are not overly interested in abstract symbols, then you can call it a day; there’s not much more happening in this episode. However, if you’re like me, you can stay for a little longer and read about the formal definition of convergence. Note that the side note on abstract data types is necessary for understanding the following.

We first start with the causal history of a replica. Let x be any object and xᵢ the various replicas of that object. Furthermore, assume that f is any monotonic state update function, and join is the usual join operation on the lattice for x.

Now we can define the causal history CH as a set of operations (quoted from the paper):

  1. initially, CH(xᵢ) = {}
  2. after executing the f function, CH(f(xᵢ)) = CH(xᵢ) ∪ {f}
  3. after joining, CH(join(xᵢ, xⱼ)) = CH(xᵢ) ∪ CH(xⱼ)

This history essentially provides a trace of operations that have occurred on a replica. Note that the history is unordered, and operations are unique (in other words, if two replicas apply the “same” state update, that results in two distinct operations). Some authors prefer an ordered list, but for defining convergence, this one seems more convenient to me.

An immediate consequence of this definition is that the causal ordering relation coincides with the subset relation on these histories. Let’s say we want to compare the events a and b. If the causal history at point a is a subset of the causal history at point b, then a → b. In this scenario, an event can either be:

  1. before another event (its causal history is a subset of the other’s),
  2. after another event (its causal history is a superset of the other’s), or
  3. concurrent with another event (neither history is a subset of the other).

Let’s look at another diagram.

[Figure: Alice’s and Bob’s replicas exchanging their sets; see the text below for a description.]

This is the same scenario as above, where Alice adds 2 to her set and Bob adds 3 to his. Before they exchange messages, the causal history of Alice is {add(2)}. For Bob, it is {add(3)}. This means that both events are concurrent, since neither history is a subset of the other. When Alice sends her set to Bob, his causal history becomes {add(2), add(3)}. We also know that both add(3) and add(2) have logically happened before the join, because both earlier histories are subsets of Bob’s causal history at this point.
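
The following sketch models these causal histories in JavaScript. Replica, opId, and isSubset are names I made up for this example; the unique operation IDs implement the rule that the “same” update on two replicas counts as two distinct operations:

```javascript
let nextOpId = 0;
// Tag each operation so that equal-looking updates stay distinct.
function opId(name) {
  return `${name}#${nextOpId++}`;
}

class Replica {
  constructor() {
    this.payload = new Set(); // the actual set of elements
    this.history = new Set(); // CH, initially {}
  }

  add(element) {
    this.payload.add(element);
    this.history.add(opId(`add(${element})`)); // CH becomes CH union {f}
  }

  join(other) {
    for (const e of other.payload) this.payload.add(e);
    for (const f of other.history) this.history.add(f); // union of histories
  }
}

// a happened before b iff the causal history at a is a subset of that at b.
function isSubset(a, b) {
  return [...a].every((f) => b.has(f));
}

const alice = new Replica();
const bob = new Replica();
alice.add(2);
bob.add(3);
// Neither history contains the other: the two adds are concurrent.
console.log(isSubset(alice.history, bob.history)); // false
console.log(isSubset(bob.history, alice.history)); // false
bob.join(alice);
// Both adds happened before the join:
console.log(isSubset(alice.history, bob.history)); // true
```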

Finally, let’s look at the formal definition of convergence. Then we can move on to investigate why Proposition 2.1 holds.

Quoted from the paper: Two replicas xᵢ and xⱼ of an object x converge eventually if:

  1. CH(xᵢ) = CH(xⱼ) implies that the states of xᵢ and xⱼ are equivalent (safety)
  2. for each f in CH(xᵢ), f will be in CH(xⱼ) eventually (liveness)

After all you’ve seen so far in this series, this should be common sense. A join operation shouldn’t invent or destroy any values from other nodes, and it should include the correct values at some point.

Let’s conclude by revisiting Proposition 2.1:

Any two object replicas of a CvRDT eventually converge, assuming the system transmits payload infinitely often between pairs of replicas over eventually-reliable point-to-point channels.

The reason why this holds is that we’ve set up the join operation and the monotonic updates accordingly. The join of two objects is always defined and can be applied whenever an update arrives; together with the eventually-reliable channels, this guarantees liveness. Safety is ensured because join is commutative, associative, and idempotent: replicas that have seen the same set of updates compute the same state, no matter in which order the updates arrived.
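
Here’s a quick sketch of the safety half: replicas that received the same updates in a different order still compute the same state, because the join (plain set union here) doesn’t care about order or duplicates:

```javascript
// Join two states by set union.
const join = (a, b) => new Set([...a, ...b]);

const updates = [new Set([1]), new Set([2]), new Set([3])];

// Apply the same updates in opposite orders.
const x = updates.reduce(join, new Set());
const y = [...updates].reverse().reduce(join, new Set());

console.log([...x].sort()); // [1, 2, 3]
console.log([...y].sort()); // [1, 2, 3] (same state, different order)
```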

The Doctor explaining that time is more like a big ball of wibbly-wobbly, timey-wimey... stuff

To conclude: even though uniform, global time is hard to define in a distributed system, we still have a formal notion of causality. Equipped with this, we can also define the unique properties of CRDTs, namely that they eventually arrive at the same state on all nodes, assuming that the replicas can deliver updates to each other. In the next episode, we will use some of that knowledge to look at more sophisticated notions of deletion.

References

  1. Let LC(x) be the Lamport time when event x occurred on a particular node. Formally speaking, if a → b, then LC(a) < LC(b). This is called clock consistency. The converse (called strong clock consistency) does not hold in general. But by contraposition we obtain that if LC(a) ≥ LC(b), then a couldn’t have affected b.


Thanks to the people who've read drafts of this series and provided valuable feedback: Andrea, Clement Delafargue, Heiko Seeberger, Hillel Wayne, Johannes Link, Matthew Weidner, Princess.