Part II of Series: Data-Intensive Application
To build modern data-intensive applications, it’s almost mandatory for them to be distributed. And in every distributed system, data replication and partitioning play vital roles in the design.
Replication means keeping a copy of a dataset on multiple machines, usually connected via a network.
There are many reasons why you’d want to do this:
- Keeping data geographically close to your users to minimise latency.
- Increasing availability and fault tolerance in case some nodes fail.
- Increasing throughput and performance by allowing multiple nodes to spread the load and handle many more requests as a whole.
Partitioning, on the other hand, means splitting your data across different nodes. The main reason for partitioning is scalability. When a dataset is too big to physically fit on one machine, or it’s simply inefficient to keep it there, we usually partition it.
A simple use case for partitioning is splitting data geographically, where only users in a certain region need the data for their operations. For example, in an ordering system, Asian orders are only needed in Asia, while European orders are only needed in Europe.
Replication and partitioning are not mutually exclusive. In fact, they are usually combined to achieve better performance, where smaller parts of a big dataset are stored and replicated across different nodes. This means that portions of the data live in multiple places, which in turn increases fault tolerance.
Replication Strategies
If the data you are storing never changes, you don’t really need a replication strategy: you just store it in different places and forget about it. But as you can imagine, that’s rarely the case.
The main challenge for replication is dealing with keeping data up-to-date when changes do happen. For this, there are many popular strategies:
Single Leader
As you can imagine by the name, this strategy consists of having a single leader and multiple followers handling the replication. Each node will store a replica of the data, and one of the nodes is elected the leader. Application writes will go to this leader, and followers will receive replicated writes to stay up to date.
From here, there’s a world of possibilities. The most common scenario is that followers are read-only replicas, which clients can connect to in read-only mode.
Replication of data can happen synchronously or asynchronously, each of which has its own pros and cons.
Where replication is synchronous, the leader waits until the followers have confirmed they have the data before reporting success, whilst in asynchronous replication, the leader doesn’t wait for the followers’ confirmation.
The advantage of synchronous replication is consistency: you can guarantee that all nodes have up-to-date data, so if the leader fails, any follower can take its place. The disadvantage is that every write depends on all followers responding, which reduces performance. Worse, if a follower fails, the leader has to stop accepting writes until that follower is back up and running.
On the other hand, the advantage of asynchronous replication is that the leader can continue processing writes even if followers are down; they can play catch-up when they’re back online. The disadvantage, of course, is that we can end up with out-of-sync read replicas, and depending on the data you’re storing, that may or may not be a problem. Out-of-date bank transactions that require balance checking could be really problematic, but an out-of-date Facebook status might not be as bad.
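To make the difference concrete, here’s a minimal Python sketch of a toy single-leader setup. The `Leader` and `Follower` classes are purely hypothetical and stand in for real database machinery:

```python
# Illustrative sketch only: a toy single-leader setup contrasting synchronous
# and asynchronous replication. Not a real database API.

class Follower:
    def __init__(self):
        self.data = {}

    def replicate(self, key, value):
        self.data[key] = value  # apply the replicated write
        return True             # acknowledge back to the leader


class Leader:
    def __init__(self, followers, synchronous=True):
        self.data = {}
        self.followers = followers
        self.synchronous = synchronous
        self.pending = []  # async writes not yet sent to followers

    def write(self, key, value):
        self.data[key] = value
        if self.synchronous:
            # Wait until every follower has acknowledged before reporting success.
            for follower in self.followers:
                follower.replicate(key, value)
            return "ok, replicated everywhere"
        # Asynchronous: report success immediately, replicate in the background.
        self.pending.append((key, value))
        return "ok, replication still in flight"

    def catch_up(self):
        # Simulates the background process that streams pending writes to followers.
        for key, value in self.pending:
            for follower in self.followers:
                follower.replicate(key, value)
        self.pending.clear()


followers = [Follower(), Follower()]
leader = Leader(followers, synchronous=False)
print(leader.write("user:1", "Alice"))  # succeeds before followers have the data
print(followers[0].data)                # {} — a stale read is possible here
leader.catch_up()
print(followers[0].data)                # {'user:1': 'Alice'} — eventually consistent
```

In synchronous mode a write only returns after every follower has acknowledged it; in asynchronous mode it returns immediately and the followers catch up later, which is exactly where stale reads come from.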
Multi-Leader
To address some of the issues with single-leader setups, another strategy emerged: multi-leader replication, where more than one leader exists.
In this setup, the approach is similar to single-leader, but more than one node accepts writes, and each leader is a follower of other leaders at the same time.
As you can imagine, this is a more complex setup, and comes with its own set of advantages and problems.
The main advantage is that we can keep writing data even if one of the leaders fails; the disadvantage is how hard it becomes to keep these leaders in sync and to resolve conflicts when they arise. Think of collaborative editing applications such as Google Docs or Notion, where edits to each local copy can cause conflicts that need to be resolved.
There are many strategies for handling conflicts, ranging from conflict avoidance to converging towards consistent states, and the choice depends on the multi-leader topology. They are explained at length in Martin Kleppmann’s book.
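As a small illustration of one of the simpler convergence strategies, here’s a hypothetical last-write-wins sketch, where every write carries a timestamp and the later one wins. The `merge` helper and record shapes are made up for the example:

```python
# Illustrative sketch of last-write-wins (LWW) conflict resolution, one of the
# simpler ways a multi-leader system can converge on a single value.

def merge(write_a, write_b):
    """Keep the write with the later timestamp; ties broken by leader id."""
    return max(write_a, write_b, key=lambda w: (w["ts"], w["leader"]))

# Two leaders accepted conflicting writes for the same key while out of contact.
write_from_leader_1 = {"key": "doc:42", "value": "hello", "ts": 1_700_000_000.1, "leader": 1}
write_from_leader_2 = {"key": "doc:42", "value": "hola",  "ts": 1_700_000_000.7, "leader": 2}

winner = merge(write_from_leader_1, write_from_leader_2)
print(winner["value"])  # "hola": the later write wins, the earlier one is silently discarded
```

The cost is that the losing write is silently discarded, which is rarely acceptable for something like collaborative editing.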
Leaderless
The leaderless approach, of course, abandons the leader concept altogether and instead allows clients to send writes directly to several replicas. Databases such as DynamoDB and Cassandra take this approach.
This approach is even more complex, and most of the time it relies on quorums: a write is only considered successful once a majority of nodes have acknowledged it, and reads likewise consult several nodes.
To resolve conflicts, there are also multiple strategies. The replication system should ensure that eventually all data is copied to every replica.
The read-repair strategy lets clients detect stale values when reading from multiple nodes and update the outdated replicas accordingly.
The anti-entropy strategy has a background process that constantly looks for outdated data in replicas and copies it across.
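To make the quorum and read-repair ideas concrete, here’s a toy Python sketch. All of the names (`Replica`, `quorum_write`, `quorum_read`) are invented for illustration and are not the actual API of DynamoDB or Cassandra:

```python
# Toy sketch of quorum reads/writes with read repair in a leaderless setup.
# Real systems add versioning schemes, hinted handoff, and much more.

class Replica:
    def __init__(self):
        self.store = {}  # key -> (version, value)

    def put(self, key, version, value):
        current_version, _ = self.store.get(key, (0, None))
        if version > current_version:
            self.store[key] = (version, value)
        return True  # acknowledge the write

    def get(self, key):
        return self.store.get(key, (0, None))


def quorum_write(replicas, key, version, value, w=2):
    # Send the write to every replica; succeed once at least w acknowledge it.
    acks = sum(1 for replica in replicas if replica.put(key, version, value))
    return acks >= w


def quorum_read(replicas, key, r=2):
    # Ask r replicas and keep the newest version seen.
    responses = [(replica, *replica.get(key)) for replica in replicas[:r]]
    _, newest_version, newest_value = max(responses, key=lambda resp: resp[1])
    # Read repair: push the newest value back to any stale replica we contacted.
    for replica, seen_version, _ in responses:
        if seen_version < newest_version:
            replica.put(key, newest_version, newest_value)
    return newest_value


replicas = [Replica(), Replica(), Replica()]
quorum_write(replicas, "cart:7", version=1, value=["milk"], w=2)
print(quorum_read(replicas, "cart:7", r=2))  # ['milk']
```

A common rule of thumb is to choose w and r so that w + r is greater than the number of replicas, which guarantees that at least one of the replicas consulted on a read has seen the latest write.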
Partitioning Schemes
All replication strategies apply equally to replication of partitions. This means that partitioned data can be replicated in the same way and using the same strategies as mentioned before.
The main difference when data is being partitioned, is of course, how we decide to do the partitioning itself.
For this, there are multiple partitioning schemes to choose from.
Our main goal is to spread the data and load evenly across all nodes. If the partitioning is not even, we could end up with skewed partitions and hot spots, or have to allocate uneven resources to certain nodes.
Partitioning by key range
One strategy is to assign a key to each item and store a contiguous range of keys in each partition. Some datasets have naturally occurring keys, for example a user’s creation timestamp or a UUID.
By knowing the boundaries of each partition, you can locate the node containing the data you’re looking for simply by using the given key. Within each partition you can also keep the keys sorted, which allows you to use efficient search algorithms.
The downside of this approach is that, depending on the nature of the key, we may still end up with hot spots, where some partitions work much harder than others.
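As a rough sketch of how a lookup might work under key-range partitioning (the boundaries and node names below are invented), a simple binary search over the partition boundaries is enough:

```python
# Illustrative sketch of key-range partitioning: each partition owns a
# contiguous key range, and a lookup finds the right one via binary search.
import bisect

# Each partition owns keys from its boundary (inclusive) up to the next boundary.
boundaries = ["a", "h", "p", "t"]  # partition 0: a–g, 1: h–o, 2: p–s, 3: t–z
partitions = ["node-0", "node-1", "node-2", "node-3"]

def partition_for(key):
    # bisect_right counts how many boundaries are <= the key.
    index = bisect.bisect_right(boundaries, key) - 1
    return partitions[max(index, 0)]

print(partition_for("martin"))  # node-1 (falls in h–o)
print(partition_for("zoe"))     # node-3
```

Because neighbouring keys live on the same node, range queries over the key stay efficient.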
Partitioning by Hash of Key
To attempt to solve this problem, many distributed data stores use a hash function to determine the partition given a key.
A good hash function takes skewed data and makes it uniformly distributed. Once you have a suitable hash function, you can assign each range of hashes to a partition and go from there.
The downside of this approach is losing the ability to do efficient range queries, since keys that are adjacent in sort order end up scattered across partitions.
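A minimal sketch of the idea, assuming an MD5 hash and a fixed number of partitions (real systems typically use purpose-built hash functions and smarter schemes such as consistent hashing so that adding nodes doesn’t reshuffle everything):

```python
# Illustrative sketch of hash partitioning: a stable hash of the key picks
# the partition, spreading skewed keys more evenly across nodes.
import hashlib

NUM_PARTITIONS = 4

def partition_for(key):
    # A deterministic hash of the key decides which partition stores it.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return "node-" + str(int(digest, 16) % NUM_PARTITIONS)

for key in ["alice", "alicia", "alina"]:
    print(key, "->", partition_for(key))
# Keys that are adjacent in sort order land on unrelated partitions,
# which is exactly why range queries over the original key get expensive.
```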
As you can see, the world of replication and partitioning is a complex one. Fortunately, most engineers don’t have to deal with the intricacies of this topic. However, it is very important that you understand it well, because with these concepts you can design systems that leverage the right properties of your setup and make them more efficient.
At the same time, understanding how to optimise large datasets for distributed systems is becoming more and more relevant in today’s world.
Next in the series we’ll see the difference between batch processing and stream processing.