Building distributed systems: The basics

Distributed systems are difficult to build, due to challenges around consistency and sharding. This post discusses a few more approaches on how to solve the challenges around consistency.


Creating a distributed database is always challenging but this challenge can be broken down into two parts:
  1. Sharding
  2. Fault Tolerance
Note: The order of these two parts can always be changed depending on if scaling or fault tolerance is the primary need for distributing a database.


Sharding is segmenting a single dataset into two or more independent units each of which can be operated on in isolation without impacting the other. Sharding strategies can be as simple as modhash or key range based distribution or as complex as advanced implementations of consistent hashing. Let's say you have a MySQL database table storing users, now due to the number of users and transactional load on the table you have concluded that horizontal scaling is a viable option for you. To shard this database table, you decide to store A-M on server 1 and M-Z on server two. Now you have two database instances instead of 1 providing you the ability (in theory) to handle 2x load. Due to this design change, any applications read or writing to the original database now need to be aware of this sharding scheme. Also, you will likely need to make an outage window to perform this split and migrate maneuver. Depending on the growth of your application, you might need to split a few more times, essentially following a geometric progression on the number of instances. Such sharding mechanism is usually referred to as static sharding. Databases like HBase and Cassandra perform this dynamically, performing splits when a load increase is detected.


Replication is the fault tolerance topic. This tends to be a more important for business critical systems since zero downtime is expected and needed to keep businesses running 24x7. Replication essentially implies cloning data from Server 'A' to Server 'B', so if 'A' went down, 'B' would take over, keeping the lights on. Things get challenging now, because quantum entanglement is not how most computer systems run; it's not possible for data to be completely in sync between two separate servers at any given point in time. E.g. if you created an entry for a user X on Server 'A' at time t1, that entry won't appear on Server 'B' right at time t1. This means that at time t1, Server 'B' is in an inconsistent state, so if there was a power failure on Server 'A' 1 nanosecond after t1, Server 'B' still doesn't know about user X. This is the classic consistency problem, more complex versions of this problem can be studied under the Byzantine General's problem (


Since we have reviewed how replication and sharding can be challenging and have understood at a conceptual level how this would work, let's see how we can address some of these problems and make it easier for engineers to implement the above and make their system distributed.

The WAL:

Write Ahead Log, commonly known as a WAL is a powerful construct leveraged by most database systems and stores a transaction log before committing an operation to the final storage layer. WAL provides durability and atomicity for transactions. In addition to these two capabilities, WAL plays critical role in creating distributed systems by providing a mechanism to replicate the transaction log of an instance instead of replicating snapshots of the database state. This is essentially state machine replication using logs and can provide serializable consistency. WAL can also be extended to track what the read and write offsets are which gives us what data is and isn't committed to the end storage layer.


So if we can replicate the WAL between two database servers we can simplify the complexity of replicating data thereby achieving fault tolerance. We can build a simple subsystem to read WAL on Server 'A' and replay that to Server 'B'. Problem solved? Wait, we have to worry about a few other things. What about consistency? How do we make sure that all replicas have identical WALs? Let's borrow a few pages from Kafka and wait for my next post.


  1. This comment has been removed by a blog administrator.

  2. It is always challenging for me to create a distributed database but this useful link helped me very much in creating a distributed database.

  3. Very decent post. Nice explanation of distributed systems setup .

  4. It is very important to live the life without any site best essay writing service review for worries or tension. It is also very important for the health of the people. For this purpose, you should get up early in the morning and do exercise which helps you to refresh your mind. It is very necessary for you.

  5. The rate of the divorce is going to site for high and the reason is that the couple did not create the mutual understanding and did not try to make the bond so strong. Well, you give the information about the property problems about the divorce and it is good.

  6. In the game competition, you need the mutual understand between the other players because you have to play like a team. Lots of the players feel not good because they are new and of course, environment also new for them. This video will really helpful.

  7. It is all the information of the California state and in this article, discuss the PPO license defense lawyer. Here in this site, you mention about selections of the lawyers and the standard of the rules which is the base of the selection.

  8. is a popular platform for streaming live sports and events online. It offers a wide range of options for sports fans to watch their favorite games and events in real-time, without the need for a cable subscription or expensive pay-per-view fees.


Post a Comment

Popular posts from this blog

Minuteman: Design Choices

Consistency, the untold story