Minuteman: Design Choices
Let's use this post to extend on how Minuteman works to solve the consistency and sharding problems.
Minuteman is an append only clustered WAL replication library that can be used to created a distributed AP or CP database. It has leader election, replica assignment and built-in check pointing so all you need to do is plug your own single node database and make a sharding choice; Minuteman will take care of the rest for you.
Push Vs. PullA push based replication design is better if there are N keys where N ranges in the millions whereas a pull based replication design is better if there are N keys where N ranges in the hundreds. Pull systems are simpler for consistency management.
Explanation:In a pull design the entity that is being pulled needs to be tracked, therefore if there are millions of them, there will be overhead of tracking these entities as well as keeping their pull cycles independent of each other so they don't impact each other. Additionally, regardless of whether or not there have been any updates, a pull system will issue a poll request checking to see if newer data is available. This is once again major overhead in case of millions of entities. This means you can't use a pull system in conjunction with granular sharding i.e. sharding on small key ranges. With this being said, it can be easier to correct inconsistencies by using a WAL and keeping track of fully replicated offsets Vs. un-replicated offsets and simply overwriting uncommitted offsets when there is disagreement in between replicas. This is how Kafka's replication subsystem operates. On the other hand, Push systems only make network operations when data changes, this makes them fairly efficient when it comes to replicating millions of disparate entities. While the overhead for tracking and empty calls is eliminated, replica divergence becomes an major issue. When a replica is temporarily unavailable, it's not accepting writes therefore, replicas will become inconsistent. Incase of failure, we may see data loss. One mechanism to fix consistency in this case is to either run a periodic check to compare copies of data and fix as needed or to perform consistency check on read, which is what Cassandra does (Merkel Tree/Node Repair Tool) Minuteman uses a Pull based system for WAL replication.
Single Write Location:It is problematic if 2 nodes (servers) can accept write for a single shard at the same time. To avoid the problem of distributed concurrency management, the simplest approach is to declare 1 node as master or leader and other node as a slave. The definition being:
- All writes can be performed on the master
- Reads can be performed on master (consistent) or slave (may be inconsistent)
This leader/master can either be elected or assigned. Usually it's easier to have the leader be assigned by yet another entity called a coordinator i.e. the Master of the metadata operations in a cluster and this coordinator is elected using a leader election algorithm built on a consistent system like Zookeeper / Atomix / Ratis etc. Minuteman uses the concepts of Leader, Replica and Coordinator.
Replica Assignment:A distributed coordination service (Zookeeper/Atomix/Ratis) provide some basic features building blocks, namely: - Leader Election - Node Discovery / Status - Configuration Management Leader Election: Decisions in the cluster need to be made by an authoritative machine that all nodes in the cluster listen to. This node is usually the leader or the master of the cluster. Node Discovery / Status: This has to do with when a new node has joined the cluster or an existing node has left the cluster and that this decision is recognized cluster wide. Configuration Management: Information about cluster metadata needs to be kept safe. This information is usually fairly small however it can't be left on a single machine (leader) since if the leader fails the cluster will likely be clueless about the metadata. Minuteman leverages the cluster connector (Atomix or Zookeeper) to perform the above operations Leader Election: Leader election in a Minuteman refers to two separate topics: Coordinator Election: Coordinator election is delegated to the coordination service, the current implementation for Minuteman is using Atomix which uses Raft consensus protocol and is also the underlying leader election system. Replica Leader Assignment: Once we have a designated coordinator in the cluster, we can use a round robin replica assignment algorithm that simply picks the next available node in the cluster to assign a leader replica to. For a brand new WAL, Leader replica is simply the first replica in the assignment list. For an existing WAL, the next leader will be the next available ISR in the list.
A Few Minuteman Terms:WAL: Comparable to a Kafka partition, a WAL or Write Ahead Log is an entity in Minuteman that needs to be replicated across one or more machines. Each WAL is uniquely identified by a key / id which is used to create, track and maintain the replicas for the WAL. Replica: A replica is similar to a Kafka Replica, it stores a copy of a particular WAL. When a new WAL is requested in a Minuteman cluster, the coordinator creates a Routetable entry for this WAL id and then assigns replicas to it based on a round robin assignment fashion. The coordinator also elects a leader for this WAL as well and tells all the replicas about this leader so that they can start pulling data from it. Leader: A leader Replica is the one that receives writes from the client and other replicas (followers) pull data from. This is the only Replica that receives data (from client) via a push mechanism whereas all other Replicas pull data the leader. When a WAL leader changes the coordinator notifies the replica to update their local route information and start pulling WAL from the new leader replica. Follower: Follower is something (usually a Replica) that is reading from the WAL of a replica. A follower could be another Replica if this is a WAL on the Leader or it could be a Local WAL Client that is performing operations on the database instance. Follower states are tracked by the WAL, this state tracking includes the offset it's currently reading, the WAL file it's currently reading and whether or not the WAL thinks this follower is an ISR. ISR: A follower that has it's offset withing "threshold" to the current write location of the WAL segment is considered an ISR or an In-Sync Replica. This threshold is configurable and needs to be tuned based on the consistency guarantees needed and the throughput needed. Coordinator: Coordinator is the current master of the Minuteman cluster and is responsible for all meta operations including replica leader assignment and node heartbeat maintenance. A Coordinator is a Minuteman cluster node that is elected to be a master using a leader election algorithm built on a consensus protocol Paxos (Zookeeper) or Raft (Atomix) Coordinator announces Replica changes to the nodes in the cluster e.g. when a node fails it will more the leader ISR from that node to another available ISR and let other Replicas know where to continue reading from.