High availability and fault tolerance
One of the major advantages of Hadoop is the high availability of a cluster. However, high availability also adds processing nodes to your requirements, which impacts sizing. The raw storage an HDFS cluster must provision is directly proportional to its Data Replication Factor (DRF); for example, if you have 200 GB of usable data and you need a high replication factor of 5 (meaning each data block is stored five times in the cluster), then you need to size for 200 GB x 5, which equals 1 TB (a quick back-of-the-envelope sketch of this rule follows the list below). The default value of the DRF in Hadoop is 3. A replication factor of 3 works well because:
- If one copy is corrupted, you can still recover from either of the remaining two copies
- Even if a second copy fails during the recovery period, you still have one copy of your data left to recover from
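
As a quick back-of-the-envelope check on the sizing rule above, the Python sketch below simply multiplies usable data by the replication factor; the function name and the figures are illustrative and restate the 200 GB x 5 example, not values from the book:

```python
# Rough HDFS sizing sketch: raw capacity to provision = usable data x DRF.
def raw_storage_gb(usable_data_gb: float, replication_factor: int) -> float:
    """Return the raw storage (in GB) needed for the given replication factor."""
    return usable_data_gb * replication_factor


if __name__ == "__main__":
    print(raw_storage_gb(200, 5))  # 1000 GB, roughly the 1 TB quoted above
    print(raw_storage_gb(200, 3))  # 600 GB with the Hadoop default DRF of 3
```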
When determining the replication factor, you need to consider the following parameters:
- The network reliability of your Hadoop cluster
- The probability of failure of a node in a given network
- The cost of increasing the replication factor by one
- The number of nodes or VMs that will make up your cluster
If you are building a Hadoop cluster with three nodes, a replication factor of 4 does not make sense, because there is no fourth node to hold the extra copy. Similarly, if the network is not reliable, a higher replication factor lets the NameNode serve a copy from a nearby available node. For systems with a higher probability of node failure, the risk of losing data is greater, because the chance of a second node failing before recovery completes also increases.
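
To tie these parameters together, here is a minimal Python sketch of that decision: it flags a replication factor larger than the node count and, under a naive assumption of independent node failures, estimates the chance that every replica of a block is lost. The function name, the 2% failure probability, and the independence assumption are illustrative, not figures from the book:

```python
def all_replicas_lost_probability(num_nodes: int, replication_factor: int,
                                  node_failure_prob: float) -> float:
    """Estimate the chance that every replica of a block is lost,
    assuming independent node failures (an illustrative simplification)."""
    if replication_factor > num_nodes:
        raise ValueError("replication factor cannot exceed the number of nodes")
    # All copies are lost only if every node holding a replica fails.
    return node_failure_prob ** replication_factor


if __name__ == "__main__":
    # Three-node cluster, assumed 2% chance a node fails before re-replication.
    print(all_replicas_lost_probability(3, 3, 0.02))  # ~8e-06
    # all_replicas_lost_probability(3, 4, 0.02) would raise ValueError,
    # matching the point that a factor of 4 makes no sense on three nodes.
```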