
High availability and fault tolerance

One of the major advantages of Hadoop is the high availability of the cluster. However, replication places an additional storage burden on the processing nodes, and this must be accounted for when sizing. The raw storage a cluster needs is directly proportional to the Data Replication Factor (DRF); for example, if you have 200 GB of usable data and you need a high replication factor of 5 (meaning each data block is replicated five times across the cluster), then you need to size for 200 GB x 5, which equals roughly 1 TB (a simple sizing sketch follows the list below). The default value of the DRF in Hadoop is 3. A replication value of 3 works well because:

  • If one of the three copies becomes corrupt, you can still recover from either of the two remaining copies
  • Even if a second copy fails during the recovery period, you still have one copy of your data to recover from
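As a rough illustration of the sizing arithmetic above, the following Python sketch computes the raw storage a cluster must provide for a given amount of usable data and replication factor. The function name is hypothetical and the figures simply restate the 200 GB example from the text; this is not a complete sizing model (it ignores intermediate data, operating system overhead, and free-space headroom).

```python
def raw_storage_needed(usable_gb, replication_factor):
    """Raw HDFS capacity (in GB) required to hold `usable_gb` of data
    when every block is stored `replication_factor` times."""
    return usable_gb * replication_factor

# Example from the text: 200 GB of usable data with a replication factor of 5
print(raw_storage_needed(200, 5))   # 1000 GB, i.e. roughly 1 TB

# The same data with Hadoop's default replication factor of 3
print(raw_storage_needed(200, 3))   # 600 GB
```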

While determining the replication factor, you need to consider the following parameters:

  • The network reliability of your Hadoop cluster
  • The probability of failure of a node in a given network
  • The cost of increasing the replication factor by one
  • The number of nodes or VMs that will make up your cluster

If you are building a Hadoop cluster with three nodes, a replication factor of 4 does not make sense. Similarly, if the network is not reliable, a higher replication factor lets the NameNode direct reads to a copy on a nearby available node. For systems with a higher probability of node failure, the risk of losing data is also higher, because the chance of a second node failing before the first replica is restored increases.
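The following Python sketch is one hedged way to sanity-check a replication factor against the parameters listed above. The helper name and the simple rule it encodes (never request more replicas than there are nodes, starting from the default of 3) are assumptions for illustration, not an official Hadoop formula.

```python
def choose_replication_factor(num_nodes, desired=3):
    """Pick a replication factor the cluster can actually satisfy.

    `desired` defaults to Hadoop's default replication factor of 3.
    A factor larger than the number of nodes is pointless, because a
    single node never stores two replicas of the same block.
    """
    if num_nodes < 1:
        raise ValueError("cluster must have at least one node")
    return min(desired, num_nodes)

# A three-node cluster cannot usefully hold 4 replicas of a block
print(choose_replication_factor(num_nodes=3, desired=4))   # 3

# A larger cluster can keep the requested factor
print(choose_replication_factor(num_nodes=10, desired=3))  # 3
```

In a real deployment, the chosen value is applied cluster-wide through the dfs.replication property in hdfs-site.xml, or per path with the hdfs dfs -setrep command.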
