High availability and fault tolerance

One of the major advantages of Hadoop is the high availability of a cluster. However, high availability also means provisioning additional nodes and storage, which affects sizing. The raw storage a cluster needs is directly proportional to the HDFS Data Replication Factor (DRF); for example, if you have 200 GB of usable data and you need a high replication factor of 5 (meaning the cluster keeps five copies of each data block), then you need to size for 200 GB x 5, which equals 1,000 GB, or about 1 TB. The default value of DRF in Hadoop is 3. A replication value of 3 works well because:

  • If one copy becomes corrupt, you can recover from either of the two remaining copies
  • Even if a second copy fails during the recovery period, you still have one copy of your data to recover from
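
The sizing arithmetic described earlier can be sketched in a few lines of Python. This is only an illustration: the function name raw_storage_gb is hypothetical, and the calculation ignores other overheads (temporary data, operating system, headroom for growth) that a real sizing exercise would include.

```python
def raw_storage_gb(usable_gb: float, replication_factor: int = 3) -> float:
    """Return the raw HDFS capacity (in GB) needed to hold `usable_gb` of data.

    Hypothetical helper for illustration; the default of 3 mirrors the
    default replication factor mentioned in the text.
    """
    if usable_gb < 0:
        raise ValueError("usable_gb must be non-negative")
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    # Every block is stored replication_factor times in the cluster.
    return usable_gb * replication_factor


if __name__ == "__main__":
    # The example from the text: 200 GB of usable data, replication factor 5.
    print(raw_storage_gb(200, 5))  # 1000 GB, roughly 1 TB
    # The same data at the default replication factor of 3.
    print(raw_storage_gb(200))     # 600 GB
```

Moving from a replication factor of 3 to 5 for the same 200 GB raises the raw storage requirement from 600 GB to 1,000 GB, which is the cost side of the trade-off.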

While determining the replication factor, you need to consider the following parameters:

  • The network reliability of your Hadoop cluster
  • The probability of failure of a node in a given network
  • The cost of increasing the replication factor by one
  • The number of nodes or VMs that will make up your cluster

If you are building a Hadoop cluster with three nodes, a replication factor of 4 does not make sense, because HDFS stores each copy of a block on a different data node and there are only three nodes to hold them. Similarly, if a network is not reliable, extra copies allow the name node to direct access to a copy on a nearby available node. For systems with higher failure probabilities, the risk of losing data is higher, given that the probability of a second node failing during recovery increases.
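
To make these trade-offs concrete, here is a rough Python sketch. It rests on a strong simplifying assumption that is not from the text: each node holding a copy fails independently, with the same probability, during the recovery window. Real clusters see correlated failures (for example, a whole rack going down), so treat the numbers as an intuition aid only. Both function names are hypothetical.

```python
def effective_replication(requested: int, data_nodes: int) -> int:
    """Copies that can actually be placed: at most one per data node."""
    if requested < 1 or data_nodes < 1:
        raise ValueError("requested and data_nodes must be at least 1")
    return min(requested, data_nodes)


def block_loss_probability(p_node_failure: float, replication_factor: int) -> float:
    """Chance that every copy of a block is lost in one recovery window.

    Assumes independent node failures, each with probability p_node_failure.
    """
    if not 0.0 <= p_node_failure <= 1.0:
        raise ValueError("p_node_failure must be between 0 and 1")
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    # All copies must fail for the block to be lost.
    return p_node_failure ** replication_factor


if __name__ == "__main__":
    # A three-node cluster cannot hold four distinct copies of a block.
    print(effective_replication(requested=4, data_nodes=3))
    # Compare a reliable and an unreliable environment at several factors.
    for p in (0.01, 0.10):
        for rf in (2, 3, 4):
            print(f"p={p:.2f} rf={rf} loss={block_loss_probability(p, rf):.2e}")
```

Under this model, each additional copy multiplies the loss probability by the per-node failure probability, so the benefit of raising the replication factor by one is largest, in absolute terms, when nodes fail often. That benefit has to be weighed against the extra raw storage each copy requires.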