Hadoop Distributed File System

You might consider using an alternative to HDFS, depending upon your cluster requirements. For instance, IBM offers GPFS (General Parallel File System) for improved performance.

GPFS might be a better choice because it comes from a high-performance computing background and offers full read-write capability, whereas HDFS is designed as a write-once, read-many filesystem. It also improves on HDFS performance because it runs at the kernel level, whereas HDFS runs in a Java Virtual Machine (JVM) that in turn runs as an operating system process. GPFS integrates with Hadoop and the Spark cluster tools, and IBM runs setups with several hundred petabytes using it.

Another commercial alternative is the MapR file system, which, besides performance improvements, supports mirroring, snapshots, and high availability.

Ceph is an open source alternative: like HDFS, it is a distributed, fault-tolerant, and self-healing filesystem for commodity hard drives. It runs in the Linux kernel as well and addresses many of the performance issues that HDFS has. Other promising candidates in this space are Alluxio (formerly Tachyon), the Quantcast File System (QFS), GlusterFS, and Lustre.

Finally, Cassandra is not a filesystem but a NoSQL key-value store. It is tightly integrated with Apache Spark and is therefore regarded as a valid and powerful alternative to HDFS, or even to any other distributed filesystem, especially as it supports predicate push-down using Apache Spark SQL and the Catalyst optimizer, which we will cover in the following chapters.
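To make the idea of predicate push-down more concrete, here is a minimal sketch in plain Python. It is a toy model only: `ToyStore`, `query_without_pushdown`, and `query_with_pushdown` are hypothetical names invented for this illustration and are not part of the Spark or Cassandra APIs. The sketch assumes that the expensive step is shipping rows from the store to the client, and it counts how many rows are shipped in each case.

```python
from dataclasses import dataclass


@dataclass
class ToyStore:
    """Hypothetical stand-in for a remote data store.

    It counts every row it ships to the client, which is the cost
    that predicate push-down tries to reduce.
    """
    rows: list
    rows_shipped: int = 0

    def scan(self, predicate=None):
        # If a predicate is supplied, it is evaluated inside the store,
        # so only matching rows are shipped to the caller.
        for row in self.rows:
            if predicate is None or predicate(row):
                self.rows_shipped += 1
                yield row


def query_without_pushdown(store, predicate):
    # The store ships every row; the client filters afterwards.
    return [row for row in store.scan() if predicate(row)]


def query_with_pushdown(store, predicate):
    # The filter is handed to the store; only matches are shipped.
    return list(store.scan(predicate))


# 1,000 sample rows, of which every tenth has country "DE".
rows = [{"id": i, "country": "DE" if i % 10 == 0 else "US"} for i in range(1000)]


def is_german(row):
    return row["country"] == "DE"


plain = ToyStore(rows)
pushed = ToyStore(rows)

result_plain = query_without_pushdown(plain, is_german)
result_pushed = query_with_pushdown(pushed, is_german)

print(plain.rows_shipped, pushed.rows_shipped)
```

Both queries return the same rows, but in this toy model the pushed-down variant ships only the matching rows. In a real deployment, the decision about which filters can be handed to the data source is made by the query optimizer and the connector, not by hand-written code like this.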
