Hadoop Distributed File System

Depending on your cluster requirements, you might consider using an alternative to HDFS. For instance, IBM offers GPFS (General Parallel File System) for improved performance.

GPFS might be a better choice because, coming from a high-performance computing background, it offers full read-write capability, whereas HDFS is designed as a write-once, read-many filesystem. It also performs better than HDFS because it runs at the kernel level, whereas HDFS runs in a Java Virtual Machine (JVM), which in turn runs as an operating system process. GPFS integrates with Hadoop and the Spark cluster tools. IBM runs setups with several hundred petabytes using GPFS.
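The difference between the two access models can be sketched with a toy in-memory example. This is only an illustration under simplifying assumptions: the class names `WriteOnceFile` and `ReadWriteFile` are hypothetical, and neither class reflects the real HDFS or GPFS API.

```python
class WriteOnceFile:
    """Toy model of write-once, read-many access (hypothetical, not the HDFS API).

    Data can be added at the end and read from any offset,
    but bytes that are already written cannot be changed in place.
    """

    def __init__(self) -> None:
        self._data = bytearray()

    def append(self, chunk: bytes) -> None:
        # New data may only be added at the end of the file.
        self._data.extend(chunk)

    def read(self, offset: int, length: int) -> bytes:
        # Reads may start at any offset.
        return bytes(self._data[offset:offset + length])


class ReadWriteFile(WriteOnceFile):
    """Toy model of full read-write access (hypothetical, not the GPFS API)."""

    def write_at(self, offset: int, chunk: bytes) -> None:
        # In-place update: overwrite existing bytes starting at `offset`.
        self._data[offset:offset + len(chunk)] = chunk


log_file = WriteOnceFile()
log_file.append(b"hello world")

table_file = ReadWriteFile()
table_file.append(b"hello world")
table_file.write_at(0, b"HELLO")  # only possible in the read-write model
```

In the write-once model, changing existing content means writing a new file, which suits large, immutable datasets but not workloads that update records in place.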

Another commercial alternative is the MapR file system, which, besides offering performance improvements, supports mirroring, snapshots, and high availability.

Ceph is an open source alternative. Like HDFS, it is a distributed, fault-tolerant, and self-healing filesystem for commodity hard drives. It runs in the Linux kernel as well and addresses many of the performance issues that HDFS has. Other promising candidates in this space are Alluxio (formerly Tachyon), the Quantcast File System (QFS), GlusterFS, and Lustre.

Finally, Cassandra is not a filesystem but a NoSQL key-value store that is tightly integrated with Apache Spark. It is therefore regarded as a valid and powerful alternative to HDFS, or even to any other distributed filesystem, especially as it supports predicate push-down using Apache Spark SQL and the Catalyst optimizer, which we will cover in the following chapters.
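The idea behind predicate push-down can be illustrated with a small, self-contained sketch. It does not use Spark or Cassandra. `ToyStore` is a hypothetical in-memory stand-in for a remote data source, and the `rows_shipped` counter is an assumption added only to make the effect visible.

```python
from typing import Callable, Dict, Iterable, List, Optional

Row = Dict[str, object]


class ToyStore:
    """Hypothetical in-memory stand-in for a remote data source."""

    def __init__(self, rows: Iterable[Row]) -> None:
        self._rows: List[Row] = list(rows)
        self.rows_shipped = 0  # rows that left the store and crossed the "network"

    def scan(self, predicate: Optional[Callable[[Row], bool]] = None) -> List[Row]:
        # If a predicate is supplied, it is evaluated inside the store,
        # so only matching rows are shipped to the caller.
        selected = [r for r in self._rows if predicate is None or predicate(r)]
        self.rows_shipped += len(selected)
        return selected


def is_german(row: Row) -> bool:
    return row["country"] == "DE"


rows = [{"id": i, "country": "DE" if i % 4 == 0 else "US"} for i in range(100)]

# Without push-down: fetch every row, then filter on the client side.
store_a = ToyStore(rows)
result_a = [r for r in store_a.scan() if is_german(r)]

# With push-down: hand the filter to the store, which applies it before shipping.
store_b = ToyStore(rows)
result_b = store_b.scan(predicate=is_german)
```

Both variants return the same rows, but in this toy setup the first ships all 100 rows to the client while the second ships only the 25 that match. A query optimizer that pushes filters down to the storage layer aims for the same kind of saving.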
