Hadoop Distributed File System

Depending on your cluster requirements, you might consider an alternative to HDFS. For instance, IBM offers GPFS (General Parallel File System) for improved performance.

GPFS might be a better choice because it comes from a high-performance computing background and offers full read-write capability, whereas HDFS is designed as a write-once, read-many filesystem. It also performs better than HDFS because it runs at the kernel level, whereas HDFS runs in a Java Virtual Machine (JVM), which in turn runs as an operating system process. In addition, it integrates with Hadoop and the Spark cluster tools. IBM runs setups with several hundred petabytes using GPFS.

Another commercial alternative is the MapR file system, which, besides performance improvements, supports mirroring, snapshots, and high availability.

Ceph is an open source alternative. Like HDFS, it is a distributed, fault-tolerant, and self-healing filesystem for commodity hard drives. It runs in the Linux kernel as well and addresses many of the performance issues that HDFS has. Other promising candidates in this space are Alluxio (formerly Tachyon), Quantcast File System (QFS), GlusterFS, and Lustre.

Finally, Cassandra is not a filesystem but a NoSQL key-value store. Because it is tightly integrated with Apache Spark, it is regarded as a valid and powerful alternative to HDFS, or even to any other distributed filesystem, especially as it supports predicate push-down through Apache Spark SQL and the Catalyst optimizer, which we will cover in the following chapters.
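
To get a feel for why predicate push-down matters, the following is a minimal toy model in plain Python. It is not the Spark or Cassandra API: `ToyStore`, `rows_shipped`, and the two query helpers are hypothetical names invented for this illustration. The sketch only assumes that a store can evaluate a filter itself and that the cost we care about is the number of rows shipped to the compute engine.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Row = Dict[str, int]
Predicate = Callable[[Row], bool]


@dataclass
class ToyStore:
    """Hypothetical stand-in for a remote data store (not a real API)."""
    rows: List[Row]
    rows_shipped: int = 0  # rows sent from the store to the compute engine

    def scan(self, predicate: Optional[Predicate] = None) -> List[Row]:
        # If a predicate is supplied, the store filters before shipping rows.
        result = [r for r in self.rows if predicate is None or predicate(r)]
        self.rows_shipped += len(result)
        return result


def query_without_pushdown(store: ToyStore, predicate: Predicate) -> List[Row]:
    # The engine pulls every row and filters on its own side.
    return [r for r in store.scan() if predicate(r)]


def query_with_pushdown(store: ToyStore, predicate: Predicate) -> List[Row]:
    # The predicate travels to the store; only matching rows come back.
    return store.scan(predicate)


def is_sensor_3(row: Row) -> bool:
    return row["sensor"] == 3


# 1,000 made-up readings spread evenly over 10 sensors.
readings = [{"sensor": i % 10, "value": i} for i in range(1000)]

plain_store = ToyStore(readings)
pushdown_store = ToyStore(readings)

plain_result = query_without_pushdown(plain_store, is_sensor_3)
pushdown_result = query_with_pushdown(pushdown_store, is_sensor_3)

print(plain_result == pushdown_result)   # same answer either way
print(plain_store.rows_shipped)          # rows shipped without push-down
print(pushdown_store.rows_shipped)       # rows shipped with push-down
```

Both queries return the same rows, but in this toy setup the push-down variant ships only the matching tenth of the data. In a real cluster, the saving would show up as reduced network and deserialization work rather than as a counter.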
