Data locality

The key to good data processing performance is avoiding network transfers. A few years ago this was true across the board; today it matters less for workloads that are CPU-bound with little I/O, but for data processing algorithms with low CPU demand and heavy I/O, it still holds.

HDFS is therefore one of the best ways to achieve data locality: chunks of files are distributed across the cluster nodes, in most cases on hard drives directly attached to the server systems. Those chunks can then be processed in parallel by the CPUs of the machines where the individual data chunks reside, avoiding network transfer.
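The scheduling idea behind this can be sketched in plain Python (a simplified illustration, not the actual HDFS or Spark scheduler; block and node names are hypothetical): given the replica locations of each block, a locality-aware scheduler prefers to run a task on a node that already stores the block, and only falls back to a remote node, at the cost of a network transfer, when no local node is available.

```python
def assign_tasks(block_locations, free_nodes):
    """Assign each block to a node that stores a replica of it, if possible.

    block_locations: dict mapping block id -> list of nodes holding a replica
    free_nodes: set of nodes with spare processing capacity
    Returns a dict mapping block id -> chosen node.
    """
    assignment = {}
    for block, replicas in block_locations.items():
        # Prefer a node that already holds the data (a node-local task).
        local = [node for node in replicas if node in free_nodes]
        if local:
            assignment[block] = local[0]
        else:
            # No replica holder is free: fall back to any node,
            # which means the chunk travels over the network.
            assignment[block] = next(iter(free_nodes))
    return assignment

# Hypothetical three-node cluster; each block is replicated on two nodes.
locations = {
    "blk_1": ["node1", "node2"],
    "blk_2": ["node2", "node3"],
    "blk_3": ["node1", "node3"],
}
print(assign_tasks(locations, {"node1", "node2", "node3"}))
```

With all nodes free, every block is assigned to one of its replica holders, so no data leaves its machine.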

Another way to achieve data locality is to use Apache Spark SQL. Depending on the connector implementation, Spark SQL can make use of the data processing capabilities of the source engine. For example, when using MongoDB in conjunction with Spark SQL, parts of the SQL statement are preprocessed by MongoDB before the data is sent upstream to Apache Spark.
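The idea of this preprocessing, usually called predicate pushdown, can be illustrated with a small sketch (not the actual MongoDB connector API; the function and operator mapping here are hypothetical): the connector translates a simple SQL comparison into a MongoDB query document, so the predicate is evaluated by MongoDB before any rows cross the network.

```python
def pushdown_filter(column, op, value):
    """Translate a simple SQL comparison into a MongoDB query document.

    Only a few comparison operators are handled; a predicate that cannot
    be translated is returned as None, meaning Spark would evaluate it
    itself after the data has been transferred.
    """
    mongo_ops = {"=": "$eq", ">": "$gt", "<": "$lt", ">=": "$gte", "<=": "$lte"}
    if op not in mongo_ops:
        return None  # not pushable: filtering happens on the Spark side
    return {column: {mongo_ops[op]: value}}

# WHERE age >= 21 becomes a server-side MongoDB match document...
print(pushdown_filter("age", ">=", 21))      # {'age': {'$gte': 21}}
# ...while an untranslatable predicate stays on the Spark side.
print(pushdown_filter("name", "LIKE", "A%"))  # None
```

The design point is that the less selective work Spark has to do after the transfer, the more the source engine's locality is exploited; real connectors apply the same principle to projections and more complex predicates.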