
Data locality

The key to good data processing performance is avoiding network transfers. While this was universally true a couple of years ago, it is less relevant for tasks with high CPU demand and low I/O; for data processing algorithms with low CPU demand and high I/O demand, however, it still holds.

We can conclude from this that HDFS is one of the best ways to achieve data locality, as chunks of files are distributed across the cluster nodes, in most cases on hard drives directly attached to the server systems. This means that those chunks can be processed in parallel using the CPUs of the machines where the individual data chunks are located, avoiding network transfer.
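The scheduling idea behind this can be sketched as follows. This is a minimal, hypothetical simulation (not HDFS or Spark client code): block IDs, node names, and the `schedule` function are all made up for illustration. Each chunk (block) is replicated on a subset of nodes, and a locality-aware scheduler prefers to run a task on a node that already stores the block, so no data has to cross the network.

```python
# Hypothetical sketch of locality-aware task scheduling over replicated
# blocks; block and node names are illustrative, not real HDFS metadata.

BLOCK_REPLICAS = {
    "block-0": ["node-1", "node-2"],
    "block-1": ["node-2", "node-3"],
    "block-2": ["node-3", "node-1"],
}

def schedule(blocks, free_nodes):
    """Assign each block's task to a node holding a replica, if possible."""
    assignments = {}
    for block, replicas in blocks.items():
        local = [n for n in replicas if n in free_nodes]
        # Node-local execution avoids the network entirely; otherwise we
        # fall back to any free node and accept a remote read.
        assignments[block] = local[0] if local else free_nodes[0]
    return assignments

assignments = schedule(BLOCK_REPLICAS, ["node-1", "node-2", "node-3"])
print(assignments)
```

In this toy run every task lands on a node that stores its block, which is the best case the real scheduler also aims for; when all replica nodes are busy, a remote read is the unavoidable fallback.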

Another way to achieve data locality is to use Apache SparkSQL. Depending on the connector implementation, SparkSQL can make use of the data processing capabilities of the source engine. For example, when using MongoDB in conjunction with SparkSQL, parts of the SQL statement are preprocessed by MongoDB before the data is sent upstream to Apache Spark.
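The mechanism at work here is predicate pushdown, and its effect can be sketched without a real database. The following is a hypothetical simulation, assuming a plain Python list stands in for the source engine: `scan_without_pushdown` ships every record upstream and filters afterwards, while `scan_with_pushdown` lets the source evaluate the predicate first, so only matching rows cross the (simulated) network.

```python
# Hypothetical sketch of predicate pushdown; the "source engine" is a
# plain list, not MongoDB, and the function names are illustrative.

SOURCE_ROWS = [
    {"name": "alice", "age": 34},
    {"name": "bob", "age": 19},
    {"name": "carol", "age": 42},
]

def scan_without_pushdown(predicate):
    """Transfer everything, then filter on the processing side."""
    transferred = list(SOURCE_ROWS)          # all rows cross the network
    return [r for r in transferred if predicate(r)], len(transferred)

def scan_with_pushdown(predicate):
    """Source engine applies the predicate before sending data upstream."""
    transferred = [r for r in SOURCE_ROWS if predicate(r)]
    return transferred, len(transferred)     # only matches cross the network

pred = lambda r: r["age"] >= 30
rows_all, sent_all = scan_without_pushdown(pred)
rows_pushed, sent_pushed = scan_with_pushdown(pred)
print(sent_all, sent_pushed)
```

Both scans return the same result set, but the pushdown variant transfers fewer rows; that difference is exactly what a connector exploits when it translates parts of a SQL statement into a query the source engine runs itself.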
