- Mastering Apache Spark 2.x(Second Edition)
- Romeo Kienzler
Data locality
The key to good data processing performance is avoiding network transfers. This was very true a couple of years ago and is less relevant today for tasks with high CPU demand and low I/O, but for data processing algorithms with low CPU demand and high I/O demand, it still holds.
We can conclude from this that HDFS is one of the best ways to achieve data locality, as chunks of files are distributed across the cluster nodes, in most cases on hard drives directly attached to the server systems. This means that those chunks can be processed in parallel by the CPUs of the machines where the individual data chunks are located, avoiding network transfer, as the sketch below illustrates.
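The following is a minimal sketch of this idea: when Spark reads a file from HDFS, it creates roughly one partition per HDFS block, and the scheduler prefers to run each task on a node holding that block. The HDFS path used here is a placeholder, not one from the book.

```scala
import org.apache.spark.sql.SparkSession

object HdfsLocalitySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hdfs-locality-sketch")
      .getOrCreate()

    // Reading from HDFS: by default Spark creates one partition per HDFS block,
    // and the scheduler tries to launch each task on a node that stores that
    // block (NODE_LOCAL), so the raw data never crosses the network.
    // The path below is a hypothetical example.
    val lines = spark.sparkContext.textFile("hdfs:///data/events/*.log")

    // Each partition is filtered where its block lives; only the small
    // aggregated result travels over the network back to the driver.
    val errorCount = lines.filter(_.contains("ERROR")).count()
    println(s"errors: $errorCount")

    spark.stop()
  }
}
```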
Another way to achieve data locality is to use Apache Spark SQL. Depending on the connector implementation, Spark SQL can make use of the data processing capabilities of the source engine. For example, when using MongoDB in conjunction with Spark SQL, parts of the SQL statement are preprocessed by MongoDB before the data is sent upstream to Apache Spark.
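A minimal sketch of this pushdown behavior with the MongoDB Spark connector follows. The connection URI, database, collection, and column names are placeholders, and the exact option keys depend on the connector version in use; the point is that a filter expressed in SQL can be evaluated by MongoDB itself, so only matching documents reach Spark.

```scala
import org.apache.spark.sql.SparkSession

object MongoPushdownSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("mongo-pushdown-sketch")
      .getOrCreate()

    // Read a MongoDB collection as a DataFrame via the MongoDB Spark connector.
    // URI, database, and collection are hypothetical placeholders.
    val washingDf = spark.read
      .format("com.mongodb.spark.sql.DefaultSource")
      .option("uri", "mongodb://mongo-host:27017/iot.washing")
      .load()

    washingDf.createOrReplaceTempView("washing")

    // The WHERE clause can be pushed down by the connector, so MongoDB
    // evaluates the predicate before sending data upstream to Spark.
    val hot = spark.sql("SELECT * FROM washing WHERE temperature > 100")
    hot.explain() // the physical plan's PushedFilters entry shows what was pushed down
    hot.show()

    spark.stop()
  }
}
```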