Mastering Apache Spark 2.x (Second Edition)
Romeo Kienzler
Data locality
The key to good data processing performance is the avoidance of network transfers. While this was universally true a couple of years ago, it is now less relevant for tasks with high CPU demand and low I/O, but it still holds for data processing algorithms with low CPU demand and high I/O demand.
We can conclude from this that HDFS is one of the best ways to achieve data locality, as chunks of files are distributed across the cluster nodes, in most cases on hard drives directly attached to the server systems. Those chunks can then be processed in parallel by the CPUs of the machines where the individual data chunks are stored, avoiding network transfer.
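As a minimal sketch of this behavior, the following Scala snippet reads a CSV file from HDFS; Spark asks the HDFS NameNode where the file's blocks are stored and tries to schedule tasks on the nodes holding those blocks (the Spark UI reports the locality level achieved per task, such as NODE_LOCAL or RACK_LOCAL). The path and the namenode host are assumptions for illustration:

```scala
import org.apache.spark.sql.SparkSession

object HdfsLocalityExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("HdfsLocalityExample")
      .getOrCreate()

    // hdfs://namenode:8020/data/events.csv is an assumed, hypothetical path
    val df = spark.read
      .option("header", "true")
      .csv("hdfs://namenode:8020/data/events.csv")

    // Each partition corresponds roughly to an HDFS block; tasks reading
    // these partitions are preferentially placed on the nodes that store them
    println(s"partitions: ${df.rdd.getNumPartitions}")

    spark.stop()
  }
}
```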
Another way to achieve data locality is to use Apache SparkSQL. Depending on the connector implementation, SparkSQL can make use of the data processing capabilities of the source engine. For example, when using MongoDB in conjunction with SparkSQL, parts of the SQL statement are preprocessed by MongoDB before the data is sent upstream to Apache Spark.
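The sketch below illustrates this with the MongoDB Spark connector from the Spark 2.x era, assuming a MongoDB instance reachable at the URI shown; the database, collection, and column names are hypothetical. The filter is pushed down into a MongoDB aggregation pipeline stage, so only matching documents leave the source engine:

```scala
import org.apache.spark.sql.SparkSession

object MongoPushdownExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("MongoPushdownExample")
      // assumed local MongoDB with a hypothetical shop.orders collection
      .config("spark.mongodb.input.uri",
        "mongodb://localhost:27017/shop.orders")
      .getOrCreate()

    val orders = spark.read
      .format("com.mongodb.spark.sql.DefaultSource")
      .load()

    // The connector translates this predicate into a MongoDB query stage,
    // so filtering happens inside MongoDB rather than in Spark
    val bigOrders = orders.filter("amount > 100")

    // explain() prints the physical plan, including the pushed filters
    bigOrders.explain()

    spark.stop()
  }
}
```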