Mastering Apache Spark 2.x (Second Edition)
Romeo Kienzler
Data locality
The key to good data processing performance is avoiding network transfers. This was very true a couple of years ago and is less relevant for tasks with high CPU demand and low I/O, but for data processing algorithms with low CPU demand and high I/O demand, it still holds.
We can conclude from this that HDFS is one of the best ways to achieve data locality: chunks of files are distributed across the cluster nodes, in most cases on hard drives directly attached to the server systems. Those chunks can therefore be processed in parallel by the CPUs of the machines on which the individual data chunks reside, avoiding network transfer.
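As a minimal sketch of how this looks in practice (assuming a Spark 2.x cluster backed by HDFS; the path hdfs:///data/logs.txt is a made-up placeholder), reading a file from HDFS yields roughly one partition per block, and the scheduler prefers to run each task on a node that stores the corresponding block:

```scala
import org.apache.spark.sql.SparkSession

object HdfsLocalityExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("HdfsLocalityExample")
      .getOrCreate()

    // Reading from HDFS: each HDFS block becomes (at least) one partition,
    // and the scheduler tries to run each task on a node holding that block.
    val lines = spark.sparkContext.textFile("hdfs:///data/logs.txt")

    // The filtering runs in parallel on the nodes where the blocks reside;
    // only the small per-partition counts travel over the network.
    val errorCount = lines.filter(_.contains("ERROR")).count()
    println(s"Number of error lines: $errorCount")

    spark.stop()
  }
}
```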
Another way to achieve data locality is to use Apache SparkSQL. Depending on the connector implementation, SparkSQL can make use of the data processing capabilities of the source engine. For example, when MongoDB is used in conjunction with SparkSQL, parts of the SQL statement are preprocessed by MongoDB before the data is sent upstream to Apache Spark.
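The following sketch illustrates that idea with the MongoDB Spark connector (the connection URI, database, and collection names are hypothetical placeholders). When a filter is applied to the resulting DataFrame, the connector can translate it into a MongoDB query, so the filtering is preprocessed by MongoDB and only matching documents are sent to Spark:

```scala
import org.apache.spark.sql.SparkSession

object MongoPushdownExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("MongoPushdownExample")
      // Hypothetical connection string; adjust host, database, and collection.
      .config("spark.mongodb.input.uri", "mongodb://localhost:27017/shop.orders")
      .getOrCreate()

    // Load the collection as a DataFrame via the MongoDB Spark connector.
    val orders = spark.read
      .format("com.mongodb.spark.sql.DefaultSource")
      .load()

    // Depending on the connector implementation, this filter is pushed down
    // to MongoDB, so only matching documents are sent upstream to Spark.
    val bigOrders = orders.filter("amount > 1000")
    bigOrders.explain() // the physical plan should show the pushed-down filter
    println(bigOrders.count())

    spark.stop()
  }
}
```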