
Data locality

The key to good data processing performance is avoiding network transfers. A few years ago this was true almost universally; today it is less relevant for tasks with high CPU demand and low I/O, but for data processing algorithms with low CPU demand and high I/O demand, it still holds.

From this we can conclude that HDFS is one of the best ways to achieve data locality: chunks of files are distributed across the cluster nodes, in most cases on hard drives directly attached to the server systems. Those chunks can then be processed in parallel by the CPUs of the machines where the individual data chunks reside, avoiding network transfer altogether.
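The scheduling idea behind this can be sketched in a few lines of plain Python. The block-to-node map and node names below are hypothetical, and the assignment logic is a simplified stand-in for what an HDFS-aware scheduler does, not actual Hadoop code:

```python
# Minimal sketch of locality-aware task placement in the spirit of
# HDFS/MapReduce scheduling. Block locations and node names are invented.

# Which cluster nodes hold a replica of each file chunk (block).
block_locations = {
    "block-0": ["node-a", "node-b"],
    "block-1": ["node-b", "node-c"],
    "block-2": ["node-a", "node-c"],
}

def assign_tasks(block_locations):
    """Run each block's task on a node that already stores the block,
    so the data never has to cross the network (data locality)."""
    load = {}        # number of tasks assigned to each node so far
    assignment = {}
    for block, replicas in sorted(block_locations.items()):
        # Among the nodes holding a replica, pick the least-loaded one.
        node = min(replicas, key=lambda n: load.get(n, 0))
        assignment[block] = node
        load[node] = load.get(node, 0) + 1
    return assignment

print(assign_tasks(block_locations))
```

Every task lands on a node that holds a local replica of its block, so the map phase reads exclusively from directly attached disks.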

Another way to achieve data locality is to use Apache SparkSQL. Depending on the connector implementation, SparkSQL can make use of the data processing capabilities of the source engine. For example, when using MongoDB in conjunction with SparkSQL, parts of the SQL statement are preprocessed by MongoDB before the data is sent upstream to Apache Spark.
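The mechanism at work here is often called predicate pushdown. The following is an illustrative sketch only, not the real MongoDB connector API: a mock source engine evaluates the filter itself, so only matching rows travel over the network to Spark.

```python
# Illustrative sketch of predicate pushdown (the class and rows are
# invented for demonstration, not part of any real connector API).

class SourceEngine:
    """Stands in for a source database such as MongoDB."""
    def __init__(self, rows):
        self.rows = rows

    def scan(self, predicate=None):
        # With pushdown, the predicate runs here, inside the source engine.
        if predicate is None:
            return list(self.rows)               # full scan: every row shipped
        return [r for r in self.rows if predicate(r)]

engine = SourceEngine([
    {"city": "Berlin", "temp": 21},
    {"city": "Madrid", "temp": 35},
    {"city": "Oslo",   "temp": 12},
])

# Without pushdown: all rows cross the network; Spark would filter later.
shipped_all = engine.scan()

# With pushdown: the source preprocesses "temp > 20" and ships only matches.
shipped_filtered = engine.scan(lambda r: r["temp"] > 20)

print(len(shipped_all), len(shipped_filtered))  # 3 2
```

In real Spark SQL, the same effect is achieved through the data source interface: filters attached to a DataFrame can be handed to the connector, which translates them into the source engine's native query language.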
