Title: Apache Spark Quick Start Guide
Authors: Shrey Mehrotra, Akash Grade
Chapter word count: 138
Last updated: 2021-07-02 13:39:55

Spark machine learning

It is difficult to run a machine-learning algorithm when your data is distributed across multiple machines. A calculation on one data point may depend on another point that is stored or processed on a different executor. Data can be shuffled across executors or workers, but a shuffle comes at a heavy cost. Spark provides a way to limit that cost: caching. Spark's ability to keep a large amount of data in memory means an algorithm that passes over the same data many times does not have to recompute or re-read it on each pass, which makes it easier to write machine-learning algorithms.

Spark MLlib and ML are Spark's packages for working with machine-learning algorithms. They provide the following:

Built-in machine-learning algorithms for classification, regression, clustering, and more
Features such as pipelining, vector creation, and more

These algorithms and features are optimized to keep data shuffling low and to scale across the cluster.
