- Apache Spark Quick Start Guide
- Shrey Mehrotra, Akash Grade
- 2021-07-02 13:39:55
Spark machine learning
Running a machine-learning algorithm is difficult when the data is distributed across multiple machines. A calculation may depend on a data point that is stored or processed on a different executor. Data can be shuffled across executors or workers, but a shuffle carries a heavy cost. Spark provides a way to avoid shuffling the same data repeatedly: caching. Spark's ability to keep a large amount of data in memory makes it easier to write machine-learning algorithms, which typically pass over the same data set many times.
Spark MLlib and ML are Spark's packages for working with machine-learning algorithms. They provide the following:
- Built-in machine-learning algorithms for classification, regression, clustering, and more
- Features such as pipelining, vector creation, and more
These algorithms and features are optimized to limit data shuffling and to scale across the cluster.