Authors: Rajdeep Dua, Manpreet Singh Ghotra, Nick Pentreath
MLlib supported methods and developer APIs
MLlib provides fast, distributed implementations of learning algorithms, including various linear models, Naive Bayes, SVMs, and ensembles of decision trees (also known as random forests) for classification and regression problems. Alternating Least Squares (with both explicit and implicit feedback) is provided for collaborative filtering. It also supports k-means clustering and principal component analysis (PCA) for clustering and dimensionality reduction.
The library provides some low-level primitives and basic utilities for convex optimization (http://spark.apache.org/docs/latest/mllib-optimization.html), distributed linear algebra (with support for vectors and matrices), statistical analysis (using Breeze as well as native functions), and feature extraction. It supports various I/O formats, including native support for the LIBSVM format.
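To make the LIBSVM format concrete: each line holds a label followed by 1-based `index:value` pairs for the non-zero features only. The following is a minimal pure-Python sketch of a parser (the `parse_libsvm_line` helper is hypothetical, for illustration only; MLlib's own loader returns sparse vectors rather than dense lists):

```python
def parse_libsvm_line(line, num_features):
    """Parse one LIBSVM-format line into (label, dense feature list).

    LIBSVM lines look like: "1.0 3:0.5 7:2.0" -- a label followed by
    1-based index:value pairs for the non-zero features only.
    """
    parts = line.strip().split()
    label = float(parts[0])
    features = [0.0] * num_features
    for pair in parts[1:]:
        idx, value = pair.split(":")
        features[int(idx) - 1] = float(value)  # convert 1-based to 0-based
    return label, features

label, feats = parse_libsvm_line("1.0 1:0.5 4:2.0", num_features=5)
# label == 1.0, feats == [0.5, 0.0, 0.0, 2.0, 0.0]
```

The sparsity of the on-disk representation is what makes LIBSVM a natural fit for the high-dimensional, mostly-zero feature vectors common in large-scale learning.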
Algorithmic optimizations: MLlib includes many optimizations to support efficient distributed learning and prediction.
The ALS algorithm for recommendation makes use of blocking to reduce JVM garbage collection overhead and to utilize higher-level linear algebra operations. Decision trees use ideas from the PLANET project (reference: http://dl.acm.org/citation.cfm?id=1687569), such as data-dependent feature discretization to reduce communication costs, and tree ensembles parallelize learning both within trees and across trees.
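The key PLANET idea referenced above is that a continuous feature can be reduced to a small, data-dependent set of bins, so workers only need to communicate per-bin statistics instead of raw values. A minimal pure-Python sketch of quantile-based discretization (the helper names are hypothetical; MLlib's actual implementation works on distributed sampled data):

```python
def quantile_thresholds(values, num_bins):
    """Pick (num_bins - 1) candidate split thresholds at quantile
    boundaries, reducing a continuous feature to a few bins."""
    ordered = sorted(values)
    step = len(ordered) / num_bins
    return [ordered[int(step * i)] for i in range(1, num_bins)]

def discretize(value, thresholds):
    """Map a continuous value to its 0-based bin index."""
    bin_idx = 0
    for t in thresholds:
        if value >= t:
            bin_idx += 1
    return bin_idx

thresholds = quantile_thresholds(list(range(100)), num_bins=4)
# thresholds == [25, 50, 75]; discretize(60, thresholds) == 2
```

With only bin counts crossing the network, the communication cost per split decision depends on the number of bins, not the number of training rows.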
Generalized linear models are learned using optimization algorithms that parallelize gradient computation, using fast C++-based linear algebra libraries for worker computations. All algorithms benefit from efficient communication primitives; in particular, tree-structured aggregation prevents the driver from being a bottleneck.
Model updates are combined partially on a small set of executors before being sent to the driver, reducing the load the driver has to handle. Tests showed that these primitives reduce the aggregation time by an order of magnitude, especially on datasets with a large number of partitions.
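The tree-structured aggregation described above can be sketched in plain Python: partial results are merged in small groups, level by level, so no single node has to combine everything at once. This is a simplified, single-process sketch of the idea behind Spark's `treeAggregate` (the function below is illustrative, not the Spark API):

```python
from functools import reduce

def tree_aggregate(partitions, seq_op, comb_op, zero, fan_in=2):
    """Aggregate partition-level partial results in groups of at most
    `fan_in`, level by level, so the final combiner (the 'driver')
    only ever merges a handful of values."""
    # Stage 1: each partition folds its own records into a partial result.
    partials = [reduce(seq_op, part, zero) for part in partitions]
    # Stage 2: repeatedly merge small groups until one result remains.
    while len(partials) > 1:
        partials = [
            reduce(comb_op, partials[i:i + fan_in])
            for i in range(0, len(partials), fan_in)
        ]
    return partials[0]

partitions = [[1, 2, 3], [4, 5], [6], [7, 8, 9]]
total = tree_aggregate(partitions, lambda a, b: a + b,
                       lambda a, b: a + b, 0)
# total == 45
```

In Spark the intermediate merge levels run on executors, so with thousands of partitions the driver receives only a few pre-combined model updates instead of one per partition.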
Pipeline API: practical machine learning pipelines often involve a sequence of data preprocessing, feature extraction, model fitting, and validation stages.
Most machine learning libraries do not provide native support for the diverse functionality needed to construct such pipelines. When handling large-scale datasets, wiring together an end-to-end pipeline by hand is both labor-intensive and expensive in terms of network overhead.
Leveraging Spark's ecosystem, MLlib includes a package aimed at addressing these concerns.
The spark.ml package eases the development and tuning of multistage learning pipelines by providing a uniform set of high-level APIs (http://arxiv.org/pdf/1505.06807.pdf). It includes APIs that let users swap in their own specialized algorithms in place of a standard learning approach.
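The uniformity of these APIs comes from every stage exposing the same `fit`/`transform` contract, so stages can be chained and swapped freely. The following toy classes are a hypothetical pure-Python sketch of that pattern operating on plain lists, not the actual spark.ml classes (which operate on DataFrames):

```python
class Scaler:
    """Toy Estimator: learns the max value, then scales into [0, 1]."""
    def fit(self, data):
        self.max_ = max(data)
        return self

    def transform(self, data):
        return [x / self.max_ for x in data]

class Threshold:
    """Toy Transformer: nothing to learn, just binarizes at 0.5."""
    def fit(self, data):
        return self

    def transform(self, data):
        return [1 if x >= 0.5 else 0 for x in data]

class Pipeline:
    """Chains stages: fit each stage on the output of the previous one."""
    def __init__(self, stages):
        self.stages = stages

    def fit_transform(self, data):
        for stage in self.stages:
            data = stage.fit(data).transform(data)
        return data

result = Pipeline([Scaler(), Threshold()]).fit_transform([1.0, 2.0, 4.0])
# result == [0, 1, 1]
```

Because every stage honors the same interface, replacing `Scaler` with a specialized preprocessor requires no change to the pipeline itself, which is exactly the swappability the spark.ml design enables.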