
MLlib supported methods and developer APIs

MLlib provides fast, distributed implementations of learning algorithms for classification and regression problems, including various linear models, Naive Bayes, SVMs, and ensembles of decision trees (also known as Random Forests). Alternating Least Squares (with both explicit and implicit feedback) is used for collaborative filtering. The library also supports k-means clustering and principal component analysis (PCA) for clustering and dimensionality reduction.
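To make the kind of algorithm MLlib distributes concrete, here is a minimal single-machine sketch of Lloyd's k-means (names and structure are illustrative, not MLlib's implementation); MLlib performs the same assignment and update steps as distributed map and reduce operations over data partitions:

```python
import random

def kmeans(points, k, iters=10, seed=0):
    """Lloyd's algorithm: alternate nearest-center assignment and
    center recomputation until the iteration budget runs out."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        # Update step: recompute each non-empty cluster's center as its mean.
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = tuple(sum(xs) / len(cl) for xs in zip(*cl))
    return centers

pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
centers = sorted(kmeans(pts, k=2))
```

In the distributed version, the assignment step is a map over partitions and the update step is a reduce that sums per-cluster statistics.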

The library provides some low-level primitives and basic utilities for convex optimization (http://spark.apache.org/docs/latest/mllib-optimization.html), distributed linear algebra (with support for Vector and Matrix types), statistical analysis (using Breeze as well as native functions), and feature extraction, and it supports various I/O formats, including native support for the LIBSVM format.
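The LIBSVM format mentioned above stores each record as a label followed by sparse, 1-based `index:value` pairs. A small parser sketch (the function name is hypothetical, not an MLlib API) shows the layout:

```python
def parse_libsvm_line(line, num_features):
    """Parse one LIBSVM record of the form '<label> <index>:<value> ...'.
    Indices are 1-based and sparse; absent features default to 0.0."""
    parts = line.strip().split()
    label = float(parts[0])
    features = [0.0] * num_features
    for item in parts[1:]:
        idx, val = item.split(":")
        features[int(idx) - 1] = float(val)
    return label, features

label, x = parse_libsvm_line("1 3:4.5 10:0.2", num_features=10)
```

MLlib reads such files directly into labeled points, keeping the sparse representation rather than densifying as this sketch does.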

It also supports data integration via Spark SQL as well as PMML (https://en.wikipedia.org/wiki/Predictive_Model_Markup_Language) (Guazzelli et al., 2009). You can find more information about PMML support at this link: https://spark.apache.org/docs/1.6.0/mllib-pmml-model-export.html.

Algorithmic optimizations: MLlib includes many optimizations to support efficient distributed learning and prediction.

The ALS algorithm for recommendation makes use of blocking to reduce JVM garbage collection overhead and to utilize higher-level linear algebra operations. Decision trees use ideas from the PLANET project (reference: http://dl.acm.org/citation.cfm?id=1687569), such as data-dependent feature discretization to reduce communication costs, and tree ensembles parallelize learning both within trees and across trees.
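The data-dependent feature discretization borrowed from PLANET can be sketched as quantile binning: split candidates are computed from the data itself, so workers only need to exchange per-bin statistics instead of raw feature values. A minimal sketch, with hypothetical function names:

```python
def quantile_bins(values, num_bins):
    """Compute data-dependent split candidates at quantile boundaries.
    num_bins bins require num_bins - 1 interior splits."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // num_bins] for i in range(1, num_bins)]

def discretize(value, splits):
    """Map a continuous value to a bin id using the split boundaries."""
    bin_id = 0
    for s in splits:
        if value >= s:
            bin_id += 1
    return bin_id

vals = [0.1, 0.4, 0.35, 0.8, 0.9, 0.05, 0.6, 0.7]
splits = quantile_bins(vals, num_bins=4)
bins = [discretize(v, splits) for v in vals]
```

With bins in place, a worker can summarize its partition as per-bin label counts, which is far cheaper to communicate than the underlying values.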

Generalized linear models are learned using optimization algorithms that parallelize gradient computation, using fast C++-based linear algebra libraries for worker computations. All algorithms benefit from efficient communication primitives; in particular, tree-structured aggregation prevents the driver from becoming a bottleneck.

Model updates are partially combined on a small set of executors and only then sent to the driver, reducing the load the driver has to handle. Tests showed that this reduces aggregation time by an order of magnitude, especially on datasets with a large number of partitions.
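The multi-level aggregation described above (Spark exposes it as `treeAggregate`) can be mimicked in a few lines of plain Python: per-partition results are merged level by level in small groups, so the final driver-side merge only sees a handful of inputs. A conceptual sketch, not Spark's implementation:

```python
from functools import reduce

def tree_aggregate(partition_results, combine, fanout=2):
    """Merge per-partition results level by level in groups of `fanout`,
    so no single node (the 'driver') has to merge all partitions at once."""
    level = list(partition_results)
    while len(level) > 1:
        level = [reduce(combine, level[i:i + fanout])
                 for i in range(0, len(level), fanout)]
    return level[0]

# Example: gradient-like partial sums from 8 partitions.
partials = [[i, 2 * i] for i in range(8)]
total = tree_aggregate(partials,
                       lambda a, b: [x + y for x, y in zip(a, b)],
                       fanout=2)
```

With 8 partitions and a fan-out of 2, the driver receives one pre-combined result instead of 8, which is exactly why aggregation time drops on highly partitioned datasets.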

(Reference: https://databricks.com/blog/2014/09/22/spark-1-1-mllib-performance-improvements.html)

Pipeline API: practical machine learning pipelines often involve a sequence of data preprocessing, feature extraction, model fitting, and validation stages.

Most machine learning libraries do not provide native support for the diverse set of functionalities needed for pipeline construction. When handling large-scale datasets, wiring together an end-to-end pipeline is both labor-intensive and expensive in terms of network overhead.

Leveraging Spark's ecosystem, MLlib includes a package aimed at addressing these concerns.

The spark.ml package eases the development and tuning of multistage learning pipelines by providing a uniform set of high-level APIs (http://arxiv.org/pdf/1505.06807.pdf). Its APIs let users swap their own specialized algorithms in for a standard learning approach.
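The estimator/transformer pattern that spark.ml's Pipeline formalizes can be sketched in plain Python (the class names here are illustrative, not the real spark.ml API): each stage is fitted, then its transformed output feeds the next stage.

```python
class Scaler:
    """Estimator-like stage: fit() learns the scale from the data."""
    def fit(self, data):
        self.scale = max(abs(x) for x in data) or 1.0
        return self
    def transform(self, data):
        return [x / self.scale for x in data]

class Threshold:
    """Stateless transformer-like stage: maps scaled values to 0/1 labels."""
    def __init__(self, cut=0.5):
        self.cut = cut
    def fit(self, data):
        return self
    def transform(self, data):
        return [1 if x >= self.cut else 0 for x in data]

class Pipeline:
    """Chain stages: fit each stage, then pass its output to the next one."""
    def __init__(self, stages):
        self.stages = stages
    def fit_transform(self, data):
        for stage in self.stages:
            data = stage.fit(data).transform(data)
        return data

out = Pipeline([Scaler(), Threshold(cut=0.5)]).fit_transform([1.0, 2.0, 4.0])
```

Because every stage exposes the same fit/transform interface, swapping a custom algorithm into the chain is a one-line change, which is the uniformity the spark.ml APIs provide at scale.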
