
MLlib supported methods and developer APIs

MLlib provides fast and distributed implementations of learning algorithms, including various linear models, Naive Bayes, SVMs, and ensembles of decision trees (also known as random forests) for classification and regression problems. Alternating Least Squares (with explicit and implicit feedback) is used for collaborative filtering. It also supports k-means clustering and principal component analysis (PCA) for clustering and dimensionality reduction.
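To make the clustering support concrete, here is a minimal pure-Python sketch of Lloyd's k-means iteration, the algorithm that MLlib distributes across partitions (the real implementation lives in `org.apache.spark.mllib.clustering.KMeans` and uses k-means|| initialization; this toy version on 1-D points is for illustration only):

```python
def kmeans(points, k, iters=10):
    # Initialize centroids with the first k points (MLlib uses k-means||).
    centroids = points[:k]
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: (p - centroids[c]) ** 2)
            clusters[i].append(p)
        # Update step: move each centroid to its cluster's mean.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

centroids = kmeans([1.0, 1.1, 0.9, 10.0, 10.2, 9.8], k=2)
```

In MLlib the assignment step runs in parallel over data partitions and only per-cluster sums and counts are shipped back for the update step.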

The library provides some low-level primitives and basic utilities for convex optimization (http://spark.apache.org/docs/latest/mllib-optimization.html), distributed linear algebra (with support for Vectors and Matrix), statistical analysis (using Breeze and also native functions), and feature extraction, and supports various I/O formats, including native support for LIBSVM format.
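The LIBSVM format mentioned above is a plain-text sparse-vector layout: each line is `<label> <index1>:<value1> <index2>:<value2> ...` with 1-based, ascending feature indices. A hand-rolled parser for one such line shows the structure (MLlib itself loads whole files via `MLUtils.loadLibSVMFile`; this standalone sketch is only to illustrate the format):

```python
def parse_libsvm_line(line):
    # "<label> <idx>:<val> <idx>:<val> ..." -> label plus sparse vector.
    parts = line.strip().split()
    label = float(parts[0])
    indices, values = [], []
    for item in parts[1:]:
        idx, val = item.split(":")
        indices.append(int(idx) - 1)   # LIBSVM indices are 1-based
        values.append(float(val))
    return label, indices, values

label, idx, vals = parse_libsvm_line("1.0 3:0.5 7:1.2")
```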

It also supports data integration via Spark SQL as well as PMML (https://en.wikipedia.org/wiki/Predictive_Model_Markup_Language) (Guazzelli et al., 2009). You can find more information about PMML support at this link: https://spark.apache.org/docs/1.6.0/mllib-pmml-model-export.html.

Algorithmic optimizations: MLlib includes many optimizations to support efficient distributed learning and prediction.

The ALS algorithm for recommendation makes use of blocking to reduce JVM garbage collection overhead and to utilize higher-level linear algebra operations. Decision trees use ideas from the PLANET project (reference: http://dl.acm.org/citation.cfm?id=1687569), such as data-dependent feature discretization to reduce communication costs, and tree ensembles parallelize learning both within trees and across trees.
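The data-dependent feature discretization idea borrowed from PLANET can be sketched as quantile binning: continuous values are mapped to a fixed number of candidate split buckets computed from the data, so workers exchange compact per-bin statistics instead of raw feature values. A toy pure-Python version (MLlib's actual binning lives inside its decision tree implementation):

```python
def quantile_bins(values, num_bins):
    # Candidate split thresholds at evenly spaced quantiles of the data.
    s = sorted(values)
    return [s[(i * len(s)) // num_bins] for i in range(1, num_bins)]

def discretize(x, thresholds):
    # Bin id = index of the first threshold greater than x.
    for i, t in enumerate(thresholds):
        if x < t:
            return i
    return len(thresholds)

data = [0.1, 0.4, 0.35, 0.8, 0.9, 0.05, 0.6, 0.7]
bins = quantile_bins(data, num_bins=4)
ids = [discretize(x, bins) for x in data]
```

After this step, a split candidate is evaluated from histogram counts per bin, which is what makes the tree learning cheap to parallelize.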

Generalized linear models are learned using optimization algorithms that parallelize gradient computation, using fast C++-based linear algebra libraries for worker computations. Algorithms benefit from efficient communication primitives; in particular, tree-structured aggregation prevents the driver from being a bottleneck.

Model updates are combined partially on a small set of executors. These are then sent to the driver. This implementation reduces the load the driver has to handle. Tests showed that these functions reduce the aggregation time by an order of magnitude, especially on datasets with a large number of partitions.
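The tree-structured aggregation described above can be modeled in a few lines: rather than every partition sending its partial result straight to the driver, partials are combined pairwise in rounds on intermediate executors, so the driver merges only a handful of values at the end (this is a toy stand-in for Spark's `treeAggregate`, not its implementation):

```python
def tree_aggregate(partials, combine):
    # Combine partial results pairwise, level by level, as in a
    # reduction tree; returns the final value and the tree depth.
    rounds = 0
    while len(partials) > 1:
        partials = [combine(partials[i], partials[i + 1])
                    if i + 1 < len(partials) else partials[i]
                    for i in range(0, len(partials), 2)]
        rounds += 1
    return partials[0], rounds

# 8 partitions, each holding a partial gradient (here just a number):
total, depth = tree_aggregate([1, 2, 3, 4, 5, 6, 7, 8], lambda a, b: a + b)
```

With N partitions the driver-side work drops from O(N) merges to O(log N) levels of pairwise merges, which is where the order-of-magnitude aggregation speedup comes from.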

(Reference: https://databricks.com/blog/2014/09/22/spark-1-1-mllib-performance-improvements.html)

Pipeline API: practical machine learning pipelines often involve a sequence of data preprocessing, feature extraction, model fitting, and validation stages.

Most machine learning libraries do not provide native support for this diverse set of pipeline-construction functionalities. When handling large-scale datasets, wiring together an end-to-end pipeline is both labor-intensive and expensive in terms of network overhead.

To address these concerns, MLlib leverages Spark's ecosystem and includes a package for building such pipelines.

The spark.ml package eases the development and tuning of multistage learning pipelines by providing a uniform set of high-level APIs (http://arxiv.org/pdf/1505.06807.pdf). Its APIs also enable users to swap their own specialized algorithms in for a standard learning approach within a pipeline.
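The pattern behind these APIs can be sketched in plain Python: an Estimator's `fit()` produces a Transformer (a model), and a Pipeline chains stages by fitting each estimator and transforming the data as it goes. The class names mirror spark.ml's concepts, but this is a hypothetical stand-in, not Spark code:

```python
class Scaler:                       # an Estimator: fit() returns a model
    def fit(self, data):
        m = max(abs(x) for x in data)
        return ScalerModel(m)

class ScalerModel:                  # a Transformer produced by fit()
    def __init__(self, m):
        self.m = m
    def transform(self, data):
        return [x / self.m for x in data]

class Pipeline:
    def __init__(self, stages):
        self.stages = stages
    def fit(self, data):
        # Fit each estimator in order, feeding each stage the output
        # of the previous one; collect the fitted models.
        models = []
        for stage in self.stages:
            model = stage.fit(data) if hasattr(stage, "fit") else stage
            data = model.transform(data)
            models.append(model)
        return PipelineModel(models)

class PipelineModel:
    def __init__(self, models):
        self.models = models
    def transform(self, data):
        for m in self.models:
            data = m.transform(data)
        return data

model = Pipeline([Scaler()]).fit([2.0, -4.0, 1.0])
scaled = model.transform([2.0, -4.0, 1.0])
```

Because every stage exposes the same interface, swapping a custom algorithm into the pipeline is just a matter of replacing one stage object.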
