
MLlib supported methods and developer APIs

MLlib provides fast and distributed implementations of learning algorithms, including various linear models, Naive Bayes, SVMs, and ensembles of decision trees (also known as random forests) for classification and regression problems, and Alternating Least Squares (with explicit and implicit feedback) for collaborative filtering. It also supports k-means clustering and principal component analysis (PCA) for clustering and dimensionality reduction.
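To make the k-means idea concrete, here is a minimal single-machine sketch of Lloyd's algorithm, the computation that MLlib distributes across partitions. This is an illustration, not MLlib's API; the function name and initialization choice are my own.

```python
# Minimal single-machine k-means sketch (Lloyd's algorithm); illustrative
# only, not MLlib's implementation or API.
def kmeans(points, k, iters=10):
    # Initialize centers with the first k points (an illustrative choice;
    # MLlib uses smarter initialization such as k-means||).
    centers = [list(p) for p in points[:k]]
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[dists.index(min(dists))].append(p)
        # Update step: move each center to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:
                dim = len(cluster[0])
                centers[i] = [sum(p[d] for p in cluster) / len(cluster)
                              for d in range(dim)]
    return centers

points = [(0.0, 0.0), (0.1, 0.2), (9.0, 9.0), (9.2, 8.8)]
centers = kmeans(points, k=2)
```

In the distributed setting, the assignment and partial-sum steps run on each partition of the data, and only the per-cluster sums and counts travel back to be combined.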

The library provides some low-level primitives and basic utilities for convex optimization (http://spark.apache.org/docs/latest/mllib-optimization.html), distributed linear algebra (with support for vectors and matrices), statistical analysis (using Breeze as well as native functions), and feature extraction, and it supports various I/O formats, including native support for the LIBSVM format.
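The LIBSVM text format stores one example per line as a label followed by sparse `index:value` pairs with 1-based indices. As a sketch of what a loader has to do (this standalone parser is illustrative, not MLlib's loader):

```python
# Illustrative parser for one line of the LIBSVM text format
# ("label index:value index:value ..."); not MLlib's loader.
def parse_libsvm_line(line):
    parts = line.strip().split()
    label = float(parts[0])
    # Indices in LIBSVM files are 1-based; store them as given,
    # mapping index -> value so zero entries stay implicit.
    features = {}
    for item in parts[1:]:
        idx, value = item.split(":")
        features[int(idx)] = float(value)
    return label, features

label, features = parse_libsvm_line("1 3:0.5 7:1.2")
```

The sparse representation matters at scale: only non-zero entries are stored and shipped between nodes.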

It also supports data integration via Spark SQL, as well as model export via PMML (https://en.wikipedia.org/wiki/Predictive_Model_Markup_Language) (Guazzelli et al., 2009). You can find more information about PMML support at this link: https://spark.apache.org/docs/1.6.0/mllib-pmml-model-export.html.

Algorithmic optimizations: MLlib includes many optimizations to support efficient distributed learning and prediction.

The ALS algorithm for recommendation makes use of blocking to reduce JVM garbage collection overhead and to exploit higher-level linear algebra operations. Decision trees use ideas from the PLANET project (reference: http://dl.acm.org/citation.cfm?id=1687569), such as data-dependent feature discretization to reduce communication costs, and tree ensembles parallelize learning both within trees and across trees.
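Data-dependent feature discretization means choosing a small set of split candidates per feature from the data's own quantiles, so tree learning communicates bin statistics rather than raw values. A local sketch of the idea (function names are illustrative, not MLlib's API):

```python
# Illustrative sketch of data-dependent feature discretization: continuous
# values are bucketed into a fixed number of bins whose boundaries come
# from the data's quantiles, so tree learning only has to consider a small
# set of split candidates per feature.
def quantile_bins(values, num_bins):
    ordered = sorted(values)
    # Pick num_bins - 1 interior cut points at evenly spaced quantiles.
    return [ordered[int(len(ordered) * q / num_bins)]
            for q in range(1, num_bins)]

def discretize(value, cuts):
    # Return the index of the first bin whose upper cut exceeds the value.
    for i, cut in enumerate(cuts):
        if value < cut:
            return i
    return len(cuts)

values = [0.1, 0.4, 0.35, 0.8, 0.9, 0.05, 0.6, 0.7]
cuts = quantile_bins(values, num_bins=4)
bins = [discretize(v, cuts) for v in values]
```

With, say, 32 bins per feature, each worker only needs to send per-bin label statistics to evaluate splits, instead of shuffling the underlying data.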

Generalized linear models are learned using optimization algorithms that parallelize gradient computation, using fast C++-based linear algebra libraries for worker computations. Algorithms also benefit from efficient communication primitives; in particular, tree-structured aggregation prevents the driver from being a bottleneck.

Model updates are combined partially on a small set of executors before being sent to the driver, which reduces the load the driver has to handle. Tests showed that these functions reduce aggregation time by an order of magnitude, especially on datasets with a large number of partitions.
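The multi-level combining described above can be sketched locally as follows. This mirrors the idea behind Spark's `treeAggregate`/`treeReduce`, but the code itself is an illustration, not Spark's implementation:

```python
# Sketch of tree-structured aggregation: partial results from partitions
# are combined in rounds of small groups, so the final reducer (the driver,
# in Spark) receives a handful of pre-combined values instead of one
# message per partition.
def tree_aggregate(partials, combine, fanin=2):
    level = list(partials)
    # Keep combining groups of `fanin` results until few remain; in Spark
    # these rounds run on executors, and only the last values reach the driver.
    while len(level) > fanin:
        level = [_combine_group(level[i:i + fanin], combine)
                 for i in range(0, len(level), fanin)]
    return _combine_group(level, combine)

def _combine_group(group, combine):
    result = group[0]
    for item in group[1:]:
        result = combine(result, item)
    return result

# Example: summing per-partition partial gradients from 8 partitions.
partials = [1, 2, 3, 4, 5, 6, 7, 8]
total = tree_aggregate(partials, lambda a, b: a + b)
```

With a flat reduce, the driver would process all 8 partials itself; with the tree, it only combines the final 2.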

(Reference: https://databricks.com/blog/2014/09/22/spark-1-1-mllib-performance-improvements.html)

Pipeline API: Practical machine learning pipelines often involve a sequence of data preprocessing, feature extraction, model fitting, and validation stages.

Most machine learning libraries do not provide native support for the diverse set of functionality required for pipeline construction. When handling large-scale datasets, wiring together an end-to-end pipeline is both labor-intensive and expensive in terms of network overhead.

Leveraging Spark's ecosystem, MLlib includes a package aimed at addressing these concerns.

The spark.ml package eases the development and tuning of multistage learning pipelines by providing a uniform set of high-level APIs (http://arxiv.org/pdf/1505.06807.pdf). Its APIs enable users to swap in their own specialized algorithms in place of standard learning approaches.
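The core pattern behind such pipelines is a chain of stages, each with a fit step (learning state from the data) and a transform step (applying it), where each stage's output feeds the next. A conceptual sketch of that pattern, with class and method names that are illustrative rather than spark.ml's actual API:

```python
# Conceptual sketch of the fit/transform pipeline pattern that spark.ml's
# Pipeline API formalizes; names are illustrative, not spark.ml's API.
class Scaler:
    def fit(self, data):
        # Learn the maximum so transform can scale values into [0, 1].
        self.max = max(data)
        return self

    def transform(self, data):
        return [x / self.max for x in data]

class Threshold:
    def __init__(self, cutoff):
        self.cutoff = cutoff

    def fit(self, data):
        return self  # stateless stage: nothing to learn

    def transform(self, data):
        return [1 if x >= self.cutoff else 0 for x in data]

class Pipeline:
    def __init__(self, stages):
        self.stages = stages

    def fit_transform(self, data):
        # Fit each stage on the current data, then feed its output onward.
        for stage in self.stages:
            data = stage.fit(data).transform(data)
        return data

pipeline = Pipeline([Scaler(), Threshold(cutoff=0.5)])
labels = pipeline.fit_transform([2.0, 8.0, 10.0, 4.0])
```

Because every stage shares the same interface, any stage can be swapped for a user-defined one, which is the extensibility point the paragraph above describes.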
