
MLlib

MLlib is Apache Spark's machine learning library. It is scalable and includes many commonly used machine learning algorithms. MLlib has built-in support for:

  • Handling data types in the form of vectors and matrices
  • Computing basic statistics such as summary statistics and correlations, producing simple random and stratified samples, and conducting simple hypothesis tests
  • Performing classification and regression modeling
  • Collaborative filtering
  • Clustering
  • Performing dimensionality reduction
  • Conducting feature extraction and transformation
  • Frequent pattern mining
  • Optimization primitives for developers
  • Exporting PMML models
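
To make the basic statistics item in the list above concrete, here is a minimal plain-Python sketch of per-column summary statistics and a Pearson correlation over a small set of dense vectors. It does not call MLlib; the function names `column_summary` and `pearson` are hypothetical, the data is made up for illustration, and the sample-variance convention (dividing by n - 1) is an assumption of this sketch.

```python
import math


def column_summary(rows):
    """Return (means, variances) per column for a list of equal-length rows.

    Variance uses the n - 1 denominator (an assumption of this sketch).
    """
    n = len(rows)
    dims = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    variances = [
        sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1)
        for j in range(dims)
    ]
    return means, variances


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Illustrative data: four 2-dimensional vectors; the second column is
# exactly twice the first, so the two columns are perfectly correlated.
rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]

means, variances = column_summary(rows)
r = pearson([row[0] for row in rows], [row[1] for row in rows])

print("means:", means)
print("variances:", variances)
print("correlation:", r)
```

MLlib computes statistics of this kind in a distributed fashion over data spread across a cluster; the single-machine version here is only meant to show what the quantities are.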

MLlib is still under active development, and new algorithms are expected to be added with each release.

In line with Apache Spark's computing philosophy, MLlib is built for ease of use and deployment, with high performance.

MLlib uses the linear algebra package Breeze, which depends on netlib-java and jblas. Both netlib-java and jblas in turn depend on native Fortran routines, so users need to install the gfortran runtime library if it is not already present on their nodes. MLlib will throw a linking error if it cannot detect these libraries automatically.
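
As a rough pre-flight check for the native dependencies described above, the following sketch asks the local system linker whether it can resolve a few library names. It uses only Python's standard `ctypes.util.find_library`; the helper name `native_library_status` and the list of library names are assumptions made for illustration, and the lookup MLlib itself performs may differ from this one.

```python
from ctypes.util import find_library


def native_library_status(names=("gfortran", "blas", "lapack")):
    """Map each library name to what the system linker resolves, or None.

    The default names are illustrative assumptions, not a list taken
    from MLlib's documentation.
    """
    return {name: find_library(name) for name in names}


status = native_library_status()
for name, resolved in status.items():
    # find_library returns a file name or path when found, otherwise None.
    print(f"{name}: {resolved if resolved else 'not found'}")
```

This only inspects the machine it runs on, so on a cluster it would have to be run on every node that executes MLlib code.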

Note

For MLlib use cases and further details on how to use MLlib, please visit http://spark.apache.org/docs/latest/mllib-guide.html

Other ML libraries

As discussed in the previous section, MLlib provides many frequently used algorithms, such as regression and classification. These basics, however, are not enough for more complex machine learning work.

Waiting for the Apache Spark team to add every needed ML algorithm could take a long time. The good news is that many third parties have contributed ML libraries to Apache Spark.

IBM has contributed its machine learning library, SystemML, to Apache Spark.

Beyond what MLlib provides, SystemML offers many additional ML algorithms, including missing data imputation, SVM, GLM, ARIMA, and non-linear optimizers, as well as some graphical modeling and matrix factorization algorithms.

Developed by the IBM Almaden Research group, SystemML is an engine for distributed machine learning that can scale to arbitrarily large data sizes. It provides the following benefits:

  • Unifies the fractured machine learning environments
  • Gives the core Spark ecosystem a complete set of DML (declarative machine learning)
  • Allows a data scientist to focus on the algorithm, not the implementation
  • Improves time to value for data science teams
  • Establishes a de facto standard for reusable machine learning routines

SystemML is modeled after R syntax and semantics, and provides the ability to author new algorithms via its own language.

Through SparkR's integration with R, Apache Spark users can also draw on thousands of R packages for machine learning algorithms when needed. As discussed in later sections of this chapter, the SparkR notebook makes this very easy.

Note

For more about IBM SystemML, please visit http://researcher.watson.ibm.com/researcher/files/us-ytian/systemML.pdf
