
Spark MLlib and ML

MLlib is a library of user-friendly ML algorithms implemented in Scala. The same API is also exposed for other languages, such as Java, Python, and R. Spark MLlib provides support for local vectors and matrix data types stored on a single machine, as well as distributed matrices backed by one or more resilient distributed datasets (RDDs).
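As a quick illustration, here is a minimal sketch of creating these local data types with the spark.ml linalg package; you can paste it into spark-shell:

import org.apache.spark.ml.linalg.{Matrices, Vectors}

// A dense local vector: every value is stored explicitly
val dense = Vectors.dense(1.0, 0.0, 3.0)

// A sparse local vector of size 3 with non-zero entries at indices 0 and 2
val sparse = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))

// A 3x2 dense local matrix, with values laid out in column-major order
val matrix = Matrices.dense(3, 2, Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0))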

RDD is the primary data abstraction of Spark Core, the foundation of Apache Spark. It represents an immutable, partitioned collection of elements that can be operated on in parallel. The resiliency makes an RDD fault-tolerant: if a partition is lost, it can be recomputed from the RDD lineage graph. Because partitions are spread across the nodes of a Spark cluster, operations on an RDD are naturally distributed. An RDD can also be converted into a Dataset, a collection of partitioned data holding primitive values such as tuples or other objects.
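The following minimal sketch shows an RDD being created, transformed in parallel, and converted into a Dataset; the application name and the local master are placeholders for whatever your own setup uses:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("RDDExample")   // placeholder application name
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// parallelize() splits the collection into partitions
// that can be processed in parallel across the cluster
val rdd = spark.sparkContext.parallelize(1 to 10)

// Transformations such as map() are lazy; the reduce() action triggers execution
val sum = rdd.map(_ * 2).reduce(_ + _)
println(s"Sum of doubled values: $sum")

// Convert the RDD into a Dataset of primitive values
val ds = rdd.toDS()
ds.show()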

Spark ML is a newer set of ML APIs that lets users quickly assemble and configure practical machine learning pipelines on top of datasets, making it easier to combine multiple algorithms into a single pipeline. For example, an ML algorithm (called an Estimator) and a set of Transformers (for example, a StringIndexer, a StandardScaler, and a VectorAssembler) can be chained together as the stages of one pipeline, so you do not have to run and wire up each step by hand.
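As a concrete illustration, here is a minimal sketch of such a pipeline, assuming an existing SparkSession named spark (as in spark-shell); the tiny DataFrame, its column names, and the choice of logistic regression as the estimator are purely illustrative:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

// A tiny, made-up training set with one categorical and two numeric columns
val training = spark.createDataFrame(Seq(
  ("yes", 1.0, 0.5, 1.0),
  ("no",  0.0, 1.5, 0.0),
  ("yes", 1.5, 0.3, 1.0),
  ("no",  0.2, 2.0, 0.0)
)).toDF("category", "x1", "x2", "label")

// Transformers: index the string column, then assemble the features into a vector
val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
val assembler = new VectorAssembler()
  .setInputCols(Array("categoryIndex", "x1", "x2"))
  .setOutputCol("features")

// Estimator: logistic regression stands in for any ML algorithm here
val lr = new LogisticRegression().setMaxIter(10)

// Chain the transformers and the estimator together as pipeline stages
val pipeline = new Pipeline().setStages(Array(indexer, assembler, lr))
val model = pipeline.fit(training)
model.transform(training).select("features", "label", "prediction").show()

Calling fit() on the pipeline runs each stage in order and returns a single PipelineModel that can be reused to make predictions on new data.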

Interested readers can take a look at the Spark MLlib and ML guide at https://spark.apache.org/docs/latest/ml-guide.html.

At this point, I have to tell you something very useful. Since we will also be using the Spark MLlib and ML APIs in upcoming chapters, it is worth fixing one issue in advance. If you're a Windows user, you should know about a rather odd issue you may run into while working with Spark. Although Spark works on Windows, macOS, and Linux, when you use Eclipse or IntelliJ IDEA to develop Spark applications on Windows, you might face an I/O exception error, and consequently your application might not compile successfully or may be interrupted.

Spark needs a runtime environment for Hadoop on Windows too. Unfortunately, the binary distribution of Spark (v2.4.0, for example) does not contain Windows-native components such as winutils.exe or hadoop.dll. These are required (not optional) to run Hadoop on Windows; if you cannot provide this runtime environment, an I/O exception like the following will appear:

03/02/2019 11:11:10 ERROR util.Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.

You can tackle this issue on Windows and from IDEs such as Eclipse and IntelliJ IDEA as follows:

  1. Download winutils.exe from https://github.com/steveloughran/winutils/tree/master/hadoop-2.7.1/bin/.
  2. Copy it into the bin folder of your Spark distribution, for example, spark-2.2.0-bin-hadoop2.7/bin/.
  3. Select Project | Run Configurations... | Environment | New | and create a variable named HADOOP_HOME, then put the path in the Value field, for example, c:/spark-2.2.0-bin-hadoop2.7/ | OK | Apply | Run. Note that Hadoop looks for winutils.exe under %HADOOP_HOME%\bin, so HADOOP_HOME must point to the folder that contains bin, not to the bin folder itself.
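Alternatively, if you prefer not to touch the IDE's run configuration, you can set the hadoop.home.dir JVM system property programmatically before the SparkSession is created; Hadoop's shell utilities check this property before falling back to the HADOOP_HOME environment variable. The sketch below assumes the same example path as above, which must be the folder that contains bin\winutils.exe:

import org.apache.spark.sql.SparkSession

object WinutilsFix {
  def main(args: Array[String]): Unit = {
    // Must be set before any Spark/Hadoop code runs; the value is the
    // folder that CONTAINS bin\winutils.exe, not the bin folder itself
    System.setProperty("hadoop.home.dir", "c:/spark-2.2.0-bin-hadoop2.7")

    val spark = SparkSession.builder()
      .appName("WinutilsFix")   // placeholder application name
      .master("local[*]")
      .getOrCreate()

    // ... your Spark code goes here ...

    spark.stop()
  }
}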