- Machine Learning with Scala Quick Start Guide
- Md. Rezaul Karim
Spark MLlib and ML
MLlib is a library that provides user-friendly ML algorithms that are implemented using Scala. The same API is then exposed to provide support for other languages such as Java, Python, and R. Spark MLlib provides support for local vectors and matrix data types stored on a single machine, as well as distributed matrices backed by one or multiple resilient distributed datasets (RDDs).
RDD is the primary data abstraction of Apache Spark (provided by the Spark Core module) and represents an immutable, partitioned collection of elements that can be operated on in parallel. Resiliency makes an RDD fault-tolerant: if a partition is lost, it can be recomputed from the RDD's lineage graph. RDDs enable distributed computing even when the data is spread across multiple nodes in a Spark cluster. An RDD can also be converted into a dataset, that is, a collection of partitioned data with primitive values such as tuples or other objects.
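To make this concrete, here is a minimal sketch of creating an RDD, transforming it in parallel, and converting it into a typed Dataset. It assumes a `SparkSession` named `spark` already exists (for example, created with `SparkSession.builder().master("local[*]").getOrCreate()`); the column data is illustrative only:

```scala
import spark.implicits._

// Create an RDD from a local collection; Spark partitions it for parallel work
val rdd = spark.sparkContext.parallelize(Seq(("alice", 1), ("bob", 2)))

// Transformations are recorded in the lineage graph, which is what lets
// Spark recompute lost partitions and makes the RDD fault-tolerant
val doubled = rdd.mapValues(_ * 2)

// Convert the RDD of tuples into a typed Dataset
val ds = doubled.toDS()
```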
Spark ML is a new set of ML APIs that allows users to quickly assemble and configure practical machine learning pipelines on top of datasets, which makes it easier to combine multiple algorithms into a single pipeline. For example, an ML algorithm (called an estimator) and a set of transformers (for example, a StringIndexer, a StandardScaler, and a VectorAssembler) can be chained together so that the whole ML task runs as pipeline stages, without having to run and manage each step separately.
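The chaining described above can be sketched as follows. This is a minimal, assumed example: the column names (`category`, `f1`, `f2`) and the training DataFrame `trainingDF` are placeholders, not from the original text:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

// Transformer stages: index the string label and assemble feature columns
val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("label")
val assembler = new VectorAssembler()
  .setInputCols(Array("f1", "f2"))
  .setOutputCol("features")

// Estimator stage: the ML algorithm to be fitted
val lr = new LogisticRegression().setMaxIter(10)

// Chain everything into one pipeline; fit() runs the stages in order
val pipeline = new Pipeline().setStages(Array(indexer, assembler, lr))
// val model = pipeline.fit(trainingDF)  // trainingDF assumed to exist
```

Calling `pipeline.fit()` applies each transformer and fits the estimator in sequence, returning a single `PipelineModel` that can be applied to new data with `transform()`.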
At this point, I have to tell you something very useful: since we will be using the Spark MLlib and ML APIs in upcoming chapters too, it is worth fixing some issues in advance. If you're a Windows user, let me warn you about a very weird issue that you may experience while working with Spark. Spark works on Windows, macOS, and Linux, but while using Eclipse or IntelliJ IDEA to develop Spark applications on Windows, you might face an I/O exception error and, consequently, your application might not compile successfully or may be interrupted.
Spark needs a runtime environment for Hadoop on Windows too. Unfortunately, the binary distribution of Spark (v2.4.0, for example) does not contain Windows-native components such as winutils.exe or hadoop.dll, yet these are required (not optional) to run Hadoop on Windows. If you cannot provide this runtime environment, an I/O exception such as the following will appear:
03/02/2019 11:11:10 ERROR util.Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
There are two ways to tackle this issue on Windows and from IDEs such as Eclipse and IntelliJ IDEA:
- Download winutils.exe from https://github.com/steveloughran/winutils/tree/master/hadoop-2.7.1/bin/ and copy it inside the bin folder of the Spark distribution—for example, spark-2.2.0-bin-hadoop2.7/bin/.
- Select Project | Run Configurations... | Environment | New | and create a variable named HADOOP_HOME, then put the path in the Value field. Here is an example: c:/spark-2.2.0-bin-hadoop2.7/bin/ | OK | Apply | Run.
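As an alternative to setting HADOOP_HOME in the IDE's run configuration, you can set the equivalent `hadoop.home.dir` system property from application code before the SparkSession is created. The path below is an example only; it must point at the folder whose bin subfolder contains winutils.exe:

```scala
// Example path only — adjust to where your Spark/winutils distribution lives.
// Hadoop looks for ${hadoop.home.dir}/bin/winutils.exe at startup.
System.setProperty("hadoop.home.dir", "c:/spark-2.2.0-bin-hadoop2.7")
```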