Machine Learning with Spark (Second Edition), by Rajdeep Dua, Manpreet Singh Ghotra, and Nick Pentreath
Data Pipeline in Apache Spark
As we saw in the MovieLens use case, it is quite common to run a sequence of machine learning algorithms to process and learn from data. Another example is a simple text document processing workflow, which can include several stages (a code sketch follows this list):
- Split each document's text into words
- Convert each document's words into a numerical feature vector
- Learn a prediction model from the feature vectors and labels
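In Spark ML, these three stages correspond to built-in components. The following is a minimal sketch, assuming the input DataFrame has a text column and a label column; the column names, the feature dimension of 1000, and the regularization settings are illustrative choices, not values from the text:

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Stage 1: split the raw text column into a column of words
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")

// Stage 2: hash the words into a fixed-size numerical feature vector
val hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")

// Stage 3: an Estimator that learns a prediction model
// from the "features" and "label" columns
val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.001)
```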
Spark MLlib represents such a workflow as a Pipeline, which consists of a sequence of PipelineStages (Transformers and Estimators) that are run in a specific order.
A Pipeline is specified as a sequence of stages, each of which is either a Transformer or an Estimator. A Transformer converts one DataFrame into another; an Estimator, on the other hand, is a learning algorithm that is fit on a DataFrame to produce a Transformer. Pipeline stages are run in order, and the input DataFrame is transformed as it passes through each stage.
For Transformer stages, the transform() method is called on the DataFrame. For Estimator stages, the fit() method is called to produce a Transformer (which becomes part of the PipelineModel, or fitted Pipeline), and that Transformer's transform() method is then called on the DataFrame.
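Putting this together, the sketch below (modelled on the standard Spark ML text-classification example) chains the three stages into a Pipeline, calls fit() to obtain a PipelineModel, and then calls transform() on new documents; the toy data, column names, and parameter values are illustrative assumptions:

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("TextPipeline").getOrCreate()

// Toy labelled documents: (id, text, label)
val training = spark.createDataFrame(Seq(
  (0L, "a b c d e spark", 1.0),
  (1L, "b d", 0.0),
  (2L, "spark f g h", 1.0),
  (3L, "hadoop mapreduce", 0.0)
)).toDF("id", "text", "label")

// The three stages from the workflow above
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.001)

// Chain the stages into a single Pipeline (itself an Estimator)
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

// fit() runs the stages in order: transform() is called on the Transformer
// stages, fit() on the LogisticRegression Estimator, and the resulting
// Transformers are collected into a PipelineModel
val model: PipelineModel = pipeline.fit(training)

// The fitted PipelineModel is a Transformer: transform() pushes new
// documents through tokenization, hashing, and the learned model
val test = spark.createDataFrame(Seq(
  (4L, "spark i j k"),
  (5L, "l m n"),
  (6L, "spark hadoop spark")
)).toDF("id", "text")

model.transform(test).select("id", "text", "prediction").show()
```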