
Data Pipeline in Apache Spark

As we saw in the MovieLens use case, it is quite common to run a sequence of machine learning algorithms to process and learn from data. Another example is a simple text document processing workflow, which can include several stages:

  • Split the document's text into words
  • Convert the document's words into a numerical feature vector
  • Learn a prediction model from feature vectors and labels
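To make the three stages concrete, here is a minimal sketch in plain Python. It does not use Spark at all; the function names (`tokenize`, `hashing_tf`, `train_perceptron`), the 16-slot feature vector, and the choice of a perceptron as the learner are all assumptions made for illustration. In Spark, these roles would be played by MLlib stages rather than hand-written functions.

```python
import zlib


def tokenize(text):
    """Stage 1: split the document's text into lowercase words."""
    return text.lower().split()


def hashing_tf(words, num_features=16):
    """Stage 2: map words to a fixed-length term-frequency vector."""
    vec = [0.0] * num_features
    for word in words:
        # crc32 is stable across runs, unlike Python's built-in hash().
        index = zlib.crc32(word.encode("utf-8")) % num_features
        vec[index] += 1.0
    return vec


def train_perceptron(vectors, labels, epochs=10):
    """Stage 3: learn a linear model from feature vectors and 0/1 labels."""
    weights = [0.0] * len(vectors[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            score = sum(w * xi for w, xi in zip(weights, x)) + bias
            predicted = 1 if score > 0 else 0
            error = y - predicted
            if error != 0:
                # Nudge the weights toward the correct label.
                weights = [w + error * xi for w, xi in zip(weights, x)]
                bias += error
    return weights, bias


# Hypothetical two-document training set.
docs = ["Spark is fast", "slow batch job"]
labels = [1, 0]
features = [hashing_tf(tokenize(d)) for d in docs]
weights, bias = train_perceptron(features, labels)
```

Each function consumes the output of the previous one, which is the shape that a Pipeline formalizes.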

Spark MLlib represents such a workflow as a Pipeline, which consists of a sequence of Pipeline Stages (Transformers and Estimators) that are run in a specific order.

A Pipeline is specified as a sequence of stages. Each stage is either a Transformer or an Estimator. A Transformer converts one DataFrame into another. An Estimator, on the other hand, is a learning algorithm that is fit on a DataFrame. Pipeline stages are run in order, and the input DataFrame is transformed as it passes through each stage.

For Transformer stages, the transform() method is called on the DataFrame. For Estimator stages, the fit() method is called to produce a Transformer (which becomes part of the PipelineModel, or fitted Pipeline), and that Transformer's transform() method is then called on the DataFrame.
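The fit and transform mechanics described above can be sketched with a toy model in plain Python. This is not Spark's implementation: the classes below only borrow the names `Pipeline` and `PipelineModel`, a list of dictionaries stands in for a DataFrame, and the stages (`Lowercase`, `MeanLengthEstimator`, `LongTextFlagger`) are hypothetical. It also simplifies by transforming the data after every stage during fitting.

```python
class Lowercase:
    """A Transformer: it only has transform()."""

    def transform(self, rows):
        return [dict(r, text=r["text"].lower()) for r in rows]


class LongTextFlagger:
    """The fitted Transformer produced by MeanLengthEstimator.fit()."""

    def __init__(self, threshold):
        self.threshold = threshold

    def transform(self, rows):
        return [dict(r, is_long=len(r["text"]) > self.threshold) for r in rows]


class MeanLengthEstimator:
    """An Estimator: fit() learns from the data and returns a Transformer."""

    def fit(self, rows):
        mean_length = sum(len(r["text"]) for r in rows) / len(rows)
        return LongTextFlagger(mean_length)


class PipelineModel:
    """A fitted pipeline: every stage is a Transformer."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline:
    """An ordered list of Transformers and Estimators."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                # Estimator stage: fit() yields a Transformer.
                stage = stage.fit(rows)
            # Pass the transformed data on to the next stage.
            rows = stage.transform(rows)
            fitted.append(stage)
        return PipelineModel(fitted)


rows = [{"text": "Spark"}, {"text": "Machine Learning Pipelines"}]
model = Pipeline([Lowercase(), MeanLengthEstimator()]).fit(rows)
result = model.transform(rows)
```

Here the mean text length learned during `fit()` is 15.5 characters, so only the second row is flagged as long. The point of the sketch is that the fitted model contains no Estimators: each one has been replaced by the Transformer it produced.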
