
Data Pipeline in Apache Spark

As we saw in the MovieLens use case, it is quite common to run a sequence of machine learning algorithms to process and learn from data. Another example is a simple text document processing workflow, which can include several stages:

  • Split the document's text into words
  • Convert the document's words into a numerical feature vector
  • Learn a prediction model from feature vectors and labels
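To make the first two stages concrete, here is a minimal plain-Python sketch that does not use Spark at all. The function names `tokenize` and `hash_features`, the bucket count of 8, and the sample sentence are assumptions made for illustration. The third stage, learning a model, is left out here.

```python
import zlib


def tokenize(text):
    """Stage 1: split a document's text into lowercase words."""
    return text.lower().split()


def hash_features(words, num_features=8):
    """Stage 2: map words to a fixed-length count vector (the hashing trick)."""
    vector = [0.0] * num_features
    for word in words:
        # crc32 is deterministic across runs, unlike Python's built-in hash()
        index = zlib.crc32(word.encode("utf-8")) % num_features
        vector[index] += 1.0
    return vector


words = tokenize("Spark makes big data simple")
features = hash_features(words)
```

Each word increments exactly one bucket, so the vector always has `num_features` entries and its entries sum to the number of words. Two different words may land in the same bucket; that collision risk is the usual trade-off of hashing into a small, fixed number of features.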

Spark MLlib represents such a workflow as a Pipeline, which consists of a sequence of pipeline stages (Transformers and Estimators) that are run in a specific order.

A Pipeline is specified as a sequence of stages, and each stage is either a Transformer or an Estimator. A Transformer converts one DataFrame into another. An Estimator, on the other hand, is a learning algorithm that is fit on a DataFrame to produce a Transformer. Pipeline stages are run in order, and the input DataFrame is transformed as it passes through each stage.

For Transformer stages, the transform() method is called on the DataFrame. For Estimator stages, the fit() method is called to produce a Transformer (which becomes part of the PipelineModel, or fitted Pipeline), and that Transformer's transform() method is then called on the DataFrame.
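The fitting procedure described above can be sketched as a toy model in plain Python. This is not Spark's implementation or API: the classes below are hypothetical stand-ins that borrow the same concept names, a list of dicts stands in for a DataFrame, and `KeywordLearner` is an invented, deliberately trivial learning algorithm.

```python
class Transformer:
    def transform(self, rows):
        raise NotImplementedError


class Estimator:
    def fit(self, rows):
        raise NotImplementedError


class Tokenizer(Transformer):
    """Adds a 'words' field by splitting the 'text' field."""
    def transform(self, rows):
        return [dict(row, words=row["text"].lower().split()) for row in rows]


class KeywordModel(Transformer):
    """Fitted model: predicts 1.0 if any learned keyword occurs in 'words'."""
    def __init__(self, keywords):
        self.keywords = keywords

    def transform(self, rows):
        return [
            dict(row, prediction=1.0 if self.keywords & set(row["words"]) else 0.0)
            for row in rows
        ]


class KeywordLearner(Estimator):
    """Toy learner: keeps the words seen only in positively labeled rows."""
    def fit(self, rows):
        positive, negative = set(), set()
        for row in rows:
            (positive if row["label"] == 1.0 else negative).update(row["words"])
        return KeywordModel(positive - negative)


class PipelineModel(Transformer):
    """A fitted pipeline: every stage is a Transformer."""
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline(Estimator):
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if isinstance(stage, Estimator):
                stage = stage.fit(rows)   # Estimator -> fitted Transformer
            rows = stage.transform(rows)  # pass the data on to the next stage
            fitted.append(stage)
        return PipelineModel(fitted)


train = [
    {"text": "spark is fast", "label": 1.0},
    {"text": "hadoop is slow", "label": 0.0},
]
model = Pipeline([Tokenizer(), KeywordLearner()]).fit(train)
predictions = model.transform([{"text": "Spark rocks"}, {"text": "plain text"}])
```

Note that the pipeline itself plays the Estimator role and the fitted pipeline plays the Transformer role, so the result of fit() can be applied to new data with a single transform() call.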
