Data Pipeline in Apache Spark

As we saw in the MovieLens use case, it is quite common to run a sequence of machine learning algorithms to process and learn from data. Another example is a simple text document processing workflow, which can include several stages:

  • Split the document's text into words
  • Convert the document's words into a numerical feature vector
  • Learn a prediction model from feature vectors and labels
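To make the three stages concrete, here is a minimal sketch in plain Python. It does not use Spark at all. The function names, the toy documents, the 16-bucket hashing scheme, and the choice of a perceptron as the learner are all assumptions made for illustration, not something the text prescribes.

```python
import zlib


def tokenize(text):
    """Stage 1: split a document's text into lowercase words."""
    return text.lower().split()


def hashed_term_frequencies(words, num_features=16):
    """Stage 2: turn words into a fixed-length count vector (hashing trick)."""
    vector = [0.0] * num_features
    for word in words:
        # crc32 is stable across runs, unlike Python's built-in hash() for str.
        index = zlib.crc32(word.encode("utf-8")) % num_features
        vector[index] += 1.0
    return vector


def train_perceptron(vectors, labels, epochs=10):
    """Stage 3: learn weights and a bias from feature vectors and 0/1 labels."""
    weights = [0.0] * len(vectors[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            score = sum(w * xi for w, xi in zip(weights, x)) + bias
            predicted = 1 if score > 0 else 0
            error = y - predicted  # -1, 0, or +1
            if error != 0:
                weights = [w + error * xi for w, xi in zip(weights, x)]
                bias += error
    return weights, bias


def predict(model, vector):
    """Apply a trained (weights, bias) pair to one feature vector."""
    weights, bias = model
    score = sum(w * xi for w, xi in zip(weights, vector)) + bias
    return 1 if score > 0 else 0


# Hypothetical toy data: two short documents with binary labels.
documents = ["spark is fast", "slow batch job"]
labels = [1, 0]

vectors = [hashed_term_frequencies(tokenize(d)) for d in documents]
model = train_perceptron(vectors, labels)
```

Each function consumes the output of the previous one, which is the dependency that a pipeline abstraction makes explicit.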

Spark MLlib represents such a workflow as a Pipeline, which consists of a sequence of PipelineStages (Transformers and Estimators) that are run in a specific order.

A Pipeline is specified as a sequence of stages, and each stage is either a Transformer or an Estimator. A Transformer converts one DataFrame into another, for example by appending a column of features or predictions. An Estimator, on the other hand, is a learning algorithm that is fit on a DataFrame to produce a Transformer. Pipeline stages are run in order, and the input DataFrame is transformed as it passes through each stage.

For Transformer stages, the transform() method is called on the DataFrame. For Estimator stages, the fit() method is called to produce a Transformer, which becomes part of the PipelineModel (the fitted Pipeline). That Transformer's transform() method is then called on the DataFrame, so that the following stages receive the transformed data.
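The fit and transform mechanics described above can be mimicked with a small toy model in plain Python. This is only a sketch of the idea, not Spark's implementation: the classes ToyPipeline, ToyPipelineModel, AddOne, MaxScaler, and ScaleModel are hypothetical, and a plain list of numbers stands in for a DataFrame.

```python
class Transformer:
    """Anything that can turn one dataset into another."""

    def transform(self, rows):
        raise NotImplementedError


class Estimator:
    """Anything that can be fit on a dataset to produce a Transformer."""

    def fit(self, rows):
        raise NotImplementedError


class AddOne(Transformer):
    """A stateless Transformer: adds 1 to every value."""

    def transform(self, rows):
        return [r + 1 for r in rows]


class ScaleModel(Transformer):
    """The Transformer produced by fitting MaxScaler."""

    def __init__(self, max_value):
        self.max_value = max_value

    def transform(self, rows):
        return [r / self.max_value for r in rows]


class MaxScaler(Estimator):
    """An Estimator: learns the maximum value and returns a ScaleModel."""

    def fit(self, rows):
        return ScaleModel(max(rows))


class ToyPipelineModel(Transformer):
    """A fitted pipeline: contains only Transformers, applied in order."""

    def __init__(self, transformers):
        self.transformers = transformers

    def transform(self, rows):
        for transformer in self.transformers:
            rows = transformer.transform(rows)
        return rows


class ToyPipeline(Estimator):
    """An unfitted pipeline: a list of Transformers and Estimators."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if isinstance(stage, Estimator):
                stage = stage.fit(rows)  # Estimator becomes a Transformer
            fitted.append(stage)
            rows = stage.transform(rows)  # feed transformed data to the next stage
        return ToyPipelineModel(fitted)


pipeline = ToyPipeline([AddOne(), MaxScaler()])
model = pipeline.fit([1, 3, 7])      # AddOne gives [2, 4, 8]; MaxScaler learns max = 8
result = model.transform([1, 3, 7])  # [0.25, 0.5, 1.0]
```

Note that the fitted model holds a ScaleModel rather than the MaxScaler itself, so applying it to new data reuses the maximum learned during fit() instead of recomputing it.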
