
How it works...

A Pentaho ETL process is generally built from a set of jobs and transformations.

Transformations are workflows whose role is to perform actions on a flow of data, typically by applying a set of basic action steps to it. A transformation is made up of:

  • A set of input steps
  • A set of transformation steps
  • A set of output steps

Input steps take data from external sources and bring it into the transformation. Examples of input steps are as follows:

  • File input steps (text, Excel, properties, and others)
  • Table input steps
  • OLAP source input steps and other similar steps

Transformation steps apply elementary business rules to the flow of data; composing these elementary steps into an organized flow of operations defines a process. Examples of transformation steps are those that perform the following actions:

  • Perform operations on strings
  • Perform calculations
  • Join different flow paths
  • Apply scripts to the data, storing the results in other fields

Output steps send the data from the flow to external targets, such as databases, files, or web services. Transformations therefore act as a sort of unit of work within an entire ETL process: the more atomic and focused a transformation is, the more easily it can be reused in other ETL processes.
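The input → transform → output structure described above can be sketched in plain Python. This is only an illustration of the concept, not Pentaho's API: the step names, fields, and sample data are all invented for the example.

```python
# Conceptual sketch of a transformation: input steps bring data into the
# flow, transformation steps act on each row, output steps send it out.
import csv
import io

def input_step(text):
    """Input step: bring rows from an external source into the flow."""
    return list(csv.DictReader(io.StringIO(text)))

def transform_steps(rows):
    """Transformation steps: a string operation and a calculation per row."""
    for row in rows:
        row["name"] = row["name"].upper()                     # operation on strings
        row["total"] = float(row["price"]) * int(row["qty"])  # calculation
    return rows

def output_step(rows):
    """Output step: send the flow to an external target (CSV text here)."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["name", "price", "qty", "total"])
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()

# The three stages compose into one unit of work over the dataflow.
source = "name,price,qty\nwidget,2.5,4\ngadget,1.0,3\n"
result = output_step(transform_steps(input_step(source)))
```

Because each stage only consumes and produces rows, the middle section can be swapped or recombined, which mirrors why atomic transformations are easy to reuse.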

Jobs are workflows whose role is to orchestrate the execution of a set of tasks: they synchronize and prioritize the tasks, ordering their execution based on the success or failure of the current task. These basic tasks either prepare the execution environment for tasks later in the workflow or manage the artifacts produced by preceding tasks. For example, there are tasks that manipulate files and directories in the local filesystem, tasks that move files between remote servers over FTP or SSH, and tasks that check the availability or the content of a table. Any job can call other jobs or transformations to build more complex processes. Generally speaking, therefore, jobs orchestrate the execution of jobs and transformations into larger ETL processes.
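The ordering-on-success-or-failure behavior can be sketched as a simple task runner. Again this is only a conceptual sketch, not how PDI jobs are implemented (they are designed graphically); the task names are hypothetical.

```python
# Conceptual sketch of a job: run tasks in order, branching on the
# success or failure of the current task.
def run_job(tasks):
    """Run each (name, callable) task in order; on failure, take the
    failure path, which aborts the job and marks it as failed."""
    log = []
    for name, task in tasks:
        try:
            task()
            log.append((name, "success"))
        except Exception:
            log.append((name, "failure"))
            log.append(("abort", "job failed"))
            break
    return log

def failing_task():
    raise RuntimeError("table not available")

# Hypothetical tasks: check a table, then run a transformation.
log = run_job([("check_table", lambda: None),
               ("run_transformation", lambda: None)])
failed_log = run_job([("check_table", failing_task),
                      ("run_transformation", lambda: None)])
```

Note how a failure short-circuits the workflow: later tasks never run, which is the ordering guarantee jobs provide.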

In our case, we have a very simple example with a job and a transformation to support the recipes' experiments. The transformation reads data from a text file containing a set of customers by country, filters the dataflow by country, and writes the result to an Excel file. The filter uses a parameter that you set when you start the job and that is referenced in the filter step. The job checks whether the output file from a previous run exists and, if so, deletes it, and then calls the transformation for a new extraction. The job also has failure paths to handle any error condition that might occur while the tasks are processed; each failure path ends in a step that aborts the job, marking it as failed.
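The recipe's flow can be sketched end to end. This is an illustrative Python approximation, not the actual PDI job: the file names and field names are invented, and plain CSV stands in for the recipe's Excel output to keep the sketch dependency-free.

```python
# Sketch of the recipe: a job that deletes any previous output file and
# then calls a transformation that filters customers by country.
import csv
import os
import tempfile

def run_transformation(in_path, out_path, country):
    """Transformation: read the customer file, filter by country, write out."""
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        writer.writerows(r for r in reader if r["country"] == country)

def run_job(in_path, out_path, country):
    """Job: remove the previous output if present, then run the transformation."""
    if os.path.exists(out_path):
        os.remove(out_path)                      # 'delete file' task
    try:
        run_transformation(in_path, out_path, country)
    except OSError as err:
        raise SystemExit(f"job failed: {err}")   # failure path aborts the job

# Demo with a temporary customer file (names and countries are made up).
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "customers.txt")
dst = os.path.join(workdir, "filtered.csv")
with open(src, "w", newline="") as f:
    f.write("name,country\nAda,Italy\nBob,France\nCleo,Italy\n")
run_job(src, dst, "Italy")
with open(dst, newline="") as f:
    filtered = list(csv.DictReader(f))
```

The country value plays the role of the job parameter: changing the argument to `run_job` changes what the filter step lets through, just as the parameter does in the recipe.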
