
How it works...

A Pentaho ETL process is generally made up of a set of jobs and transformations.

Transformations are workflows whose role is to act on a flow of data, typically by applying a set of basic steps to it. A transformation is usually made of the following (a minimal code sketch follows this list):

  • A set of input steps
  • A set of transformation steps
  • A set of output steps
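
To make this structure concrete, here is a minimal sketch using the Kettle (PDI) Java API that loads a transformation file and lists the steps it is made of. The file name sample.ktr is a placeholder, and package names may vary slightly between PDI versions:

    import org.pentaho.di.core.KettleEnvironment;
    import org.pentaho.di.trans.TransMeta;
    import org.pentaho.di.trans.step.StepMeta;

    public class InspectTransformation {
        public static void main(String[] args) throws Exception {
            // Initialize the Kettle environment (registers the step plugins)
            KettleEnvironment.init();

            // Load the transformation definition from a .ktr file (placeholder name)
            TransMeta transMeta = new TransMeta("sample.ktr");

            // Print the input, transformation, and output steps the flow is made of
            for (StepMeta step : transMeta.getSteps()) {
                System.out.println(step.getName());
            }
        }
    }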

Input steps read data from external sources and bring it into the transformation. Examples of input steps are as follows:

  • File input steps (text, Excel, properties, and others)
  • Table input steps
  • OLAP source input steps or other similar steps

Transformation steps apply elementary business rules to the flow of data; composing these elementary steps into an organized flow of operations is what makes up the process. Examples of transformation steps are those that perform the following actions (a conceptual sketch follows this list):

  • Perform operations on strings
  • Perform calculations
  • Join different flow paths
  • Apply scripts to the data so that the results end up in new fields
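
As a purely conceptual illustration of what an elementary step does to each row (this is plain Java, not the Kettle API, and the field names are made up), consider a step that normalizes a string and derives a new field from a calculation:

    import java.util.HashMap;
    import java.util.Map;

    public class ElementaryStepSketch {
        // In this illustration a row is simply a map from field names to values
        static Map<String, Object> applyStep(Map<String, Object> row) {
            Map<String, Object> out = new HashMap<>(row);
            // String operation: normalize the (hypothetical) country field
            out.put("country", ((String) row.get("country")).trim().toUpperCase());
            // Calculation: put the result into a new field (hypothetical VAT rate)
            out.put("gross_amount", (Double) row.get("net_amount") * 1.22);
            return out;
        }

        public static void main(String[] args) {
            Map<String, Object> row = new HashMap<>();
            row.put("country", " italy ");
            row.put("net_amount", 100.0);
            System.out.println(applyStep(row));
        }
    }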

Output steps send the data from the flow to external targets, such as databases, files, web services, and others. Therefore, we can say that transformations act as a sort of unit of work in the context of an entire ETL process. The more atomic and focused a transformation is, the easier it is to reuse in other ETL processes.
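
Because a transformation is a self-contained unit of work, it can also be executed on its own. A minimal sketch with the Kettle Java API, again assuming a placeholder file called sample.ktr, could look like this:

    import org.pentaho.di.core.KettleEnvironment;
    import org.pentaho.di.trans.Trans;
    import org.pentaho.di.trans.TransMeta;

    public class RunTransformation {
        public static void main(String[] args) throws Exception {
            KettleEnvironment.init();

            // Load and execute the transformation as a single unit of work
            TransMeta transMeta = new TransMeta("sample.ktr");
            Trans trans = new Trans(transMeta);
            trans.execute(null);        // no command-line arguments
            trans.waitUntilFinished();

            if (trans.getErrors() > 0) {
                System.err.println("The transformation finished with errors");
            }
        }
    }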

Jobs are workflows whose role is to orchestrate the execution of a set of tasks: they synchronize and prioritize the tasks and determine the order of execution based on the success or failure of the current task. These tasks either prepare the execution environment for the tasks that follow them in the workflow or manage the artifacts produced by the tasks that precede them. For example, there are tasks that manipulate files and directories in the local filesystem, tasks that move files between remote servers through FTP or SSH, and tasks that check the availability or the content of a table. Any job can call other jobs or transformations to build more complex processes. Generally speaking, therefore, jobs orchestrate the execution of jobs and transformations into larger ETL processes.
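
A job is launched in much the same way; the sketch below, with a placeholder file called sample.kjb, loads a file-based job, runs it, and checks the result that drives the success or failure paths:

    import org.pentaho.di.core.KettleEnvironment;
    import org.pentaho.di.core.Result;
    import org.pentaho.di.job.Job;
    import org.pentaho.di.job.JobMeta;

    public class RunJob {
        public static void main(String[] args) throws Exception {
            KettleEnvironment.init();

            // Load the job definition; null means no repository, a file-based job
            JobMeta jobMeta = new JobMeta("sample.kjb", null);

            Job job = new Job(null, jobMeta);
            job.start();                // the job runs in its own thread
            job.waitUntilFinished();

            // The result decides which path (success or failure) the workflow takes
            Result result = job.getResult();
            if (result.getNrErrors() > 0) {
                System.err.println("The job failed");
            }
        }
    }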

In our case, we have a very simple example with a job and a transformation to support the recipes' experiments. The transformation reads data from a text file that contains a set of customers by country. After the data is loaded, it filters the dataflow by country and writes the result to an Excel file. The filter uses a parameter that you set when you start the job, and it is applied in the filter step. The job checks whether the output file from a previous run exists; if so, it deletes it and then calls the transformation for a new extraction. The job also has failure paths to manage any error condition that could occur while the tasks are processed. The failure paths terminate with a step that aborts the job, marking it as failed.
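
For a job like this one, the country filter is driven by a named parameter. The sketch below shows one way to pass it programmatically before the job starts; the file name customer_export.kjb and the parameter name COUNTRY are placeholders (the recipe does not fix them here), and the same effect can be obtained on the command line with Kitchen's -param option.

    import org.pentaho.di.core.KettleEnvironment;
    import org.pentaho.di.job.Job;
    import org.pentaho.di.job.JobMeta;

    public class RunRecipeJob {
        public static void main(String[] args) throws Exception {
            KettleEnvironment.init();

            // Placeholder job file and parameter name: adapt them to your artifacts
            JobMeta jobMeta = new JobMeta("customer_export.kjb", null);
            jobMeta.setParameterValue("COUNTRY", "Italy");

            Job job = new Job(null, jobMeta);
            // Hand the named parameters over to the running job
            job.copyParametersFrom(jobMeta);
            job.activateParameters();

            job.start();
            job.waitUntilFinished();

            if (job.getResult().getNrErrors() > 0) {
                System.err.println("The job failed; see the failure path in the log");
            }
        }
    }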
