Instant Pentaho Data Integration Kitchen
Sergio Ramazzina
How it works...
A Pentaho ETL process is generally composed of a set of jobs and transformations.
Transformations are workflows whose role is to perform actions on a flow of data, typically by applying a set of basic steps to it. A transformation is usually made up of:
- A set of input steps
- A set of transformation steps
- A set of output steps
Input steps read data from external sources and bring it into the transformation. Examples of input steps are as follows:
- File input steps (text, Excel, properties, and so on)
- Table input steps
- OLAP source input steps or other similar steps
Transformation steps apply elementary business rules to the flow of data; the composition of these elementary steps into an organized flow of operations represents a process. Examples of transformation steps are those that perform the following actions:
- Perform operations on strings
- Perform calculations
- Join different flow paths
- Apply scripts to the data, storing the results in other fields
Output steps send the data from the flow to external targets, such as databases, files, or web services. Therefore, we can say that transformations act as a sort of unit of work in the context of an entire ETL process. The more atomic and focused a transformation is, the more easily we can reuse it across other ETL processes.
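Because a transformation is a self-contained unit of work, it can also be run on its own from the command line with Pan, Kitchen's companion tool for transformations. Here is a minimal sketch, assuming you are in the PDI installation directory and have a hypothetical transformation file named read_customers.ktr:

```bash
# Run a single transformation with Pan (use Pan.bat on Windows).
# -file points at the .ktr file; -level sets the logging verbosity.
./pan.sh -file=read_customers.ktr -level=Basic
```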
Jobs are workflows whose role is to orchestrate the execution of a set of tasks: they synchronize and prioritize those tasks, deciding the order of execution based on the success or failure of the current task. These tasks are basic operations that either prepare the execution environment for the tasks that follow in the workflow or manage the artifacts produced by the tasks that precede them. For example, there are tasks that manipulate files and directories in the local filesystem, tasks that move files between remote servers over FTP or SSH, and tasks that check for the existence of a table or for its content. Any job can call other jobs or transformations, so more complex processes can be designed. Generally speaking, therefore, jobs orchestrate the execution of jobs and transformations into larger ETL processes.
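Jobs are what Kitchen, the tool this book focuses on, runs from the command line. A minimal sketch, assuming a hypothetical job file named export_customers.kjb:

```bash
# Run a job with Kitchen (use Kitchen.bat on Windows).
# -file points at the .kjb job file; -level sets the logging verbosity;
# -logfile sends the log to a file instead of the console.
./kitchen.sh -file=export_customers.kjb -level=Basic -logfile=export_customers.log
```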
In our case, we have a very simple example with a job and a transformation to support our recipes' experiments. The transformation reads data from a text file that contains a set of customers by country. After the data is loaded, the transformation filters the data flow by country and writes the result to an Excel file. The filter is driven by a parameter that you set when you start the job and that the filter step uses. The job checks whether the output file from a previous run exists; if so, it deletes it and then calls the transformation for a new extraction. The job also has some failure paths to manage any error condition that could occur while the tasks are processed. The failure paths terminate in a step that aborts the job, marking it as failed.
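Putting this together, the sample job could be launched with Kitchen, passing the country to filter on as a named parameter. This is a sketch only: the job file name filter_customers.kjb and the parameter name COUNTRY are hypothetical stand-ins for whatever the sample material actually uses:

```bash
# Launch the job, passing the country filter as a named parameter.
./kitchen.sh -file=filter_customers.kjb -param:COUNTRY=Italy -level=Basic

# Kitchen exits with 0 on success; the failure paths that abort the job
# surface here as a non-zero exit code.
if [ $? -ne 0 ]; then
  echo "Job failed; check the log for details."
fi
```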