- Pentaho Data Integration Beginner's Guide (Second Edition)
- María Carina Roldán
- 2021-07-23 15:46:54
Time for action – reading results of football matches from files
Suppose you have collected several football results in plain files. Your files look like this:
Date;Venue;Country;Matches;Country
07/09/12 15:00;Havana;Cuba;0:3;Honduras;
07/09/12 19:00;Kingston;Jamaica;2:1;USA;
07/09/12 19:30;San Salvador;El Salvador;2:2;Guyana;
07/09/12 19:45;Toronto;Canada;1:0;Panama;
07/09/12 20:00;Guatemala City;Guatemala;3:1;Antigua and Barbuda;
07/09/12 20:05;San Jose;Costa Rica;0:2;Mexico;
11/09/12 19:00;St. John's;Antigua and Barbuda;0:1;Guatemala;
11/09/12 19:30;San Pedro Sula;Honduras;1:0;Cuba;
11/09/12 20:00;Mexico City;Mexico;1:0;Costa Rica;
11/09/12 20:00;Georgetown;Guyana;2:3;El Salvador;
11/09/12 20:05;Panama City;Panama;2:0;Canada;
11/09/12 20:11;Columbus;USA;1:0;Jamaica;
-- qualifying for the finals in Brazil 2014 --
-- USA, September
You have not one but many files, all in the same format. You now want to consolidate all the information into a single file. Let's begin by reading the files:
- Create a folder named pdi_files. Inside it, create the subfolders input and output.
- Use any text editor to type the file shown, and save it under the name usa_201209.txt in the folder named input that you just created. Or you can use the file available in the downloadable code.
- Start Spoon.
- From the main menu navigate to File | New | Transformation.
- Expand the Input branch of the Steps tree.
- Drag the Text file input icon onto the canvas.
- Double-click on the Text file input icon, and give the step a name.
- Click on the Browse... button, and search for the file usa_201209.txt.
- Select the file. The textbox File or directory will be temporarily populated with the full path of the file, for example, C:\pdi_files\input\usa_201209.txt.
- Click on the Add button. The full text will be moved from the File or directory textbox to the grid. The configuration window should appear as follows:
- Click on the Content tab, and fill it in, as shown in the following screenshot:
- Click on the Fields tab.
- Click on the Get Fields button. The screen should look like the following screenshot:
Note
By default, Kettle assumes DOS format for the file. If you created the file on a UNIX machine, you will be warned that the DOS format for the file was not found. If that's the case, you can change the format in the Content tab.
- In the small window that proposes a number of sample lines, click on Cancel. You will see that the grid was filled with the list of fields found in your file, all of the type String.
- Click on the Preview rows button, and then click on the OK button. The previewed data should look like the following screenshot:
Note
The second field named Country was renamed as Country_1. This is because there cannot be two Kettle fields with the same name.
- Now it's time to enhance the definitions a bit. Rename the columns as: match_date, venue, home_team, results, and away_team. You can rename the columns just by overwriting the default values in the grid.
- Change the definition of the match_date field. As Type select Date, and as Format type dd/MM/yy HH:mm.
. - Run a new preview. You should see the same data, but with the columns renamed. Also the type of the first column is different. This is not obvious by looking at the screen but you can confirm the type by moving the mouse cursor over the column as you learned to do in the previous chapter.
- Close the window.
- Now expand the Transform branch of steps and drag to the canvas a Select values step.
- Create a hop from the Text file input step to the Select values step.
- Double-click on the Select values step, and use it to remove the venue field. Recall that you do this by selecting or typing the field name in the Remove tab.
- Click on OK.
- Now add a Dummy (do nothing) step. You will find it in the Flow branch of steps.
- Create a hop from the Select values step to the Dummy (do nothing) step. Your transformation should look like the following screenshot:
- Configure the transformation by pressing Ctrl + T (or Cmd + T on Mac), and giving the transformation a name and a description.
- Save the transformation by pressing Ctrl + S (or Cmd + S on Mac).
- Select the Dummy (do nothing) step.
- Click on the Preview button located in the transformation toolbar.
- Click on the Quick Launch button. The following window appears, showing the final data:
What just happened?
You created a very simple transformation that read a single file with the results of football matches.
By using a Text file input step, you told Kettle the full path of your file, along with the characteristics of the file, so that Kettle was able to read it correctly. In particular, you configured the Content tab to specify that the file had a header and a two-row footer (rows that should not be read as data). As the separator you left the default value (;), but if your file had used another separator, you could have changed the separator character in this tab. Finally, you defined the name and type of the different fields.
After that, you used a Select values step to remove unwanted fields. A Dummy (do nothing) step was used simply as the destination of the data. You used this step to run a preview and see the final results.
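To make the effect of these settings concrete, here is a minimal Python sketch of what the transformation does to the data. It is only an analogy, not how Kettle works internally: the function name `read_matches` and its parameters are hypothetical, the sample is a shortened copy of the file shown earlier, and the Kettle date mask dd/MM/yy HH:mm is assumed to correspond to the Python format `%d/%m/%y %H:%M`.

```python
import csv
from datetime import datetime

# Shortened copy of usa_201209.txt: one header row, two data rows, two footer rows.
SAMPLE = """Date;Venue;Country;Matches;Country
07/09/12 15:00;Havana;Cuba;0:3;Honduras;
07/09/12 19:00;Kingston;Jamaica;2:1;USA;
-- qualifying for the finals in Brazil 2014 --
-- USA, September
"""

# Field names typed into the grid in the exercise.
FIELD_NAMES = ["match_date", "venue", "home_team", "results", "away_team"]


def read_matches(text, header_rows=1, footer_rows=2, separator=";"):
    """Hypothetical helper mirroring the Content and Fields settings."""
    lines = text.splitlines()
    # Content tab: skip the header and the footer rows.
    data_lines = lines[header_rows:len(lines) - footer_rows]
    rows = []
    for values in csv.reader(data_lines, delimiter=separator):
        # Each line ends with a separator, so there is a trailing empty
        # value; zip() stops at the last field name and drops it.
        row = dict(zip(FIELD_NAMES, values))
        # Fields tab: match_date is a Date with the mask dd/MM/yy HH:mm.
        row["match_date"] = datetime.strptime(row["match_date"], "%d/%m/%y %H:%M")
        # Select values step: remove the venue field.
        del row["venue"]
        rows.append(row)
    return rows


matches = read_matches(SAMPLE)
for match in matches:
    print(match)
```

The Dummy (do nothing) step has no counterpart here; the final loop simply plays the role of the preview.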
Input files
Files are one of the most used input sources. PDI can take data from several types of files, with almost no limitations.
When you have a file to work with, the first thing you have to do is to specify where the file is, what it looks like, and which kind of values it contains. That is exactly what you did in the first section of this chapter.
With the information you provide, Kettle can create the dataset to work with in the current transformation.
Input steps
There are several steps which allow you to take a file as the input data. All those steps are under the Input step category; Text file input, Fixed file input, and Microsoft Excel Input are some of them.
Despite the obvious differences that exist between these types of files, the way to configure the steps has much in common. These are the main properties you have to specify for an input step:
- Name of the step: It is mandatory and must be different for every step in the transformation.
- Name and location of the file: These must be specified, of course. The file does not have to exist when you create the transformation, but it is desirable that it does.
- Content type: This data includes delimiter character, type of encoding, whether a header is present or not, and so on. The list depends on the kind of file chosen. In each case, Kettle proposes default values, so you don't have to enter too much data.
- Fields: Kettle can get the definitions automatically when you click on the Get Fields button. However, the data types, sizes, or formats guessed by Kettle are not always the expected ones. So, after getting the fields, you may change whatever you consider appropriate.
- Filtering: Some steps allow you to filter the data, skip blank rows, read only the first N rows, and so on.
After configuring an input step, you can preview the data just as you did, by clicking on the Preview rows button. This is useful to discover if there is something wrong in the configuration. In that case, you can make the adjustments and preview again, until your data looks fine.
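The renaming you saw in the exercise, where the second Country field became Country_1, can be imitated with a few lines of Python. This is a sketch of the observed behavior only: the function `unique_field_names` is hypothetical, and the suffix used for a third or later duplicate (`_2`, `_3`, and so on) is an assumption, since the exercise only shows the first one.

```python
def unique_field_names(header_line, separator=";"):
    """Split a header row and make repeated names unique (hypothetical helper)."""
    seen = {}
    names = []
    for name in header_line.split(separator):
        count = seen.get(name, 0)
        # First occurrence keeps its name; later ones get a numeric suffix.
        # Suffixes beyond _1 are an assumption, not confirmed by the exercise.
        names.append(name if count == 0 else f"{name}_{count}")
        seen[name] = count + 1
    return names


fields = unique_field_names("Date;Venue;Country;Matches;Country")
print(fields)  # ['Date', 'Venue', 'Country', 'Matches', 'Country_1']
```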
Note
To read CSV text files there is an alternative step: CSV file input. This step has a simpler but less flexible configuration; in exchange, it provides better performance. One of its advantages is an option named Lazy conversion. When checked, this flag prevents Kettle from performing unnecessary data type conversions, which speeds up the reading of files.
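The idea behind lazy conversion can be illustrated with a small Python analogy. It is not Kettle's implementation: the class `LazyField` and the helper `to_int` are hypothetical names, and the sketch only shows the general principle of keeping the raw text and converting it the first time the typed value is actually needed.

```python
class LazyField:
    """Keeps the raw text and converts it only on first access (illustrative)."""

    def __init__(self, raw, converter):
        self.raw = raw
        self._converter = converter
        self._value = None
        self._converted = False

    @property
    def value(self):
        if not self._converted:
            self._value = self._converter(self.raw)
            self._converted = True
        return self._value


calls = []  # records every conversion that really happens


def to_int(text):
    calls.append(text)
    return int(text)


goals = [LazyField(raw, to_int) for raw in ["0", "3", "2", "1"]]

# Passing the fields through untouched needs no conversion at all.
passed_through = [field.raw for field in goals]

# Only the one field whose typed value is requested gets converted.
first_needed = goals[1].value
print(first_needed, len(calls))
```

If most fields are simply copied from input to output, most conversions never happen, which is where the speed gain comes from.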
Reading several files at once
In the previous exercise, you used an input step to read one file. But suppose you have several files, all with the very same structure. That will not be a problem, because with Kettle it is possible to read more than one file at a time.
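As a rough picture of what reading several files with the same structure amounts to, the following Python sketch collects the rows of every matching file in a folder into one list. The function `read_all`, the `*.txt` pattern, and the second file name are hypothetical, and the glob pattern is a Python convenience rather than Kettle's own configuration. Both data rows are taken from the sample file shown earlier.

```python
import tempfile
from pathlib import Path

HEADER = "Date;Venue;Country;Matches;Country"
FOOTER = "-- qualifying for the finals in Brazil 2014 --\n-- USA, September"


def read_all(folder, pattern="*.txt", header_rows=1, footer_rows=2):
    """Read every matching file and return all data rows together (sketch)."""
    rows = []
    for path in sorted(Path(folder).glob(pattern)):
        lines = path.read_text(encoding="utf-8").splitlines()
        # Same layout in every file: skip the header and the footer rows.
        for line in lines[header_rows:len(lines) - footer_rows]:
            rows.append(line.rstrip(";").split(";"))
    return rows


with tempfile.TemporaryDirectory() as folder:
    # Two small files with the same structure; the second name is made up.
    samples = {
        "usa_201209.txt": "07/09/12 19:00;Kingston;Jamaica;2:1;USA;",
        "usa_201209_b.txt": "11/09/12 20:11;Columbus;USA;1:0;Jamaica;",
    }
    for name, data_row in samples.items():
        content = "\n".join([HEADER, data_row, FOOTER]) + "\n"
        Path(folder, name).write_text(content, encoding="utf-8")
    all_rows = read_all(folder)

for row in all_rows:
    print(row)
```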