
Time for action – reading results of football matches from files

Suppose you have collected several football results in plain files. Your files look like this:

Date;Venue;Country;Matches;Country
07/09/12 15:00;Havana;Cuba;0:3;Honduras;
07/09/12 19:00;Kingston;Jamaica;2:1;USA;
07/09/12 19:30;San Salvador;El Salvador;2:2;Guyana;
07/09/12 19:45;Toronto;Canada;1:0;Panama;
07/09/12 20:00;Guatemala City;Guatemala;3:1;Antigua and Barbuda;
07/09/12 20:05;San Jose;Costa Rica;0:2;Mexico;
11/09/12 19:00;St. John's;Antigua and Barbuda;0:1;Guatemala;
11/09/12 19:30;San Pedro Sula;Honduras;1:0;Cuba;
11/09/12 20:00;Mexico City;Mexico;1:0;Costa Rica;
11/09/12 20:00;Georgetown;Guyana;2:3;El Salvador;
11/09/12 20:05;Panama City;Panama;2:0;Canada;
11/09/12 20:11;Columbus;USA;1:0;Jamaica;
-- qualifying for the finals in Brazil 2014 --
-- USA, September

You don't have one, but many files, all with the same format. You now want to unify all the information in one single file. Let's begin by reading the files:

  1. Create a folder named pdi_files. Inside it, create the subfolders input and output.
  2. Use any text editor to type the file shown, and save it under the name usa_201209.txt in the folder named input that you just created. Or you can use the file available in the downloadable code.
  3. Start Spoon.
  4. From the main menu navigate to File | New | Transformation.
  5. Expand the Input branch of the Steps tree.
  6. Drag the Text file input icon and drop it onto the canvas.
  7. Double-click on the Text file input icon, and give the step a name.
  8. Click on the Browse... button, and search for the file usa_201209.txt.
  9. Select the file. The textbox File or directory will be temporarily populated with the full path of the file, for example, C:\pdi_files\input\usa_201209.txt.
  10. Click on the Add button. The full text will be moved from the File or directory textbox to the grid. The configuration window should appear as follows:
  11. Click on the Content tab, and fill it in, as shown in the following screenshot:
  12. Click on the Fields tab.
  13. Click on the Get Fields button. The screen should look like the following screenshot:
    Note

    By default, Kettle assumes DOS format for the file. If you created the file in a UNIX machine, you will be warned that the DOS format for the file was not found. If that's the case, you can change the format in the Content tab.

  14. In the small window that proposes a number of sample lines, click on Cancel. You will see that the grid was filled with the list of fields found in your file, all of the type String.
  15. Click on the Preview rows button, and then click on the OK button. The previewed data should look like the following screenshot:
    Note

    Note that the second field named Country was renamed as Country_1. This is because there cannot be two Kettle fields with the same name.

  16. Now it's time to enhance the definitions a bit. Rename the columns as: match_date, venue, home_team, results, and away_team. You can rename the columns just by overwriting the default values in the grid.
  17. Change the definition of the match_date field. As Type select Date, and as Format type dd/MM/yy HH:mm.
  18. Run a new preview. You should see the same data, but with the columns renamed. Also the type of the first column is different. This is not obvious by looking at the screen but you can confirm the type by moving the mouse cursor over the column as you learned to do in the previous chapter.
  19. Close the window.
  20. Now expand the Transform branch of steps and drag to the canvas a Select values step.
  21. Create a hop from the Text file input step to the Select values step.
  22. Double-click on the Select values step, and use it to remove the venue field. Recall that you do it by selecting or typing the field name in the Remove tab.
  23. Click on OK.
  24. Now add a Dummy (do nothing) step. You will find it in the Flow branch of steps.
  25. Create a hop from the Select values step to the Dummy (do nothing) step. Your transformation should look like the following screenshot:
  26. Configure the transformation by pressing Ctrl + T (or Command + T on Mac), and giving the transformation a name and a description.
  27. Save the transformation by pressing Ctrl + S (or Command + S on Mac).
  28. Select the Dummy (do nothing) step.
  29. Click on the Preview button located in the transformation toolbar.
  30. Click on the Quick Launch button. The following window appears, showing the final data:

What just happened?

You created a very simple transformation that read a single file with the results of football matches.

By using a Text file input step, you told Kettle the full path of your file, along with the characteristics of the file, so that Kettle could read it correctly. In particular, you configured the Content tab to specify that the file had a header and a footer made up of two rows (rows that should be ignored). As the separator you left the default value (;), but if your file had used another separator, you could have changed the separator character in this tab. Finally, you defined the name and type of the different fields.
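To make those settings concrete, here is a minimal Python sketch of the same reading logic. It is not Kettle code, only an illustration of what the step was configured to do. It assumes one header row and a two-row footer, as in the sample file, and it translates the dd/MM/yy HH:mm mask into the equivalent Python format string. The function name read_matches and the constant names are hypothetical.

```python
import csv
from datetime import datetime

# A shortened copy of the sample file: header, two data rows, two footer rows.
SAMPLE = """Date;Venue;Country;Matches;Country
07/09/12 15:00;Havana;Cuba;0:3;Honduras;
07/09/12 19:00;Kingston;Jamaica;2:1;USA;
-- qualifying for the finals in Brazil 2014 --
-- USA, September
"""

# The field names chosen in the Fields tab.
FIELD_NAMES = ["match_date", "venue", "home_team", "results", "away_team"]


def read_matches(text, header_rows=1, footer_rows=2, separator=";"):
    """Parse the data rows of the file, ignoring header and footer rows."""
    lines = text.splitlines()
    body = lines[header_rows:len(lines) - footer_rows]
    rows = []
    for values in csv.reader(body, delimiter=separator):
        # Each data line ends with a separator, so the last value is empty;
        # zip() stops at the last field name and drops it.
        row = dict(zip(FIELD_NAMES, values))
        # dd/MM/yy HH:mm corresponds to %d/%m/%y %H:%M in Python.
        row["match_date"] = datetime.strptime(row["match_date"], "%d/%m/%y %H:%M")
        rows.append(row)
    return rows


matches = read_matches(SAMPLE)
for match in matches:
    print(match["match_date"], match["home_team"], match["results"], match["away_team"])
```

As in the step, every field except match_date stays a string; only the field whose type was changed to Date is converted.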

After that, you used a Select values step to remove unwanted fields. A Dummy (do nothing) step was used simply as the destination of the data. You used this step to run a preview and see the final results.
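The flow of rows through those two steps can be pictured with a small Python sketch. This is only an analogy, assuming each step is modeled as a generator that receives rows and passes them on; the function names select_values and dummy are hypothetical and do not come from Kettle.

```python
def select_values(rows, remove):
    """Pass every row on without the fields listed in remove."""
    unwanted = set(remove)
    for row in rows:
        yield {name: value for name, value in row.items() if name not in unwanted}


def dummy(rows):
    """Do nothing: hand each row over unchanged, so it can be inspected."""
    for row in rows:
        yield row


# Two rows taken from the sample file, using the renamed fields.
source = [
    {"match_date": "07/09/12 15:00", "venue": "Havana", "home_team": "Cuba",
     "results": "0:3", "away_team": "Honduras"},
    {"match_date": "07/09/12 19:00", "venue": "Kingston", "home_team": "Jamaica",
     "results": "2:1", "away_team": "USA"},
]

# Chaining the generators plays the role of the hops between the steps.
final_rows = list(dummy(select_values(source, remove=["venue"])))
for row in final_rows:
    print(row)
```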

Input files

Files are one of the most used input sources. PDI can take data from several types of files, with almost no limitations.

When you have a file to work with, the first thing you have to do is to specify where the file is, what it looks like, and which kind of values it contains. That is exactly what you did in the first section of this chapter.

With the information you provide, Kettle can create the dataset to work with in the current transformation.

Input steps

There are several steps that allow you to take a file as the input data. All those steps are under the Input step category: Text file input, Fixed file input, and Microsoft Excel Input are some of them.

Despite the obvious differences that exist between these types of files, the way to configure the steps has much in common. These are the main properties you have to specify for an input step:

  • Name of the step: It is mandatory and must be different for every step in the transformation.
  • Name and location of the file: These must be specified, of course. The file does not have to exist when you create the transformation, but it is desirable that it does.
  • Content type: This data includes delimiter character, type of encoding, whether a header is present or not, and so on. The list depends on the kind of file chosen. In each case, Kettle proposes default values, so you don't have to enter too much data.
  • Fields: Kettle can get the definitions automatically when you click on the Get Fields button. However, the data types, sizes, or formats guessed by Kettle are not always the expected ones. So, after getting the fields, you may change whatever you consider appropriate.
  • Filtering: Some steps allow you to filter the data, skip blank rows, read only the first N rows, and so on.

After configuring an input step, you can preview the data just as you did, by clicking on the Preview rows button. This is useful to discover if there is something wrong in the configuration. In that case, you can make the adjustments and preview again, until your data looks fine.
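The reason guessed definitions may need adjusting becomes clearer with a small sketch of how a type could be inferred from sample values. The following Python function is purely illustrative and is not the algorithm Kettle uses; the name guess_type and the order of the checks are assumptions made for this example.

```python
from datetime import datetime


def guess_type(samples, date_format="%d/%m/%y %H:%M"):
    """Return a type name that fits every sample value, or String otherwise."""

    def all_parse(parse):
        # True only if parse() accepts every sample without raising ValueError.
        try:
            for sample in samples:
                parse(sample)
        except ValueError:
            return False
        return True

    if not samples:
        return "String"  # nothing to inspect, so stay with the safest type
    if all_parse(int):
        return "Integer"
    if all_parse(float):
        return "Number"
    if all_parse(lambda s: datetime.strptime(s, date_format)):
        return "Date"
    return "String"


print(guess_type(["07/09/12 15:00", "11/09/12 20:11"]))
print(guess_type(["0:3", "2:1"]))
print(guess_type(["Havana", "Kingston"]))
```

A guess like this depends entirely on the samples inspected and on the date mask assumed, which is why previewing the data and correcting the definitions by hand remains part of the routine.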

Note

In order to read CSV text files there is an alternative step: CSV file input. This step has a simpler but less flexible configuration; in exchange, it provides better performance. One of its advantages is the presence of an option named Lazy conversion. When checked, this flag prevents Kettle from performing unnecessary data type conversions, increasing the speed for reading files.
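The idea behind avoiding unnecessary conversions can be shown with a short Python sketch. It is a conceptual illustration only and says nothing about how Kettle implements the option; the class LazyValue and the counter are hypothetical names introduced for this example.

```python
class LazyValue:
    """Keep the raw bytes read from the file and convert them only on demand."""

    conversions = 0  # counts how many real conversions were performed

    def __init__(self, raw, converter):
        self.raw = raw
        self._converter = converter
        self._converted = False
        self._value = None

    @property
    def value(self):
        # Convert on first access, then reuse the cached result.
        if not self._converted:
            self._value = self._converter(self.raw.decode("utf-8"))
            self._converted = True
            LazyValue.conversions += 1
        return self._value


# Reading a row creates the values but converts nothing yet.
row = [LazyValue(b"Havana", str), LazyValue(b"3", int), LazyValue(b"0", int)]
print(LazyValue.conversions)

# Only the field that is actually used gets converted.
goals = row[1].value
print(goals, LazyValue.conversions)
```

Fields that are simply passed along to an output never pay the conversion cost, which is where the speed gain would come from.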

Reading several files at once

In the previous exercise, you used an input step to read one file. But suppose you have several files, all with the very same structure. That will not be a problem, because with Kettle it is possible to read more than one file at a time.
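Outside Kettle, the same idea can be sketched in Python as follows. This is not how the step is configured in Spoon; it only illustrates why a shared structure lets many files be treated as a single source. The function name read_all_files and the file names part_a.txt and part_b.txt are hypothetical, and the rows are taken from the sample file shown earlier.

```python
import glob
import os
import tempfile

HEADER = "Date;Venue;Country;Matches;Country"
FOOTER = "-- qualifying for the finals in Brazil 2014 --\n-- USA, September"


def read_all_files(folder, pattern="*.txt", header_rows=1, footer_rows=2):
    """Read every matching file and return the data rows as one list."""
    rows = []
    for path in sorted(glob.glob(os.path.join(folder, pattern))):
        with open(path, encoding="utf-8") as handle:
            lines = handle.read().splitlines()
        # Same layout in every file: skip the header and the footer rows.
        for line in lines[header_rows:len(lines) - footer_rows]:
            rows.append(line.rstrip(";").split(";"))
    return rows


# Build two small files with the same structure in a temporary folder.
with tempfile.TemporaryDirectory() as folder:
    contents = {
        "part_a.txt": "07/09/12 15:00;Havana;Cuba;0:3;Honduras;",
        "part_b.txt": "11/09/12 20:11;Columbus;USA;1:0;Jamaica;",
    }
    for name, data_line in contents.items():
        with open(os.path.join(folder, name), "w", encoding="utf-8") as handle:
            handle.write(HEADER + "\n" + data_line + "\n" + FOOTER + "\n")
    all_rows = read_all_files(folder)

for row in all_rows:
    print(row)
```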
