Title: Pentaho Data Integration Beginner's Guide (Second Edition)
Author: María Carina Roldán
Updated: 2021-07-23 15:46:54

Time for action – reading all your files at a time using a single text file input step and regular expressions

You can do the same thing that you did previously by using a different notation. Follow these instructions:

1. Open the transformation that reads several files and double-click on the Text file input step.
2. Delete the lines with the names of the files.
3. In the first row of the grid, under the File/Directory column, type the full path of the input folder, for example, C:\pdi_files\input.
4. Under the Wildcard (RegExp) column, type (usa|europe)_[0-9]{6}\.txt.
5. Click on the Show filename(s)... button. You will see the list of files that match the expression.
6. Close the tiny window and click on Preview rows to confirm that the rows shown belong to the files that match the expression you typed.

What just happened?

In this case, all the filenames follow a pattern: usa_201209.txt, usa_201210.txt, and so on. So, to specify the names of the files, you used a regular expression. In the File/Directory column you put the static part of the location, that is, the path of the input folder, while in the Wildcard (RegExp) column you put the regular expression that a filename must match to be considered: the name of the region, which should be either usa or europe, followed by an underscore, the six digits representing the period, and then the extension .txt. All the files whose names matched the expression were then taken as input files.
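The selection described above can be reproduced outside Kettle with plain Java regular expressions. The following is only a minimal sketch: the class name WildcardFilterSketch, the method select, and the folder content are hypothetical, and it assumes that a name is accepted only when the whole filename matches the expression.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class WildcardFilterSketch {
    // The expression typed in the Wildcard (RegExp) column.
    static final Pattern WILDCARD = Pattern.compile("(usa|europe)_[0-9]{6}\\.txt");

    // Keeps only the names that match the expression from start to end
    // (assumption: this mirrors how the step filters the folder content).
    static List<String> select(List<String> fileNames) {
        return fileNames.stream()
                .filter(name -> WILDCARD.matcher(name).matches())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical folder content, made up for this illustration.
        List<String> folder = List.of(
                "usa_201209.txt", "usa_201210.txt",
                "europe_201209.txt", "europe_201210.txt",
                "asia_201209.txt", "usa_2012.txt", "usa_201209.csv");
        // Prints the two usa_ and the two europe_ names with six digits and a .txt extension.
        select(folder).forEach(System.out::println);
    }
}
```

The last three names are rejected because the region is not listed, the period has only four digits, or the extension is not .txt.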

Regular expressions

There are many places inside Kettle where you may have to provide a regular expression. Regular expressions are much more powerful than the familiar wildcards ? and *.

In the following table you have some examples of regular expressions you may use to specify filenames:

Note

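Please note that the * wildcard does not work the same as it does on the command line. In a regular expression, * means zero or more repetitions of the preceding element, so if you want to match any sequence of characters, the * has to be preceded by a dot (.*).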
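The difference is easy to see with the Java regular expression classes. This is a small sketch, not Kettle code: the class StarWildcardSketch and its helper methods are hypothetical names introduced for the illustration.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

final class StarWildcardSketch {
    // Returns true when the whole name matches the expression.
    static boolean matches(String regex, String name) {
        return Pattern.compile(regex).matcher(name).matches();
    }

    // Returns true when the text is not a valid regular expression.
    static boolean isInvalid(String regex) {
        try {
            Pattern.compile(regex);
            return false;
        } catch (PatternSyntaxException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        String name = "usa_201209.txt";
        // Dot followed by star: any run of characters. Prints true.
        System.out.println(matches("usa.*\\.txt", name));
        // Star alone only repeats the preceding 'a'. Prints false.
        System.out.println(matches("usa*.txt", name));
        // A leading star has nothing to repeat, so the expression is rejected. Prints true.
        System.out.println(isInvalid("*.txt"));
    }
}
```

A command-line pattern such as *.txt is therefore not even a valid regular expression; the equivalent expression is .*\.txt.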

Here you have some useful links in case you want to know more about regular expressions:

Read about regular expressions at http://www.regular-expressions.info/quickstart.html
Read the Java Regular Expression tutorial at http://java.sun.com/docs/books/tutorial/essential/regex/
Read about Java Regular Expression pattern syntax at http://java.sun.com/javase/7/docs/api/java/util/regex/Pattern.html

Troubleshooting reading files

Despite the simplicity of reading files with PDI, obstacles and errors appear. Many times the solution is simple, but difficult to find if you are new to PDI. The following table gives you a list of common problems and possible solutions to take into account while reading and previewing a file:

Have a go hero – exploring your own files

Try to read your own text files from Kettle. You should have several files with different kinds of data, different separators, and with or without a header or footer. You can also search for files on the Internet; there are plenty of files there to download and play with. After configuring the input step, do a preview. If the data is not shown properly, fix the configuration and preview again until you are sure that the data is read as expected. If you have trouble reading the files, please refer to the section Troubleshooting reading files for diagnosis and possible ways to solve the problems.

Pop quiz – providing a list of text files using regular expressions

Q1. In the previous exercise you read four different files by using a single regular expression: (usa|europe)_[0-9]{6}\.txt. Which of the following options is equivalent to that one? In other words, which of the following can be used to read the same set of files? You can choose more than one option.

Replacing that regular expression with this one: (usa|europe)_[0-9][0-9][0-9][0-9][0-9][0-9]\.txt.
Filling the grid with two lines: one with the regular expression usa_[0-9]{6}\.txt and a second line with this expression: europe_[0-9]{6}\.txt.
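If you want to check your answer empirically, you can compare what each option selects from a sample of names. The sketch below is only an aid under stated assumptions: the class QuizCheckSketch, the method select, and the sample names are hypothetical, a name is assumed to be selected when any row of the grid matches it entirely, and agreement on a sample is evidence, not a proof of equivalence.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class QuizCheckSketch {
    // Names selected by a grid: a name is kept when the expression
    // in at least one row matches it from start to end.
    static List<String> select(List<String> gridRows, List<String> fileNames) {
        List<Pattern> patterns = gridRows.stream()
                .map(Pattern::compile)
                .collect(Collectors.toList());
        return fileNames.stream()
                .filter(name -> patterns.stream().anyMatch(p -> p.matcher(name).matches()))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical sample; add your own edge cases here.
        List<String> sample = List.of(
                "usa_201209.txt", "usa_201210.txt",
                "europe_201209.txt", "europe_201210.txt",
                "usa_20120.txt", "europe_2012100.txt", "asia_201209.txt");

        List<String> original = select(List.of("(usa|europe)_[0-9]{6}\\.txt"), sample);
        List<String> firstOption = select(
                List.of("(usa|europe)_[0-9][0-9][0-9][0-9][0-9][0-9]\\.txt"), sample);
        List<String> secondOption = select(
                List.of("usa_[0-9]{6}\\.txt", "europe_[0-9]{6}\\.txt"), sample);

        System.out.println("First option selects the same names: " + original.equals(firstOption));
        System.out.println("Second option selects the same names: " + original.equals(secondOption));
    }
}
```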

Q2. Try reproducing the previous sections using a CSV file input step instead of a Text file input step. Identify whether the following statements are true or false:

There is no difference in using a Text file input step or a CSV file input step.
It is not possible to read the sample files with a CSV file input.
It is not possible to read more than one file at a time with a CSV file input.
It is not possible to specify a regular expression for reading files with a CSV file input.

Have a go hero – measuring the performance of input steps

The previous Pop quiz was not the best advertisement for the CSV file input step! Let's change that reputation by doing some tests.

In the material that you can download from the book's website there is a transformation that generates a text file with 10 million rows of dummy data.

Run the transformation for generating that file (you can even modify the transformation to add new fields or generate more data).

Create three different transformations for reading the file:

With a Text file input step.
With a CSV file input step. Uncheck the Lazy conversion flag, which is on by default.
With a CSV file input step, making sure that the Lazy conversion option is on.

Run one transformation at a time and take note of the metrics. No matter how slow or fast your computer is, you should notice that the CSV file input step performs better than the Text file input step, and even better when using the Lazy conversion option.
