- Pentaho Data Integration Beginner's Guide(Second Edition)
- María Carina Roldán
Time for action – reading all your files at a time using a single text file input step and regular expressions
You can do the same thing that you did previously by using a different notation. Follow these instructions:
- Open the transformation that reads several files and double-click on the Input step.
- Delete the lines with the names of the files.
- In the first row of the grid, under the File/Directory column, type the full path of the input folder, for example C:\pdi_files\input.
- Under the Wildcard (RegExp) column, type (usa|europe)_[0-9]{6}\.txt.
- Click on the Show filename(s)... button. You will see the list of files that match the expression.
- Close the tiny window and click on Preview rows to confirm that the rows shown belong to the files that match the expression you typed.
What just happened?
In this case, all the filenames follow a pattern: usa_201209.txt, usa_201210.txt, and so on. So, in order to specify the names of the files, you used a regular expression. In the File/Directory column you put the static part of the names, that is, the path of the input folder, while in the Wildcard (RegExp) column you put the regular expression with the pattern that a file must follow to be considered: the name of the region, which should be either usa or europe, followed by an underscore and the six digits representing the period, and then the extension .txt. Then, all files that matched the expression were considered as input files.
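To make the matching rule concrete, here is a minimal Python sketch of the kind of selection described above. It is an illustration only, not Kettle's own code: Kettle evaluates the expression with Java's regular expression engine, and the sketch assumes that the whole filename has to match the pattern. The europe files and the rejected names in the list are hypothetical.

```python
import re

# The expression typed in the Wildcard (RegExp) column.
PATTERN = re.compile(r"(usa|europe)_[0-9]{6}\.txt")

def select_files(names):
    """Return the names that match the pattern from start to end."""
    return [name for name in names if PATTERN.fullmatch(name)]

# Hypothetical folder content; only usa_201209.txt and usa_201210.txt
# are named in the text.
candidates = [
    "usa_201209.txt",
    "usa_201210.txt",
    "europe_201209.txt",
    "europe_201210.txt",
    "asia_201209.txt",      # region not listed in the expression
    "usa_2012.txt",         # only four digits instead of six
    "usa_201209.txt.bak",   # extra characters after the extension
]

print(select_files(candidates))
```

Under these assumptions only the first four names are selected. Note the backslash before the dot: without it, the dot would stand for any character instead of a literal dot.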
Regular expressions
There are many places inside Kettle where you may have to provide a regular expression. A regular expression is much more than specifying the known wildcards ? and *.
In the following table you have some examples of regular expressions you may use to specify filenames:

Note
Please note that the * wildcard does not work the same as it does on the command line. In a regular expression, * repeats the element that precedes it. If you want to match any sequence of characters, the * has to be preceded by a dot (.*).
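The difference can be seen in a short Python sketch that compares command-line style matching with regular expression matching. This is a hedged illustration: it uses Python's fnmatch and re modules rather than Kettle itself, and the sample names are hypothetical.

```python
import fnmatch
import re

# Hypothetical names used only to show the difference.
names = ["usa_201209.txt", "usa", "us", "usaaa"]

# Command-line style: * stands for any run of characters.
glob_hits = [n for n in names if fnmatch.fnmatchcase(n, "usa*")]

# Regular expression: * repeats the preceding element, here the letter a.
star_hits = [n for n in names if re.fullmatch(r"usa*", n)]

# Regular expression: . is any character, so .* is any run of characters.
dot_star_hits = [n for n in names if re.fullmatch(r"usa.*", n)]

print("command-line usa*:", glob_hits)
print("regex usa*:       ", star_hits)
print("regex usa.*:      ", dot_star_hits)
```

With these names, the regular expression usa* accepts us, usa, and usaaa but rejects usa_201209.txt, while usa.* selects the same names as the command-line wildcard.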
Here you have some useful links in case you want to know more about regular expressions:
- Read about regular expressions at http://www.regular-expressions.info/quickstart.html
- Read the Java Regular Expression tutorial at http://java.sun.com/docs/books/tutorial/essential/regex/
- Read about Java Regular Expression pattern syntax at http://java.sun.com/javase/7/docs/api/java/util/regex/Pattern.html
Troubleshooting reading files
Despite the simplicity of reading files with PDI, obstacles and errors appear. Many times the solution is simple, but difficult to find if you are new to PDI. The following table gives you a list of common problems and possible solutions to take into account while reading and previewing a file:

Have a go hero – exploring your own files
Try to read your own text files from Kettle. You should have several files with different kinds of data, different separators, and with or without a header or footer. You can also search for files on the Internet; there are plenty of files available to download and play with. After configuring the input step, do a preview. If the data is not shown properly, fix the configuration and preview again until you are sure that the data is read as expected. If you have trouble reading the files, please refer to the section Troubleshooting reading files for diagnosis and possible ways to solve the problems.
Pop quiz – providing a list of text files using regular expressions
Q1. In the previous exercise you read four different files by using a single regular expression: (usa|europe)_[0-9]{6}\.txt. Which of the following options is equivalent to that one? In other words, which of the following serves for reading the same set of files? You can choose more than one option.
- Replacing that regular expression with this one: (usa|europe)_[0-9][0-9][0-9][0-9][0-9][0-9]\.txt.
- Filling the grid with two lines: one with the regular expression usa_[0-9]{6}\.txt and a second line with this expression: europe_[0-9]{6}\.txt.
Q2. Try reproducing the previous sections using a CSV file input step instead of a Text file input step. Identify whether the following statements are true or false:
- There is no difference in using a Text file input step or a CSV file input step.
- It is not possible to read the sample files with a CSV file input.
- It is not possible to read more than one file at a time with a CSV file input.
- It is not possible to specify a regular expression for reading files with a CSV file input.
Have a go hero – measuring the performance of input steps
The previous Pop quiz was not the best propaganda for the CSV file input step! Let's change that reputation by doing some tests.
In the material that you can download from the book's website there is a transformation that generates a text file with 10 million rows of dummy data.
Run the transformation for generating that file (you can even modify the transformation to add new fields or generate more data).
Create three different transformations for reading the file:
- With a Text file input step.
- With a CSV file input step. Uncheck the Lazy conversion flag, which is on by default.
- With a CSV file input step, making sure that the Lazy conversion option is on.
Run one transformation at a time and take note of the metrics. No matter how slow or fast your computer is, you should note that the CSV file input step performs better than the Text file input step, and even better when using the Lazy conversion option.