- Pentaho Data Integration Beginner's Guide(Second Edition)
- María Carina Roldán
- 683 words
- 2021-07-23 15:46:58
Time for action – counting frequent words by filtering
On this occasion, you have some plain text files, and you want to know what is said in them. You don't want to read them, so you decide to count the times that the words appear in the text, and see the most frequent ones to get an idea of what the files are about. The first of our two tutorials on filtering is about counting the words in the file.
Note
Before starting, you'll need at least one text file to play with. The text file used in this tutorial is named smcng10.txt, and is available for you to download from Packt Publishing's website, www.packtpub.com.
Let's work.
Tip
This section and the following sections have many steps, so feel free to preview the data from time to time. That way, you can make sure that you are on track, and understand what filtering is about, as you progress in the design of your transformation.
- Create a new transformation.
- By using a Text file input step, read your file. The trick here is to set as the Separator a character you don't expect to find in the file, such as |. By doing so, each whole line will be recognized as a single field. Configure the Fields tab by defining a single String field named line.
- This particular file has a big header describing its content and origin. We are not interested in those lines, so in the Content tab, as Header, type 378, which is the number of lines that precede the specific content we're interested in.
- From the Transform category of steps, drag to the canvas a Split field to rows step, and create a hop from the Text file input step to this one.
- Configure the step as follows:
- With this last step selected, do a preview. Your preview window should look as follows:
- Close the preview window.
- Add a Select values step to remove the line field.
Note
It's not mandatory to remove this field, but as it will not be used any longer, removing it will make future previews clearer.
- Expand the Flow category of steps, and drag a Filter rows step to the work area.
- Create a hop from the last step to the Filter rows step.
- Edit the Filter rows step by double-clicking on it.
- Click on the <field> textbox to the left of the = sign. The list of fields appears. Select word.
- Click on the = sign. A list of operations appears. Select IS NOT NULL.
- The window looks like the following screenshot:
- Click on OK.
- From the Transform category of steps, drag a Sort rows step to the canvas.
- Create a hop from the Filter rows step to the Sort rows step. When asked for the kind of hop, select Main output of step, as shown in the following screenshot:
- Use the last step to sort the rows by word (ascending).
- From the Statistics category, drag-and-drop a Group by step on the canvas, and add it to the stream, after the Sort rows step.
- Configure the grids in the Group by configuration window, as shown in the following screenshot:
- With the Group by step selected, do a preview. You will see this:
What just happened?
You read a regular plain file, and counted the words appearing in it.
The first thing you did was read the plain file, and split the lines so that every word became a new row in the dataset. For example, as a consequence of splitting the line:
subsidence; comparison with the Portillo chain.
The following rows were generated:

Thus, a new field named word became the basis for your transformation, and therefore you removed the line field.
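If it helps to see the splitting outside of Spoon, the following Python sketch imitates what a Split field to rows step does with the line field. It is only an illustration, not Kettle code: the function name split_field_to_rows is hypothetical, and the single-space delimiter is an assumption, because the step's actual configuration is shown only in a screenshot.

```python
def split_field_to_rows(lines, delimiter=" "):
    """Yield one row per piece of each line, keeping the original line field.

    The delimiter is an assumption (a single space); empty pieces are
    emitted as None to stand in for null values.
    """
    for line in lines:
        for piece in line.split(delimiter):
            yield {"line": line, "word": piece if piece != "" else None}


sample = ["subsidence; comparison with the Portillo chain."]
rows = list(split_field_to_rows(sample))
words = [row["word"] for row in rows]
print(words)
```

Under that assumption, one input row becomes several output rows, each carrying the original line plus a new word field. An empty line produces a single row whose word is None.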
Next, you discarded the rows with null words. You did it by using a filter with the condition word IS NOT NULL.
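As a rough analogy, the condition word IS NOT NULL can be pictured as a predicate applied to every row. The Python sketch below is a simplified imitation, not the step's implementation: the filter_rows helper is hypothetical, rows are modeled as dictionaries, and None plays the role of null.

```python
def filter_rows(rows, field="word"):
    """Split rows into those where the field is not null and the rest."""
    kept, discarded = [], []
    for row in rows:
        # Mirrors the condition "word IS NOT NULL"
        if row.get(field) is not None:
            kept.append(row)
        else:
            discarded.append(row)
    return kept, discarded


incoming = [{"word": "chain."}, {"word": None}, {"word": "the"}]
kept, discarded = filter_rows(incoming)
print(kept)
```

Only the rows in kept would travel on to the next step. In this tutorial the discarded rows are simply not used.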
Then, you counted the words by using the Group by step you learned in the previous tutorial. Doing it this way, you got a preliminary list of the words in the file, and the number of occurrences of each word.
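The sort-then-group part of the transformation can be sketched in Python as well. This is an analogy under stated assumptions rather than what Kettle runs internally: count_words is a hypothetical helper, and itertools.groupby is used because, like the Group by step, it only groups equal values that arrive next to each other, which is why the rows are sorted first.

```python
from itertools import groupby


def count_words(words):
    """Return (word, occurrences) pairs, sorted by word in ascending order."""
    sorted_words = sorted(words)  # plays the role of the Sort rows step
    # groupby only merges adjacent equal values, hence the sort above
    return [(word, sum(1 for _ in group)) for word, group in groupby(sorted_words)]


counts = count_words(["the", "chain", "the", "Portillo", "the"])
print(counts)
```

Note that the comparison here is case-sensitive and leaves punctuation attached to the words, so the result is only a preliminary list, just like the one you obtained in the preview.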