- Pentaho Data Integration Beginner's Guide(Second Edition)
- María Carina Roldán
- 1037字
- 2021-07-23 15:46:57
Time for action – sorting information about matches with the Sort rows step
In Chapter 3, Manipulating Real-world Data, you created a file with information about football matches. The following lines of code show a variant of that file. The information is the same, and only the structure of the data has changed:
region;match_date;type;team;goals europe;07-09;home;Iceland;2 europe;07-09;away;Norway;0 europe;07-09;home;Russia;2 europe;07-09;away;Northern Ireland;0 europe;07-09;home;Liechtenstein;1 europe;07-09;away;Bosnia-Herzegovina;8 europe;07-09;home;Wales;0 europe;07-09;away;Belgium;2 europe;07-09;home;Malta;0 europe;07-09;away;Armenia;1 ...
Now you want to see the same information, but sorted by date and team.
- First of all, download this sample file from the Packt Publishing website (www.packtpub.com/support). The name of the file is
matches.txt
. - Now create a transformation, give it a name and description, and save it in a folder of your choice.
- By using a Text file input step, read the file. Provide the name and location of the file, check the Content tab to see that everything matches your file, and fill in the Fields tab with the shown columns:
region
,match_date
,type
,team
,goals
.Note
If you will not use the
match_date
field for date operations (for example, adding dates), you don't have to define the first column as aDate
. The same is valid for thegoals
column: You only need to define the column asInteger
if you plan to do math with it. In the other case, aString
is enough. - Do a preview just to confirm that the step is well configured.
- Add a Select values step to select and reorder the columns as follows:
team
,type
,match_date
,goals
. - From the Transform category of steps add a Sort rows step, and create a link from the Select values step towards this new step.
- Double-click on the Sort rows step and configure it, as shown in the following screenshot:
- Click on OK.
- At the end, add a Dummy step. Your transformation should look like this:
- Save the transformation.
- Select the last step and do a preview. You should see this:
What just happened?
You read a file with a list of football match results and sorted the rows based on two columns: team
(ascendant) and type
(descendant). Using this method, your final data was ordered by team. Then, for any given team, the home
values were first, followed by the away
values. This is due to the ascending flag that is you set N
under the Ascending column.
Note that the Select values step is not mandatory here. We just used it for having the columns team
and type
to the left, so the sorted dataset was easy to read.
Tip
Sorting data
For small datasets, the sorting algorithm runs mainly using the JVM memory. When the number of rows exceeds the specified sort size, it works differently. Suppose that you put 5000
as the value of the Sort size field. Every 5000 rows, the process sorts them and writes them to a temporary file. When there are no more rows, it does a merge sort on all of those files and gives you back the sorted dataset. You can conclude that for huge datasets a lot of reading and writing operations are done on your disk, which slows down the whole transformation. Fortunately, you can change the number of rows in memory (one million by default) by setting a new value in the Sort size (rows in memory) textbox. The bigger this number, the faster the sorting process. In summary, the amount of memory allocated to the process will offset the speed gained, and as soon as the JVM has to start using swap space, the performance will degrade.
Note that a sort size that works in your system may not work in a machine with a different configuration. To avoid that risk you can use a different approach. In the Sort rows configuration window you can set a Free memory threshold (in %) value. The process begins to use temporary files when the percentage of available memory drops below the indicated threshold. The lower the percentage, the faster the process.
You cannot, however, just set a small free memory threshold and expect that everything runs fine
Note
As it is not possible to know the exact amount of free memory, it's not recommended to set a very small free memory threshold. You definitely should not use that option in complex transformations, or when there is more than one sort going on, as you could still run out of memory.
The final steps were added just to preview the result of the transformation. You could have previewed the data by selecting the Sort rows step. The idea, however, is that after this test, you can replace the Dummy step with any of the output steps you already know, or delete it and continue transforming the data.
Tip
You have used the Dummy step several times but still nothing has been said about it. Mainly because it does nothing! However, you can use it as a placeholder for testing purposes, as in the previous exercise.
Have a go hero – listing the last match played by each team
Read the matches.txt
file and, as output, generate a file with the following structure and data: one team by row, along with information about its last played football match. The output should be something like this:
team;match_date;goals Albania;16-10;1 Andorra;16-10;0 Antigua and Barbuda;16-10;1 Armenia;12-10;1 Austria;16-10;4 Azerbaijan;16-10;0 Belarus;16-10;2 Belgium;16-10;2 Bosnia-Herzegovina;16-10;3 Bulgaria;16-10;0
Tip
Use two Sort rows steps. Use the first for sorting as needed. In the second Sort rows step experiment with the flag named Only pass unique rows? (verifies keys only).
As it's not really intuitive, let's briefly explain the purpose of this flag.
Note
The Only pass unique rows? flag filters out duplicate rows leaving only unique occurrences. The uniqueness is forced on the list of fields by which you sorted the dataset. When there are two or more identical rows, only the first is passed to the next step(s).
This flag behaves exactly as a step that we haven't seen, but one that you can try as well: the Unique rows step, which you will find in the Transform category of steps. This step discards duplicate rows and keeps only unique occurrences. In this case, the uniqueness is also forced on a specific list of fields.
- 手把手教你玩轉RPA:基于UiPath和Blue Prism
- Hands-On Cloud Solutions with Azure
- Learning Apache Cassandra(Second Edition)
- Expert AWS Development
- 嵌入式操作系統
- 單片機技術一學就會
- PostgreSQL 10 Administration Cookbook
- 統計挖掘與機器學習:大數據預測建模和分析技術(原書第3版)
- 會聲會影X4中文版從入門到精通
- Hands-On Business Intelligence with Qlik Sense
- 納米集成電路制造工藝(第2版)
- Mastering DynamoDB
- R Statistics Cookbook
- 單片機硬件接口電路及實例解析
- 單片機與微機原理及應用