- Pentaho Data Integration Beginner's Guide(Second Edition)
- María Carina Roldán
- 1037字
- 2021-07-23 15:46:57
Time for action – sorting information about matches with the Sort rows step
In Chapter 3, Manipulating Real-world Data, you created a file with information about football matches. The following lines of code show a variant of that file. The information is the same, and only the structure of the data has changed:
region;match_date;type;team;goals europe;07-09;home;Iceland;2 europe;07-09;away;Norway;0 europe;07-09;home;Russia;2 europe;07-09;away;Northern Ireland;0 europe;07-09;home;Liechtenstein;1 europe;07-09;away;Bosnia-Herzegovina;8 europe;07-09;home;Wales;0 europe;07-09;away;Belgium;2 europe;07-09;home;Malta;0 europe;07-09;away;Armenia;1 ...
Now you want to see the same information, but sorted by date and team.
- First of all, download this sample file from the Packt Publishing website (www.packtpub.com/support). The name of the file is
matches.txt
. - Now create a transformation, give it a name and description, and save it in a folder of your choice.
- By using a Text file input step, read the file. Provide the name and location of the file, check the Content tab to see that everything matches your file, and fill in the Fields tab with the shown columns:
region
,match_date
,type
,team
,goals
.Note
If you will not use the
match_date
field for date operations (for example, adding dates), you don't have to define the first column as aDate
. The same is valid for thegoals
column: You only need to define the column asInteger
if you plan to do math with it. In the other case, aString
is enough. - Do a preview just to confirm that the step is well configured.
- Add a Select values step to select and reorder the columns as follows:
team
,type
,match_date
,goals
. - From the Transform category of steps add a Sort rows step, and create a link from the Select values step towards this new step.
- Double-click on the Sort rows step and configure it, as shown in the following screenshot:
- Click on OK.
- At the end, add a Dummy step. Your transformation should look like this:
- Save the transformation.
- Select the last step and do a preview. You should see this:
What just happened?
You read a file with a list of football match results and sorted the rows based on two columns: team
(ascendant) and type
(descendant). Using this method, your final data was ordered by team. Then, for any given team, the home
values were first, followed by the away
values. This is due to the ascending flag that is you set N
under the Ascending column.
Note that the Select values step is not mandatory here. We just used it for having the columns team
and type
to the left, so the sorted dataset was easy to read.
Tip
Sorting data
For small datasets, the sorting algorithm runs mainly using the JVM memory. When the number of rows exceeds the specified sort size, it works differently. Suppose that you put 5000
as the value of the Sort size field. Every 5000 rows, the process sorts them and writes them to a temporary file. When there are no more rows, it does a merge sort on all of those files and gives you back the sorted dataset. You can conclude that for huge datasets a lot of reading and writing operations are done on your disk, which slows down the whole transformation. Fortunately, you can change the number of rows in memory (one million by default) by setting a new value in the Sort size (rows in memory) textbox. The bigger this number, the faster the sorting process. In summary, the amount of memory allocated to the process will offset the speed gained, and as soon as the JVM has to start using swap space, the performance will degrade.
Note that a sort size that works in your system may not work in a machine with a different configuration. To avoid that risk you can use a different approach. In the Sort rows configuration window you can set a Free memory threshold (in %) value. The process begins to use temporary files when the percentage of available memory drops below the indicated threshold. The lower the percentage, the faster the process.
You cannot, however, just set a small free memory threshold and expect that everything runs fine
Note
As it is not possible to know the exact amount of free memory, it's not recommended to set a very small free memory threshold. You definitely should not use that option in complex transformations, or when there is more than one sort going on, as you could still run out of memory.
The final steps were added just to preview the result of the transformation. You could have previewed the data by selecting the Sort rows step. The idea, however, is that after this test, you can replace the Dummy step with any of the output steps you already know, or delete it and continue transforming the data.
Tip
You have used the Dummy step several times but still nothing has been said about it. Mainly because it does nothing! However, you can use it as a placeholder for testing purposes, as in the previous exercise.
Have a go hero – listing the last match played by each team
Read the matches.txt
file and, as output, generate a file with the following structure and data: one team by row, along with information about its last played football match. The output should be something like this:
team;match_date;goals Albania;16-10;1 Andorra;16-10;0 Antigua and Barbuda;16-10;1 Armenia;12-10;1 Austria;16-10;4 Azerbaijan;16-10;0 Belarus;16-10;2 Belgium;16-10;2 Bosnia-Herzegovina;16-10;3 Bulgaria;16-10;0
Tip
Use two Sort rows steps. Use the first for sorting as needed. In the second Sort rows step experiment with the flag named Only pass unique rows? (verifies keys only).
As it's not really intuitive, let's briefly explain the purpose of this flag.
Note
The Only pass unique rows? flag filters out duplicate rows leaving only unique occurrences. The uniqueness is forced on the list of fields by which you sorted the dataset. When there are two or more identical rows, only the first is passed to the next step(s).
This flag behaves exactly as a step that we haven't seen, but one that you can try as well: the Unique rows step, which you will find in the Transform category of steps. This step discards duplicate rows and keeps only unique occurrences. In this case, the uniqueness is also forced on a specific list of fields.
- Microsoft Power BI Quick Start Guide
- Hands-On Artificial Intelligence on Amazon Web Services
- 傳感器技術(shù)實驗教程
- 錯覺:AI 如何通過數(shù)據(jù)挖掘誤導(dǎo)我們
- Natural Language Processing Fundamentals
- 數(shù)據(jù)庫原理與應(yīng)用技術(shù)學(xué)習(xí)指導(dǎo)
- 空間傳感器網(wǎng)絡(luò)復(fù)雜區(qū)域智能監(jiān)測技術(shù)
- 系統(tǒng)安裝與重裝
- Building a BeagleBone Black Super Cluster
- 云計算和大數(shù)據(jù)的應(yīng)用
- Machine Learning Algorithms(Second Edition)
- Learn QGIS
- PowerMill 2020五軸數(shù)控加工編程應(yīng)用實例
- HBase Essentials
- Xilinx FPGA高級設(shè)計及應(yīng)用