官术网_书友最值得收藏!

Time for action – sorting information about matches with the Sort rows step

In Chapter 3, Manipulating Real-world Data, you created a file with information about football matches. The following lines of code show a variant of that file. The information is the same, and only the structure of the data has changed:

region;match_date;type;team;goals
europe;07-09;home;Iceland;2
europe;07-09;away;Norway;0
europe;07-09;home;Russia;2
europe;07-09;away;Northern Ireland;0
europe;07-09;home;Liechtenstein;1
europe;07-09;away;Bosnia-Herzegovina;8
europe;07-09;home;Wales;0
europe;07-09;away;Belgium;2
europe;07-09;home;Malta;0
europe;07-09;away;Armenia;1
...

Now you want to see the same information, but sorted by date and team.

  1. First of all, download this sample file from the Packt Publishing website (www.packtpub.com/support). The name of the file is matches.txt.
  2. Now create a transformation, give it a name and description, and save it in a folder of your choice.
  3. By using a Text file input step, read the file. Provide the name and location of the file, check the Content tab to see that everything matches your file, and fill in the Fields tab with the shown columns: region, match_date, type, team, goals.
    Note

    If you will not use the match_date field for date operations (for example, adding dates), you don't have to define the first column as a Date. The same is valid for the goals column: You only need to define the column as Integer if you plan to do math with it. In the other case, a String is enough.

  4. Do a preview just to confirm that the step is well configured.
  5. Add a Select values step to select and reorder the columns as follows: team, type, match_date, goals.
  6. From the Transform category of steps add a Sort rows step, and create a link from the Select values step towards this new step.
  7. Double-click on the Sort rows step and configure it, as shown in the following screenshot:
  8. Click on OK.
  9. At the end, add a Dummy step. Your transformation should look like this:
  10. Save the transformation.
  11. Select the last step and do a preview. You should see this:

What just happened?

You read a file with a list of football match results and sorted the rows based on two columns: team (ascendant) and type (descendant). Using this method, your final data was ordered by team. Then, for any given team, the home values were first, followed by the away values. This is due to the ascending flag that is you set N under the Ascending column.

Note that the Select values step is not mandatory here. We just used it for having the columns team and type to the left, so the sorted dataset was easy to read.

Tip

Sorting data

For small datasets, the sorting algorithm runs mainly using the JVM memory. When the number of rows exceeds the specified sort size, it works differently. Suppose that you put 5000 as the value of the Sort size field. Every 5000 rows, the process sorts them and writes them to a temporary file. When there are no more rows, it does a merge sort on all of those files and gives you back the sorted dataset. You can conclude that for huge datasets a lot of reading and writing operations are done on your disk, which slows down the whole transformation. Fortunately, you can change the number of rows in memory (one million by default) by setting a new value in the Sort size (rows in memory) textbox. The bigger this number, the faster the sorting process. In summary, the amount of memory allocated to the process will offset the speed gained, and as soon as the JVM has to start using swap space, the performance will degrade.

Note that a sort size that works in your system may not work in a machine with a different configuration. To avoid that risk you can use a different approach. In the Sort rows configuration window you can set a Free memory threshold (in %) value. The process begins to use temporary files when the percentage of available memory drops below the indicated threshold. The lower the percentage, the faster the process.

You cannot, however, just set a small free memory threshold and expect that everything runs fine

Note

As it is not possible to know the exact amount of free memory, it's not recommended to set a very small free memory threshold. You definitely should not use that option in complex transformations, or when there is more than one sort going on, as you could still run out of memory.

The final steps were added just to preview the result of the transformation. You could have previewed the data by selecting the Sort rows step. The idea, however, is that after this test, you can replace the Dummy step with any of the output steps you already know, or delete it and continue transforming the data.

Tip

You have used the Dummy step several times but still nothing has been said about it. Mainly because it does nothing! However, you can use it as a placeholder for testing purposes, as in the previous exercise.

Have a go hero – listing the last match played by each team

Read the matches.txt file and, as output, generate a file with the following structure and data: one team by row, along with information about its last played football match. The output should be something like this:

team;match_date;goals
Albania;16-10;1
Andorra;16-10;0
Antigua and Barbuda;16-10;1
Armenia;12-10;1
Austria;16-10;4
Azerbaijan;16-10;0
Belarus;16-10;2
Belgium;16-10;2
Bosnia-Herzegovina;16-10;3
Bulgaria;16-10;0
Tip

Use two Sort rows steps. Use the first for sorting as needed. In the second Sort rows step experiment with the flag named Only pass unique rows? (verifies keys only).

As it's not really intuitive, let's briefly explain the purpose of this flag.

Note

The Only pass unique rows? flag filters out duplicate rows leaving only unique occurrences. The uniqueness is forced on the list of fields by which you sorted the dataset. When there are two or more identical rows, only the first is passed to the next step(s).

This flag behaves exactly as a step that we haven't seen, but one that you can try as well: the Unique rows step, which you will find in the Transform category of steps. This step discards duplicate rows and keeps only unique occurrences. In this case, the uniqueness is also forced on a specific list of fields.

主站蜘蛛池模板: 平乡县| 都江堰市| 嵊泗县| 阜新市| 都江堰市| 堆龙德庆县| 栖霞市| 赤水市| 襄垣县| 兴化市| 梁平县| 西宁市| 永仁县| 图木舒克市| 荥经县| 嘉黎县| 富宁县| 西和县| 正宁县| 宜兰县| 麟游县| 仁怀市| 台山市| 邓州市| 乾安县| 瓦房店市| 科技| 巩留县| 漯河市| 锡林浩特市| 遂川县| 卢龙县| 简阳市| 呼玛县| 奈曼旗| 齐齐哈尔市| 乌海市| 临西县| 重庆市| 当涂县| 富民县|