Time for action – finding out which language people speak
An International Musical Contest will take place and 24 countries will participate; each presenting a duet. Your task is to hire interpreters so the contestants can communicate in their native language. In order to do that, you need to find out the language they speak:
Create a new transformation.
By using a Get data from XML step, read the file with information about countries that you used in Chapter 3, Manipulating Real-world Data, countries.xml.
Tip
To avoid configuring the step again, you can open the transformation that reads this file, copy the Get data from XML step, and paste it here.
Drag to the canvas a Filter rows step.
Create a hop from the Get data from XML step to the Filter rows step.
Edit the Filter rows step and create this condition: isofficial= T.
Click on the Filter rows step and do a preview. The list of previewed rows will show the countries along with the official languages, as shown in the following screenshot:
Now let's create the main flow of data.
From the Packt Publishing website, www.packtpub.com, download the list of contestants. It looks like this:
In the same transformation, drag to the canvas a Text file input step and read the downloaded file.
Tip
The ID and Country fields have values only in the first of the two lines for each country. In order to repeat the values in the second line, use the flag Repeat in the Fields tab. Set it to Y.
Expand the Lookup category of steps.
Drag to the canvas a Stream lookup step.
Create a hop from the Text file input you just created, to the Stream lookup step.
Create another hop from the Filter rows step to the Stream lookup step. When asked for the kind of hop, choose Main output of step. So far you have this:
Edit the Stream lookup step by double-clicking on it.
In the Lookup step drop-down list, select Filter official languages, the step that brings the list of languages.
Fill in the grids in the configuration window as follows:
Note
Note that Country Name is a field coming from the text file stream, while the country field comes from the countries stream.
Click on OK.
The hop that goes from the Filter rows step to the Stream lookup step changes its look and feel. The icon that appears over the hop shows that this is the stream where the Stream lookup step is going to look up, as shown in the following screenshot:
By using a Select values step, rename the fields Duet, Country Name, and language, to Name, Country, and Language.
Drag to the canvas a Text file output step and create the file people_and_languages.txt with the selected fields.
Save the transformation.
Run the transformation and check the final file that should look like this:
Name|Country|Language
Mikhail Davydova|Russia|
Anastasia Davydova|Russia|
Carmen Rodriguez|Spain|Spanish
Francisco Delgado|Spain|Spanish
Natsuki Harada|Japan|Japanese
Emiko Suzuki|Japan|Japanese
Lin Jiang|China|Chinese
Wei Chiu|China|Chinese
Chelsea Thompson|United States|English
Cassandra Sullivan|United States|English
Mackenzie Martin|Canada|French
Nathan Gauthier|Canada|French
Giovanni Lombardi|Italy|Italian
Federica Lombardi|Italy|Italian
What just happened?
First of all, you read a file with information about countries and the languages spoken in those countries.
Then, you read a list of people along with the country they come from. For every row in this list, you told Kettle to look for the country (the Country Name field), in the countries stream (the country field), and to give you back a language and the percentage of people who speak that language (language and percentage fields). Let's explain it with a sample row: the row for Francisco Delgado from Spain. When this row gets to the Stream lookup step, Kettle looks in the list of countries for a row with the country Spain. It finds it. Then, it returns the value of the columns language and percentage, Spanish and 74.4.
Now take another sample row: the row with the country Russia. When the row gets to the Stream lookup step, Kettle looks for it in the list of countries, but it does not find it. So, what you get as language is a null string.
Whether the country is found or not, two new fields are added to your stream: the fields language and percentage.
Finally, you exported all of the information to a plain text file.
The Stream lookup step
The Stream lookup step allows you to look up data in a secondary stream.
You tell Kettle which of the incoming streams is the stream used to look up, by selecting the right choice in the Lookup step list.
The upper grid in the configuration window allows you to specify the names of the fields that are used to look up.
In the left column, Field, you indicate the field of your main stream. You can fill in this column by using the Get Fields button, and deleting all the fields you don't want to use for the search.
In the right column, LookupField, you indicate the field of the secondary stream.
When a row of data comes to the step, a lookup is made to see if there is a row in the secondary stream for which, for every pair in the upper grid, the value of Field is equal to the value of LookupField. If there is one, the lookup will be successful.
In the lower grid, you specify the names of the secondary stream fields that you want back as result of the lookup. You can fill in this column by using the Get lookup fields button, and deleting all the fields that you don't want to retrieve.
After the lookup, new fields are added to your dataset—one for every row of this grid.
For the rows for which the lookup is successful, the values for the new fields will be taken from the lookup stream.
For the others, the fields will remain null, unless you set a default value.
It's important that you are aware of the behavior of this step:
Only one row is returned per key. If the key you are looking for appears more than once in the lookup stream, only one will be returned. As an example, think that when there is more than one official language spoken in a country, you get just one. Sometimes you don't care, but on some occasions this is not acceptable and you have to try some other methods.
There is a possible solution to this drawback in the following Have a go hero – selecting the most popular of the official languages section. You will also learn other ways to overcome this situation later in the book.
Have a go hero – selecting the most popular of the official languages
As already discussed, when a country has more than one official language, the lookup step picks any of them. Take for example, the contestant for Canada. Canada has two official languages, and English the most frequently used (60.4%). However, the lookup step returned French. So, the proposal here is to change the transformation in one of the following ways (or both—create two different transformations—if you really want to practice!):
Alter the countries stream, so for each country only the most relevant official language is considered.
Tip
Use a combination of a Sort rows and a Group by step. As Type, use First value or Last value.
Alter the countries stream so that for each country instead of one official language by row, there is a single row with the list of official languages concatenated by -.
Tip
Use a Group by step. As Type use Concatenate strings separated by, and use the Value column for typing the separator.
Have a go hero – counting words more precisely
The section where you counted the words in a file worked pretty well, but you may have noticed that it has some details you can fix or enhance.
You discarded a very small list of words, but there are many more that are common in English, such as prepositions, pronouns, and auxiliary verbs. So here is the challenge: get a list of commonly used words and save it in a file. Instead of excluding words from a small list as you did with a Filter rows step, exclude the words that are present in your words file.
Tip
Use a combination of a Stream lookup step and a Filter rows step, which discards the words if they were found in the words file.
Data cleaning
Data from the real world is not always as perfect as we would like it to be. On one hand, there are cases where the errors in data are so critical that the only solution is to report them or even abort a process. There is, however, a different kind of issue with data: minor problems that can be fixed somehow, as in the following examples:
You have a field that contains years. Among the values, you see 2912. This can be considered as a typo, and assume that the proper value is 2012.
You have a string that represents the name of a country, and it is supposed that the names belong to a predefined list of valid countries. You see, however, values as USA, U.S.A., or United States. In your list, you have only USA as valid, but it is clear that all of these values belong to the same country, and should be easy to unify.
You have a field that should contain integer numbers between 1 and 5. Among the values, you have numbers such as 3.01 or 4.99. It should not be a problem to round those numbers so the final values are all in the expected range of values.
In the following section, you will practice some of this cleansing task.
Time for action – fixing words before counting them
In this section, we will modify the transformation that counted words. We will clean the field words by removing leading characters.
Open the transformation that counted words and save it with a different name.
Delete (or disable) all the steps after the Group by step.
After the Group by step, add a Filter rows step with this condition: word STARTS WITH albite.
Do a preview on this step. You will see this:
Now, from the Transform category of steps, drag a Replace in string step to the work area.
Insert the step between the Select row step and the first Filter rows step, as shown in the following screenshot:
Double-click on the step and configure it. Under In stream field, type word. Under use RegEx, type or select Y. Under Search, type [\.,:]$.
Click on OK.
Repeat the preview on the recently added Filter rows step. You will see this:
What just happened?
You modified the transformation that counted words by cleaning some words.
In the source file, you had several occurrences of the word albite, but some of them had a leading character as., ,; or :. This caused the transformation to consider them as different words, as you saw in the first preview. Simply by using the Replace in string step you removed those symbols, and then all the occurrences of the word were grouped together, which lead to a more precise final count.
Let's briefly explain how the Replace in string step works. The function of the step is to take a field and replace its value with all of the occurrences of a string with a different string. In this case, you wanted to modify the word field by deleting the leading symbols (.,;:); in other words, replacing them with null.
In order to tell Kettle which string to replace, you provided a regular expression: [\.,:]$ This expression represents any of the following characters .,: at the end of the field and not in another place.
In order to remove these symbols, you left the Replace with column empty.
By leaving the Out stream field column empty, you have overwritten the field with the new value.
Cleansing data with PDI
While validation means mainly rejecting data, data cleansing detects and tries to fix not only invalid data, but also the data that is considered illegal or inaccurate in a specific domain.
Data cleansing, also known as data cleaning or data scrubbing, may be done manually or automatically depending on the complexity of the cleansing.
Knowing in advance the rules that apply, you can do an automatic cleaning by using any PDI step that suits. The following are some steps particularly useful:
For examples that use these steps or for getting more information about them, please refer to Appendix D, Job Entries and Steps Reference.
Have a go hero – counting words by cleaning them first
If you take a look at the results in the previous section, you may notice that some words appear more than one in the final list, because of special signs such as ., ), or ", or because of lower or uppercase letters. Look, for example, how many times the word rock appears: rock (99 occurrences), rock,(51 occurrences), rock. (11 occurrences), rock." (1 occurrence), rock: (6 occurrences), rock; (2 occurrences). You can fix it and make the word rock to appear only once. Before grouping the words, remove all extra signs and convert all of the words to lower or uppercase, so they are grouped as expected.
Tip
Try one or more of the following steps: Formula, Calculator, User Defined Java Expression, or any of the steps mentioned in the preceding table.
Summary
In this chapter, we learned some of the most used and useful ways of transforming data.
Specifically, you learned about filtering and sorting data, calculating statistics on groups of rows, and looking up data.
You also learned what data cleansing is about. After learning about the basic manipulation of data, you may now create more complex transformations, where the streams begin to split and merge. That is the core subject of the next chapter.