- Predictive Analytics Using Rattle and Qlik Sense
- Ferran Garcia Pagans
- 1790字
- 2021-07-16 13:40:19
Transforming data
Data transformation and exploratory data analysis are two iterative steps. The objective is to improve the data quality to create a more accurate model. In order to transform your data, you need to understand it first. So, in real life, you can explore and transform iteratively until you are fine with your data.
For simplicity, we'll cover data transformation in this chapter and data exploration in the next chapter.
Data mining experts usually spend a lot of time preparing data before they start modeling. Preparing data is not as glamorous as creating predictive models but it has a great impact in the model performance. So, be patient and spend time to create a good dataset.
When we execute a transformation in a variable, Rattle doesn't modify the original variable. Rattle creates a new variable with a prefix that indicates the performed transformation and the name of the original variable. An example can be seen in the following screenshot:

We see the list of variables contained in Rattle after applying a rank transformation to the variable fare.
Transforming data with Rattle
Rattle's Transform tab offers four different types of transformations:
- Rescale
- Impute
- Recode
- Cleanup
These transformation options are shown in the following screenshot:

Rescaling data
In real life, measures use different scales; for example, in the Titanic passenger list, the minimum value for the variable Age is 0.42 and the maximum value is 80. For the variable Fare, the minimum is 0 and the maximum is 512.3. For this reason, a difference of 10 is a big difference for the variable Age and a small difference for Fare. Some algorithms and techniques need all variables with the same scale, and we need to adjust values measured on different scales to a common scale. Rescaling is the process of adjusting the numeric values of a variable to a different scale.
In Rattle, the Rescale option has two sub-options—Normalize and Order. To Normalize variables means to modify the values of the different observations to fit into a scale. The most common normalization is Scale [0-1]. If we apply this option to a variable, Rattle will map its values between 0 and 1. In the following table, we've used five values of the variable Age to create an example. As we've seen, the minimum value is 0 and the maximum 80. Rescaling the variable from 0 to 1, the minimum value is mapped to 0 and the maximum to 1. The intermediate values are mapped in between 0 and 1, as shown in the following table:

Rattle provides two different Order transformations—Rank and Interval. With the Rank option, Rattle will convert variable values to a rank assigning 1 to the minimum value of the variable. We use this option when we're more interested in the relative position of value in the distribution than in the real value.
In our example, the first value of the variable age is 0.42 with the first position in our rank. The second position in the rank is for the value 0.67, and the third and fourth position in the rank has the same value, 0.75. In this case, Rattle doesn't use position three and four, it uses 3.5, as shown in the following table:

Finally, Interval groups the values into different groups. Use the input box to choose the number of groups you want to create, as shown in the following screenshot:

Using Interval and the value 10 for Number of Groups, Rattle will create 10 groups and will label the groups 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9. Depending on their value, Rattle will map each observation to a different group. The minimum value 0.4 will be in the group 0; 80, the maximum value, will be mapped to group 9.
Using the Impute option to deal with missing values
Sometimes, you will have incomplete observations with missing values for some variables. There are different reasons for missing values. Sometimes, data is manually collected, and not everybody collects it with the same accuracy. Sometimes data is collected from many sensors, and one of them could be temporarily out of order.
Detecting missing values could be difficult. In R, the value NA, which means Not Available, indicates a missing value, but there are a lot of data sources that codify a missing value with a concrete value. For numeric values, 0 or 99999 could identify a missing value. You'll need to explore your data carefully to find the real missing values. As we have seen, in the Titanic dataset, variables Age, Cabin, and Embarked have missing values.
With the Impute option, we can choose how we want to fill the missing values in our variables, as shown in this screenshot:

Rattle allows us to apply the following transformations to the missing values:
- Zero/Missing: Using this option, Rattle will replace all missing values in a numeric variable with 0 and missing values in a categorical variable with Missing.
- Mean: This option will use the mean to fill missing values in a numeric variable. Of course, we cannot use this option with a categorical variable.
- Median: With this option, we can replace missing values with the median. As with the Mean option, this option can only be used with numeric variables.
- Mode: Using the Mode option, Rattle will replace missing values with the most frequent value. This option can be used with both numeric as well as categorical variables.
- Constant: This option will allow us to enter a constant value to replace missing values. Like the Mode option, we can use it with both numeric as well as categorical variables.
Rattle has five different options, and if you need to use a different approach, you'll need to code in R, or fill the missing values before loading data into Rattle.
Now you probably must be thinking that the Median, Mean, and Mode options are very similar, and you don't know how to choose among the three different options. To choose one of these options, we need to see how values are distributed into the different observations. We'll see, in the next chapter, that the histogram is the best plot to see the value distribution in a variable, and you'll learn how to plot a histogram using Rattle.
To understand how to fill the missing values, you can analyze the histogram of the original variable, then apply a transformation and analyze the new histogram. With the example of the variable Age, we've created a histogram with the original variable (left-hand side). We've applied a Zero imputation and created a new histogram. When we apply a Zero imputation, we fill those values with all missing values. You will have something like the following graph:

The histogram on the left shows the shape of the original variable Age; the mean is 29.7 years. In the Titanic dataset, the variable Age has 177 missing values. During the imputation, these 177 values are set to 0. This moves the mean of the distribution to 23.8. In this case, you can see a lot of people with 0 years. As we'll see, the performance of some techniques or algorithms can be affected by this change in the distribution shape.
Now, we can apply Mean imputation (fill the missing values with the mean), Median imputation (fill the missing values with the median), or Mode imputation (fill the missing values with mode).
These three screenshots show the distribution of the Age variable histogram after applying a Mean imputation (upper), a Median imputation (middle), and a Mode imputation (lower):

Additionally, you have to consider deleting all observations with missing values in the variable Age. This variable has 177 missing values in 891 observations; filling the gaps with a fixed value will probably produce a bad performance.
Recoding variables
We use the Recode option to transform the values of variables by distributing the values into different bins or by changing the type of the variable.
Binning
Some models and algorithms only work with categorical variables. Binning is an operation that can be useful to transform a numeric variable into a categorical variable. The original values that fall in a bin take the value that represents that bin.
This is how we bin a variable:
- Divide the range of values into a series of small intervals or bins
- Distribute each value into its interval or bin
To define the groups or bins, we have three options:
- Use Quantiles to create groups with the same number of observations
- Use KMeans to create groups of members based on the distance of the values
- Choose Equal Width to distribute the values of a variable into groups of the same width, as shown in this screenshot:
Like in this screenshot, try to apply an Equal With transformation (under Binning) to the variable age. Rattle will create 10 groups and will place each observation in a group.
To distribute values into different groups, you can also select Type as Rescale, and then Order as Interval and set Number of Groups as 10, as shown in the following screenshot:

What is the difference between the two options? The variable Age is a numeric variable; when you use Recode, the result is a numeric variable. If you use Binning, the new variable is a categorical variable, as shown in this screenshot:

In the previous screenshot, we created RIN_Age_10 using rescale and BE10_Age using binning.
Binning could also be used to reduce small observation errors. By replacing the original value by a representative value of the group, you will reduce the effect of small observation errors.
Indicator variables
As opposed to the previous section some algorithms (like many clustering models) only work with numeric variables. A simple technique to convert categorical variables into numeric variables is indicator variables. Take a categorical variable like Level with three categories—Beginner, Medium, and Advanced—and create three new variables called Beginner indicator, Medium indicator, and Advanced indicator. If the value of Level is Beginner, set variable Beginner indicator to 1 and the rest to 0, as shown in this diagram:

In Rattle, the Transform tab has an Indicator Variable option. In order to apply this transformation, select the variable (in this case, Level), select Indicator Variable, and click on Execute, as shown in the following screenshot. Rattle will create a variable for each category belonging to the categorical variable:

Join Categories
With the Join Categories option, Rattle will convert two categorical variables into a single one. In the following table, we've used Rattle to convert Level and Sex to a single variable:

As Category
Using the As Category option, you can convert numeric variables into categorical.
As Numeric
Using the As Numeric option, Rattle will convert categorical variable into numeric.