
Data preparation

What we should do now is create our training and test data using a 70/30 split. Then, we'll subject the training data to the standard feature exploration we started discussing in Chapter 1, Preparing and Understanding Data, with these tasks in mind:

  • Eliminate low variance features
  • Identify and remove linear dependencies
  • Explore highly correlated features

The first step, then, is to turn the numeric outcome into a factor so that we can use it to create a stratified data index, like so:

> y_factor <- as.factor(y)

> set.seed(1492)

> index <- caret::createDataPartition(y_factor, p = 0.7, list = F)

Using the index, we create train/test input features and labels:

> train <- x[index, ]

> train_y <- y_factor[index]

> test <- x[-index, ]

> test_y <- y_factor[-index]
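
Before moving on, it's worth confirming that the stratified split preserved the class balance. Comparing the label proportions across the two partitions is a quick optional check (output omitted here, as it simply echoes two sets of proportions):

> round(prop.table(table(train_y)), 2)

> round(prop.table(table(test_y)), 2)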

With our training data in hand, let's find and eliminate the low variance features, which I can state in advance are quite a few:

> train_NZV <- caret::nearZeroVar(train, saveMetrics = TRUE)

> table(train_NZV$nzv)

FALSE TRUE
48 74

> table(train_NZV$zeroVar)

FALSE TRUE
121 1

We see that 74 features are low variance, and one of those is zero variance. Let's rid ourselves of these pesky features:

> train_r <- train[train_NZV$nzv == FALSE]
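
If you're curious why particular features were flagged, the saveMetrics output includes the freqRatio and percentUnique values behind each decision. Inspecting the flagged rows and confirming the column count of the reduced dataframe is a quick optional check (output not shown):

> head(train_NZV[train_NZV$nzv == TRUE, ])

> ncol(train_r) # should equal the FALSE count above, 48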

Given our new dataframe of reduced features, we now identify and eliminate linear dependency combinations:

> linear_combos <- caret::findLinearCombos(x = train_r)

> linear_combos
$`linearCombos`
$`linearCombos`[[1]]
[1] 13 1 2 3 4 5 9 10 11 12

$`linearCombos`[[2]]
[1] 19 16

$`linearCombos`[[3]]
[1] 20 15

$`linearCombos`[[4]]
[1] 22 1 2 3 4 5 15 16 18 21

$`linearCombos`[[5]]
[1] 40 1 2 3 4 5 39

$`linearCombos`[[6]]
[1] 42 1 2 3 4 5 41

$`linearCombos`[[7]]
[1] 47 1 2 3 4 5 43 44 45 46

$remove
[1] 13 19 20 22 40 42 47

The output provides a list of seven linear dependencies and recommends the removal of seven features. The numbers in $remove correspond to column index positions in the dataframe. For example, in combination number 2, the indices (19 and 16) correspond to the column names V36 and V22. Here's a table of these two features for demonstration purposes:

> table(train_r$V36, train_r$V22)

       0    1
  0 3032    0
  1    0 1459
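
Incidentally, rather than counting columns to match indices such as 19 and 16 to names, you can look them up directly. This optional one-liner returns the names of the seven columns slated for removal (output not shown):

> colnames(train_r)[linear_combos$remove]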

The table makes it clear that these two features are measuring the same thing. We'll remove the recommended features, but there's one more thing to discuss. When doing cross-validation during the modeling process, you may still run into warnings that linear dependencies exist, even though you ran this methodology. That was the case with this dataset in the modeling exercises that follow; after some exploration of features V1 through V5, I found that dropping V5 made the problem go away. Let's proceed with that in mind:

> train_r <- train_r[, -linear_combos$remove]

> train_r <- train_r[, -5]

> plm::detect_lin_dep(train_r)
[1] "No linear dependent column(s) detected."

Next, we check whether there are any pairwise correlations over 0.7 and remove a feature if it's highly correlated with another. The findCorrelation() function expects a correlation matrix, so we compute one on the reduced training data first:

> my_data_cor <- cor(train_r)

> high_corr <- caret::findCorrelation(my_data_cor, cutoff = 0.7)

> high_corr
[1] 29
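
Before dropping it, you can apply the same name-lookup trick as before to see which column that index points to (output not shown):

> colnames(train_r)[high_corr]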

> train_df <- train_r[, -high_corr]

The code found and removed the feature with a column index of 29. We now have a dataframe ready for modeling. If you want to look at a correlation heatmap, run this handy function from the DataExplorer package:

> DataExplorer::plot_correlation(train_df)

The output of the preceding code is as follows:

Notice that features V67 and V71 are highly correlated. In a real-world setting, this would probably warrant further investigation, but we'll feed both into our learning algorithms, as no subject matter expert can tell us otherwise.
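
If you want the actual coefficient behind that heatmap cell, a direct pairwise check works; this is just a quick sketch using the two column names discussed above, assuming both are numeric (output not shown):

> cor(train_df$V67, train_df$V71)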

We can now proceed with our model training, starting with KNN, then SVM, and comparing their performance.
