- Advanced Machine Learning with R
- Cory Lesmeister, Dr. Sunil Kumar Chinnamgari
Data preparation
Our next step is to create the training and test data using a 70/30 split. Then, we subject the training data to the standard feature exploration we started discussing in Chapter 1, Preparing and Understanding Data, with these tasks in mind:
- Eliminate low variance features
- Identify and remove linear dependencies
- Explore highly correlated features
The first thing then is for us to turn the numeric outcome into a factor to be used for creating a stratified data index, like so:
> y_factor <- as.factor(y)
> set.seed(1492)
> index <- caret::createDataPartition(y_factor, p = 0.7, list = F)
Using the index, we create train/test input features and labels:
> train <- x[index, ]
> train_y <- y_factor[index]
> test <- x[-index, ]
> test_y <- y_factor[-index]
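Because createDataPartition() samples within each level of the factor, the class balance should be nearly identical across the two sets. A quick sanity check (output omitted, as it depends on the data):

> round(prop.table(table(train_y)), 3)
> round(prop.table(table(test_y)), 3)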
With our training data in hand, let's find and eliminate the low variance features, which I can state in advance are quite a few:
> train_NZV <- caret::nearZeroVar(train, saveMetrics = TRUE)
> table(train_NZV$nzv)
FALSE TRUE
48 74
> table(train_NZV$zeroVar)
FALSE TRUE
121 1
We see that 74 features are low variance, and one of those is zero variance. Let's rid ourselves of these pesky features:
> train_r <- train[train_NZV$nzv == FALSE]
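If you're curious why a given feature was flagged, the saveMetrics = TRUE output includes the frequency ratio and percent-unique values behind each decision, and a quick dim() confirms only the 48 retained features remain:

> head(train_NZV[train_NZV$nzv == TRUE, c("freqRatio", "percentUnique")])
> dim(train_r)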
Given our new dataframe of reduced features, we now identify and eliminate linear dependency combinations:
> linear_combos <- caret::findLinearCombos(x = train_r)
> linear_combos
$`linearCombos`
$`linearCombos`[[1]]
[1] 13 1 2 3 4 5 9 10 11 12
$`linearCombos`[[2]]
[1] 19 16
$`linearCombos`[[3]]
[1] 20 15
$`linearCombos`[[4]]
[1] 22 1 2 3 4 5 15 16 18 21
$`linearCombos`[[5]]
[1] 40 1 2 3 4 5 39
$`linearCombos`[[6]]
[1] 42 1 2 3 4 5 41
$`linearCombos`[[7]]
[1] 47 1 2 3 4 5 43 44 45 46
$remove
[1] 13 19 20 22 40 42 47
The output provides a list of seven linear dependency combinations and recommends the removal of seven features. The numbers in $remove are the column index numbers in the dataframe. For example, in combination number 2, indices 19 and 16 correspond to the column names V36 and V22. Here's a table of these two features for demonstration purposes:
> table(train_r$V36, train_r$V22)
0 1
0 3032 0
1 0 1459
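Every count falls on the diagonal, so the two features agree on every row; assuming they are stored as numeric 0/1 indicators, their correlation on this table would be exactly 1, which you can confirm directly:

> cor(train_r$V36, train_r$V22)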
It's clear these two features are measuring the same thing. We'll remove the recommended features, but there's one more thing to discuss. When doing cross-validation during the modeling process, you may run into warnings that linear dependencies exist even though you ran this methodology. I found that to be the case with this dataset in the modeling exercises that follow. After some exploration of features V1 through V5, I found that dropping V5 made the problem go away. Let's proceed with that in mind:
> train_r <- train_r[, -linear_combos$remove]
> train_r <- train_r[, -5]
> plm::detect_lin_dep(train_r)
[1] "No linear dependent column(s) detected."
Next, we check whether any pairwise correlations exceed 0.7, and remove a feature if it's highly correlated with another. Note that findCorrelation() expects a correlation matrix, so we compute one first:
> my_data_cor <- cor(train_r)
> high_corr <- caret::findCorrelation(my_data_cor, cutoff = 0.7)
> high_corr
[1] 29
> train_df <- train_r[, -high_corr]
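One housekeeping item worth keeping in mind: every column we've dropped from the training data must also be dropped from the test data before prediction. A simple way to keep the two in sync, assuming test still holds all of the original columns, is to subset it by the surviving column names:

> test_df <- test[, colnames(train_df)]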
The code found and removed the feature with a column index of 29. We now have a dataframe ready for modeling. If you want to look at a correlation heatmap, then run this handy function from the DataExplorer package:
> DataExplorer::plot_correlation(train_df)
The output of the preceding code is a correlation heatmap of the remaining features.
Notice that features V67 and V71 are highly correlated. In a real-world setting, this would probably warrant further investigation, but we'll feed both into our learning algorithms, as no subject matter expert can tell us otherwise.
We can now proceed with our model training, starting with KNN, then SVM, and comparing their performance.
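As a preview, training a KNN model on the prepared dataframe with caret follows the usual train() pattern; the cross-validation settings and tuning grid below are illustrative choices, not necessarily the ones we'll settle on:

> set.seed(1984)
> knn_fit <- caret::train(
    x = train_df, y = train_y,
    method = "knn",
    trControl = caret::trainControl(method = "cv", number = 5),
    tuneGrid = expand.grid(k = seq(3, 15, 2)))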