- Mastering Machine Learning with R
- Cory Lesmeister
Feature selection
What we're going to do now is use the Information package to calculate the IVs for our features. Then, I'll show you how to evaluate those values and run some plots as well. Since there are no hard and fast rules about thresholds for feature inclusion, I'll provide my judgment about where to draw the line. Of course, you can reject that and apply your own.
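Before we run the code, a quick reminder of what the package computes for each binned feature. This is just a conceptual sketch using a hypothetical numeric feature x and a 0/1 response y, not the package's actual source code:
> # hypothetical illustration of WOE and IV for one feature (ignores empty-bin edge cases)
> bins  <- cut(x, breaks = 5)                        # bin the feature
> pct_1 <- tapply(y == 1, bins, sum) / sum(y == 1)   # share of events falling in each bin
> pct_0 <- tapply(y == 0, bins, sum) / sum(y == 0)   # share of non-events falling in each bin
> woe   <- log(pct_1 / pct_0)                        # weight of evidence per bin
> iv    <- sum((pct_1 - pct_0) * woe)                # information value for the feature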
In this example, the code will create a series of tables you can use to explore the results. To get started, you only need to specify the data and the response or "y" variable:
IV <- Information::create_infotables(data = train, y = "y", parallel = FALSE)
This will give us an IV summary of the top 25 features:
> knitr::kable(head(IV$Summary, 25))
| |Variable | IV|
|:---|:--------|------:|
|2 |V2 | 0.7006|
|102 |V103 | 0.5296|
|124 |V125 | 0.5281|
|45 |V45 | 0.5273|
|31 |V31 | 0.5213|
|125 |V126 | 0.4507|
|55 |V55 | 0.3135|
|140 |V141 | 0.0982|
|108 |V109 | 0.0711|
|130 |V131 | 0.0681|
|33 |V33 | 0.0672|
|104 |V105 | 0.0640|
|66 |V66 | 0.0519|
|92 |V93 | 0.0519|
|128 |V129 | 0.0499|
|121 |V122 | 0.0461|
|24 |V24 | 0.0417|
|131 |V132 | 0.0365|
|34 |V34 | 0.0323|
|47 |V47 | 0.0323|
|123 |V124 | 0.0289|
|129 |V130 | 0.0194|
|83 |V84 | 0.0189|
|19 |V19 | 0.0181|
|35 |V35 | 0.0181|
The results show us the feature's column number, the feature name, and its IV. Notice that five features have IVs above 0.5; by the usual rule of thumb, values that high are suspiciously strong and worth a second look. I'm all for keeping any feature with an IV above 0.02, which is the bottom of the range for weak predictors; that cutoff gives us 21 input features. The V2 feature is interesting. If you look at the values and think about the data, it seems clear that it's the customer's age. Let's see how the data is binned, along with the WOE values and the IVs:
> knitr::kable(IV$Tables$V2)
|V2 | N| Percent| WOE| IV|
|:--------|-----:|-------:|-------:|------:|
|[5,22] | 951 | 0.0156 | 0.0000 | 0.0000|
|[23,23] | 16222| 0.2667 | -1.6601| 0.3705|
|[24,24] | 4953 | 0.0814 | -1.2811| 0.4481|
|[25,26] | 6048 | 0.0994 | -0.7895| 0.4919|
|[27,31] | 8088 | 0.1330 | 0.2261 | 0.4994|
|[32,36] | 6037 | 0.0993 | 0.4923 | 0.5297|
|[37,42] | 6302 | 0.1036 | 0.6876 | 0.5975|
|[43,51] | 6095 | 0.1002 | 0.7328 | 0.6737|
|[52,105] | 6120 | 0.1006 | 0.4636 | 0.7006|
OK, you've got to be kidding me. Look at bin number 2, which I believe is a customer age of 23 years. It constitutes almost 27 percent of the total observations and, since the IV column is cumulative, contributes 0.3705 of the feature's total of 0.7006, which is over half of the IV. Suspicious indeed! How is any algorithm we build on this data going to generalize if this feature really is age, as I suspect? However, chasing that down is outside the scope of this endeavor and not worth spending any more time or effort on. Here we can quickly bring up a bar plot of the WOE values by bin:
> Information::plot_infotables(IV, "V2", show_values = TRUE)
The output of the preceding code is as follows (a bar chart of the WOE value for each V2 bin, with the values labeled):
It's interesting that there's a somewhat linear relationship between this feature and the response. One option is to create new features that replace the binned values with their WOE values; those new features would be roughly linear in the log-odds and could be used in place of the originals (I'll sketch what that re-coding might look like after the grid plot below). We'll forgo it here, though, because what method will handle that for us? That's right: MARS, in the next section, can do it for us! Here is a grid plot of the top four features:
> Information::plot_infotables(IV, IV$Summary$Variable[1:4], same_scales=TRUE)
The output of the preceding code is as follows (one WOE-by-bin panel for each of the top four features):
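Since we mentioned re-coding binned values into WOE values, here is a minimal sketch of what that could look like for V2. This is not the book's code; it assumes the bin labels in IV$Tables$V2 follow the "[lower,upper]" format shown earlier and that every V2 value falls inside one of those bins:
> V2_tbl <- IV$Tables$V2
> # parse the lower bound out of each bin label, for example "[23,23]" becomes 23
> lower <- as.numeric(sub("\\[(-?[0-9]+),.*", "\\1", V2_tbl$V2))
> # assign each observation to its bin and substitute that bin's WOE value
> train$V2_woe <- V2_tbl$WOE[findInterval(train$V2, lower)]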
Now, given the cutoff point I picked previously, we can select those 21 features:
> features <- IV$Summary$Variable[1:21]
> train_reduced <- train[, colnames(train) %in% features]
> train_reduced$y <- train$y
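As a quick check, since the summary from create_infotables() is sorted in descending IV order, we can confirm that the 0.02 cutoff really does correspond to the 21 features we kept and that the reduced data frame has the expected shape:
> sum(IV$Summary$IV > 0.02)   # number of features clearing the cutoff; should be 21
> dim(train_reduced)          # 21 predictors plus the response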
There you go. We're now ready to begin training our algorithm.