官术网_书友最值得收藏!

Multivariate adaptive regression splines

In the prior chapter, we went through a discussion on MARS, how it works, why use it, and so on, so I won't duplicate that here; other than that, it can be applied in a classification problem as a generalized linear model. One of the key benefits is its power to conduct feature selection, so there's no need to run stepwise or IV—or even regularization, for that matter.

We'll train it with 5-fold cross-validation and set nprune = 15 to limit the maximum number of features at 15. Recall from the previous chapter that more than 15 terms are possible as it fits piecewise splines.

This code will give us our model object. Be advised that this may take some time to complete:

> set.seed(1972)

> earth_fit <-
earth::earth(
x = train[, -142],
y = train[, 142],
pmethod = 'cv',
nfold = 5,
degree = 1,
minspan = -1,
nprune = 15,
glm = list(family = binomial)
)

Here's the model summary:

> summary(earth_fit)
Call: earth(x=train[,-142], y=train[,142], pmethod="cv",
glm=list(family=binomial), degree=1, nprune=15, nfold=5,
minspan=-1)

GLM coefficients
y
(Intercept) -3.4407
V23 -6.6750
V24 -1.3539
V105 -0.8200
h(28-V2) -0.4769
h(V2-28) 0.0071
h(1-V31) 1.4876
h(V31-1) 0.0947
h(106449-V141) 0.0000
h(V141-106449) 0.0000

Earth selected 10 of 10 terms, and 6 of 141 predictors using pmethod="cv"

As you can see in the summary, the model ended up with six total predictive features and a total of ten terms, including a hinge function on V2. By standard protocol, the paired hinge terms can be read first predictor less than the hinge value, and then predictor greater than or equal to hinge. For instance, for V31, a value less than 1 has a coefficient of 1.4876, otherwise 0.0947.

We can plot the linear interactions with respect to predicted probability. Setting ylim to NA helps to show changes in y (predicted probability) versus changes in feature values:

> plotmo::plotmo(earth_fit, ylim = NA)

The output of the preceding code is as follows:

Notice how, for V31, values equal to zero have one coefficient else another one as described previously. Feature importance is trivial to produce:

> earth::evimp(earth_fit)
nsubsets gcv rss
V31 9 100.0 100.0
V2 8 74.8 75.0
V141 7 40.2 41.1
V105 6 31.5 32.5
V23 5 26.8 27.8
V24 4 20.7 21.8

The nsubsets criterion counts the number of model subsets that include the feature. Features that included in more subsets are considered more important. These subsets are of the terms generated by earth's pruning pass.

Now, we specified cross-validation, but earth concurrently does forward and backward feature selection and elimination, using Generalized Cross-Validation (GCV) as discussed in the previous chapter. So, GCV and rss results for a feature are normalized from 0 to 100 for comparison purposes.

Like we did previously, a plot of the probability densities is useful, and earth comes with its own plotd() function:

I like how the predicted values are reversed compared to the prior plots. Other than that, it's hard to discern anything meaningful with the exception that the densities are quite similar. Let's get the cutoff value:

> mars_cutoff <-
InformationValue::optimalCutoff(
train$y,
pred,
optimiseFor = 'Both',
returnDiagnostics = TRUE
)

Examination of the object provides the following:

  • Optimal cutoff = 0.04976
  • TPR = 0.6449
  • FPR = 0.208

In comparison with logistic regression, we have a higher rate of finding true positives at the expense of a slightly higher rate of false positives.

Let's move on to evaluating performance on the test set:

> test_pred <- predict(earth_fit, test, type = 'response')

> Metrics::auc(test$y, test_pred)
[1] 0.8079

> Metrics::logLoss(test$y, test_pred)
[1] 0.1406

What do we see here? A slight improvement in AUC, and a lower (better) log-loss. While not dramatic, it may be of value. We can now turn to visually comparing the two models to confirm that MARS is indeed the preferred algorithm.

主站蜘蛛池模板: 潼南县| 兰考县| 隆昌县| 保康县| 雷州市| 伊吾县| 庆阳市| 马边| 永胜县| 湖口县| 枣强县| 永胜县| 澜沧| 岑溪市| 资中县| 神池县| 台中县| 庆云县| 西和县| 渑池县| 武安市| 新密市| 隆尧县| 正定县| 东台市| 乌鲁木齐市| 海盐县| 哈巴河县| 湘阴县| 沐川县| 林西县| 平远县| 芦溪县| 建湖县| 竹山县| 双鸭山市| 锦州市| 鄱阳县| 云林县| 日喀则市| 辽阳县|