官术网_书友最值得收藏!

Multivariate adaptive regression splines

In the prior chapter, we went through a discussion on MARS, how it works, why use it, and so on, so I won't duplicate that here; other than that, it can be applied in a classification problem as a generalized linear model. One of the key benefits is its power to conduct feature selection, so there's no need to run stepwise or IV—or even regularization, for that matter.

We'll train it with 5-fold cross-validation and set nprune = 15 to limit the maximum number of features at 15. Recall from the previous chapter that more than 15 terms are possible as it fits piecewise splines.

This code will give us our model object. Be advised that this may take some time to complete:

> set.seed(1972)

> earth_fit <-
earth::earth(
x = train[, -142],
y = train[, 142],
pmethod = 'cv',
nfold = 5,
degree = 1,
minspan = -1,
nprune = 15,
glm = list(family = binomial)
)

Here's the model summary:

> summary(earth_fit)
Call: earth(x=train[,-142], y=train[,142], pmethod="cv",
glm=list(family=binomial), degree=1, nprune=15, nfold=5,
minspan=-1)

GLM coefficients
y
(Intercept) -3.4407
V23 -6.6750
V24 -1.3539
V105 -0.8200
h(28-V2) -0.4769
h(V2-28) 0.0071
h(1-V31) 1.4876
h(V31-1) 0.0947
h(106449-V141) 0.0000
h(V141-106449) 0.0000

Earth selected 10 of 10 terms, and 6 of 141 predictors using pmethod="cv"

As you can see in the summary, the model ended up with six total predictive features and a total of ten terms, including a hinge function on V2. By standard protocol, the paired hinge terms can be read first predictor less than the hinge value, and then predictor greater than or equal to hinge. For instance, for V31, a value less than 1 has a coefficient of 1.4876, otherwise 0.0947.

We can plot the linear interactions with respect to predicted probability. Setting ylim to NA helps to show changes in y (predicted probability) versus changes in feature values:

> plotmo::plotmo(earth_fit, ylim = NA)

The output of the preceding code is as follows:

Notice how, for V31, values equal to zero have one coefficient else another one as described previously. Feature importance is trivial to produce:

> earth::evimp(earth_fit)
nsubsets gcv rss
V31 9 100.0 100.0
V2 8 74.8 75.0
V141 7 40.2 41.1
V105 6 31.5 32.5
V23 5 26.8 27.8
V24 4 20.7 21.8

The nsubsets criterion counts the number of model subsets that include the feature. Features that included in more subsets are considered more important. These subsets are of the terms generated by earth's pruning pass.

Now, we specified cross-validation, but earth concurrently does forward and backward feature selection and elimination, using Generalized Cross-Validation (GCV) as discussed in the previous chapter. So, GCV and rss results for a feature are normalized from 0 to 100 for comparison purposes.

Like we did previously, a plot of the probability densities is useful, and earth comes with its own plotd() function:

I like how the predicted values are reversed compared to the prior plots. Other than that, it's hard to discern anything meaningful with the exception that the densities are quite similar. Let's get the cutoff value:

> mars_cutoff <-
InformationValue::optimalCutoff(
train$y,
pred,
optimiseFor = 'Both',
returnDiagnostics = TRUE
)

Examination of the object provides the following:

  • Optimal cutoff = 0.04976
  • TPR = 0.6449
  • FPR = 0.208

In comparison with logistic regression, we have a higher rate of finding true positives at the expense of a slightly higher rate of false positives.

Let's move on to evaluating performance on the test set:

> test_pred <- predict(earth_fit, test, type = 'response')

> Metrics::auc(test$y, test_pred)
[1] 0.8079

> Metrics::logLoss(test$y, test_pred)
[1] 0.1406

What do we see here? A slight improvement in AUC, and a lower (better) log-loss. While not dramatic, it may be of value. We can now turn to visually comparing the two models to confirm that MARS is indeed the preferred algorithm.

主站蜘蛛池模板: 县级市| 大名县| 樟树市| 新野县| 松滋市| 浑源县| 大丰市| 浦江县| 合江县| 阿克陶县| 滨州市| 平舆县| 岳阳县| 卫辉市| 萍乡市| 前郭尔| 炉霍县| 天峨县| 永康市| 台中市| 焦作市| 五常市| 平果县| 淮安市| 双峰县| 黎城县| 濮阳县| 桂阳县| 普安县| 凤城市| 汶上县| 玉林市| 托克逊县| 永昌县| 雅安市| 昔阳县| 增城市| 普宁市| 云南省| 开鲁县| 永仁县|