Mastering Machine Learning with R (Second Edition)
Cory Lesmeister
Discriminant analysis application
LDA is performed with the MASS package, which we have already loaded in order to access the biopsy data. The syntax is very similar to that of the lm() and glm() functions.
We can now begin fitting our LDA model, which is as follows:
> lda.fit <- lda(class ~ ., data = train)
> lda.fit
Call:
lda(class ~ ., data = train)
Prior probabilities of groups:
benign malignant
0.6371308 0.3628692
Group means:
           thick  u.size u.shape   adhsn  s.size    nucl   chrom   n.nuc     mit
benign    2.9205 1.30463 1.41390 1.32450 2.11589 1.39735 2.08278 1.22516 1.09271
malignant 7.1918 6.69767 6.68604 5.66860 5.50000 7.67441 5.95930 5.90697 2.63953
Coefficients of linear discriminants:
LD1
thick 0.19557291
u.size 0.10555201
u.shape 0.06327200
adhsn 0.04752757
s.size 0.10678521
nucl 0.26196145
chrom 0.08102965
n.nuc 0.11691054
mit -0.01665454
This output shows us that the Prior probabilities of groups are approximately 64 percent for benign and 36 percent for malignant. Next is Group means, which is the average of each feature by class. The Coefficients of linear discriminants are the standardized linear combination of the features used to determine an observation's discriminant score. The higher the score, the more likely the classification is malignant.
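To see what these coefficients actually do, we can reconstruct the discriminant scores by hand. This is a minimal sketch that relies on how the predict() method in MASS works internally (each feature is centered at the prior-weighted grand mean and then multiplied by the coefficient vector stored in lda.fit$scaling); treat it as an illustration rather than required code:
> # Prior-weighted grand mean of each feature across both groups
> grand.mean <- colSums(lda.fit$prior * lda.fit$means)
> # Center the training features and apply the coefficients
> X <- as.matrix(train[, rownames(lda.fit$scaling)])
> ld1 <- scale(X, center = grand.mean, scale = FALSE) %*% lda.fit$scaling
> # This should match the x element returned by predict() (returns TRUE)
> all.equal(as.vector(ld1), as.vector(predict(lda.fit)$x))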
Calling the plot() function on an lda object will provide us with a histogram and/or the densities of the discriminant scores, as follows:
> plot(lda.fit, type = "both")
The resulting plot shows the distribution of the discriminant scores for each group. We can see that there is some overlap between the groups, indicating that there will be some incorrectly classified observations.
The predict() function available with LDA provides a list of three elements: class, posterior, and x. The class element is the prediction of benign or malignant, posterior is the probability score of the observation being in each class, and x is the linear discriminant score.
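We can confirm these element names directly:
> names(predict(lda.fit))
[1] "class"     "posterior" "x"
Let's just extract the probability of an observation being malignant: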
> train.lda.probs <- predict(lda.fit)$posterior[, 2]
> misClassError(trainY, train.lda.probs)
[1] 0.0401
> confusionMatrix(trainY, train.lda.probs)
    0   1
0 296  13
1   6 159
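As a quick sanity check, the misclassification error is just the off-diagonal share of this confusion matrix:
> (13 + 6) / (296 + 13 + 6 + 159)
[1] 0.04008439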
Well, unfortunately, it appears that our LDA model has performed much worse than the logistic regression models. The primary question is how it will perform on the test data:
> test.lda.probs <- predict(lda.fit, newdata = test)$posterior[, 2]
> misClassError(testY, test.lda.probs)
[1] 0.0383
> confusionMatrix(testY, test.lda.probs)
    0   1
0 140   6
1   2  61
That's actually not as bad as I thought, given the weaker performance on the training data. From a correctly classified perspective, it still did not perform as well as logistic regression (96 percent versus almost 98 percent with logistic regression).
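The 96 percent figure can be read straight off the test confusion matrix:
> (140 + 61) / (140 + 6 + 2 + 61)
[1] 0.9617225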
We will now move on to fit a QDA model. In R, QDA is also part of the MASS package and the function is qda(). Building the model is rather straightforward again, and we will store it in an object called qda.fit, as follows:
> qda.fit <- qda(class ~ ., data = train)
> qda.fit
Call:
qda(class ~ ., data = train)
Prior probabilities of groups:
benign malignant
0.6371308 0.3628692
Group means:
           thick u.size u.shape  adhsn s.size   nucl  chrom  n.nuc      mit
benign    2.9205 1.3046  1.4139 1.3245 2.1158 1.3973 2.0827 1.2251 1.092715
malignant 7.1918 6.6976  6.6860 5.6686 5.5000 7.6744 5.9593 5.9069 2.639535
As with LDA, the output has Group means, but it does not have coefficients because, as discussed previously, QDA is a quadratic function and there is no single linear combination of the features to report.
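The underlying reason is that QDA estimates a separate covariance matrix for each class, whereas LDA pools them into one. A minimal illustration on the training data (feature names taken from the output above; output omitted):
> features <- c("thick", "u.size", "u.shape", "adhsn", "s.size",
+               "nucl", "chrom", "n.nuc", "mit")
> # The two class-conditional covariance matrices differ, which is
> # what makes the decision boundary quadratic
> cov(train[train$class == "benign", features])
> cov(train[train$class == "malignant", features])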
The predictions for the train and test data follow the same flow of code as with LDA:
> train.qda.probs <- predict(qda.fit)$posterior[, 2]
> misClassError(trainY, train.qda.probs)
[1] 0.0422
> confusionMatrix(trainY, train.qda.probs)
    0   1
0 287   5
1  15 167
> test.qda.probs <- predict(qda.fit, newdata = test)$posterior[, 2]
> misClassError(testY, test.qda.probs)
[1] 0.0526
> confusionMatrix(testY, test.qda.probs)
    0   1
0 132   1
1  10  66
From the confusion matrices, we can quickly tell that QDA has performed the worst on the training data, and it has also classified the test set poorly, with 11 incorrect predictions. In particular, it has a high rate of false positives.
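That false positive rate can be read off the test matrix: of the 142 actual benign cases, 10 were incorrectly flagged as malignant:
> 10 / (132 + 10)
[1] 0.07042254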