Book title: Statistics for Machine Learning
Author: Pratap Dangeti
Chapter word count: 1001
Last updated: 2021-07-02 19:06:01

Terminology involved in logistic regression

Logistic regression is a favorite ground for many interviewers to test the depth of an analyst's statistical acumen. It has been said that, even if someone understands 1,000 concepts in logistic regression, an interviewer will always have a question 1,001. Hence, it is worth building knowledge of logistic regression from its fundamentals in order to create a solid foundation:

Information value (IV): This is very useful for the preliminary filtering of variables before they are included in the model. In industry, IV is mainly used as a first step to eliminate the majority of variables before fitting the model, since the final model typically contains only about 10 variables. Hence, initial processing is needed to reduce the candidate set from 400 or more variables.

Example: In the following table, a continuous variable (price) has been broken down into deciles (10 bins) based on price range, the number of events and non-events in each bin has been counted, and the information value has been calculated for each bin and summed. The total is 0.0356, meaning that price is a weak predictor for classifying events.
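The per-bin calculation described above can be sketched in Python. This is a minimal illustration, assuming the commonly used formulation IV = sum over bins of (% of non-events - % of events) * ln(% of non-events / % of events); the function name `information_value` and the three-bin counts are hypothetical and are not taken from the book's table.

```python
import math

def information_value(events, non_events):
    """Return (per-bin IV contributions, total IV) from per-bin counts.

    Assumes every bin contains at least one event and one non-event;
    a zero count would make the logarithm undefined.
    """
    total_events = sum(events)
    total_non_events = sum(non_events)
    contributions = []
    for e, ne in zip(events, non_events):
        pct_events = e / total_events              # share of all events in this bin
        pct_non_events = ne / total_non_events     # share of all non-events in this bin
        woe = math.log(pct_non_events / pct_events)  # weight of evidence for the bin
        contributions.append((pct_non_events - pct_events) * woe)
    return contributions, sum(contributions)

# Hypothetical counts for three price bins (not the book's table).
events_per_bin = [30, 50, 20]
non_events_per_bin = [200, 500, 300]
per_bin, total_iv = information_value(events_per_bin, non_events_per_bin)
print(per_bin, total_iv)
```

Each bin's contribution is non-negative, so the total grows as the event and non-event distributions move further apart across bins.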

Akaike information criterion (AIC): This measures the relative quality of a statistical model for a given set of data. It is a trade-off between bias and variance. When two models are compared, the model with the lower AIC is preferred.

If we closely observe the following equation, the k parameter (the number of variables included in the model) penalizes overfitting. We can artificially improve the training accuracy of a model by incorporating more variables that are not very significant; by doing so, we may get better accuracy on training data, but accuracy on testing data will decrease. The 2*k term counters this and acts as a form of regularization in logistic regression:

AIC = -2*ln(L) + 2*k

L = Maximum value of Likelihood (log transformation applied for mathematical convenience)

k = Number of variables in the model
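As a small worked illustration of the formula, the sketch below compares two models. The function name `aic` and the log-likelihood values are hypothetical and chosen only to show how the penalty term can outweigh a slightly better fit.

```python
def aic(log_likelihood, k):
    """AIC = -2*ln(L) + 2*k, where log_likelihood is ln(L)."""
    return -2.0 * log_likelihood + 2.0 * k

# Hypothetical models: B fits the training data slightly better (higher ln(L))
# but uses four more variables than A.
aic_a = aic(log_likelihood=-120.0, k=5)
aic_b = aic(log_likelihood=-118.5, k=9)

preferred = "A" if aic_a < aic_b else "B"
print(aic_a, aic_b, preferred)
```

Here the extra variables in model B improve the likelihood by less than the penalty they incur, so the lower AIC belongs to the smaller model A.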

Receiver operating characteristic (ROC) curve: This is a graphical plot that illustrates the performance of a binary classifier as its discrimination threshold is varied. The curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold values.

A simple way to understand the utility of the ROC curve is to consider the threshold, a real value between 0 and 1 used to convert the predicted probability into a class (logistic regression predicts a probability). If we keep the threshold very low, we will put most of the observations in the positive category, even when some of them belong in the negative category. On the other hand, a very high threshold penalizes the positive category, while classification of the negative category improves. Ideally, the threshold should be set so that it trades off both categories and produces higher overall accuracy:

Optimum threshold = Threshold at which (sensitivity + specificity) is maximum
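The threshold sweep can be sketched in plain Python. This is only an illustration under stated assumptions: the function names `roc_points` and `optimum_threshold`, the labels, the probabilities, and the candidate thresholds are all hypothetical, and an observation is classed as positive when its probability is greater than or equal to the threshold.

```python
def roc_points(y_true, y_prob, thresholds):
    """Return a list of (threshold, TPR, FPR) tuples.

    Assumes y_true contains at least one 1 and at least one 0.
    """
    points = []
    for t in thresholds:
        y_pred = [1 if p >= t else 0 for p in y_prob]
        tp = sum(1 for a, p in zip(y_true, y_pred) if a == 1 and p == 1)
        fn = sum(1 for a, p in zip(y_true, y_pred) if a == 1 and p == 0)
        fp = sum(1 for a, p in zip(y_true, y_pred) if a == 0 and p == 1)
        tn = sum(1 for a, p in zip(y_true, y_pred) if a == 0 and p == 0)
        tpr = tp / (tp + fn)   # sensitivity
        fpr = fp / (fp + tn)   # 1 - specificity
        points.append((t, tpr, fpr))
    return points

def optimum_threshold(y_true, y_prob, thresholds):
    """Pick the threshold that maximizes sensitivity + specificity."""
    best = max(roc_points(y_true, y_prob, thresholds),
               key=lambda pt: pt[1] + (1.0 - pt[2]))
    return best[0]

# Hypothetical labels and predicted probabilities.
y_true = [1, 1, 1, 0, 0, 0]
y_prob = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2]
thresholds = [0.1, 0.3, 0.5, 0.7, 0.9]

for t, tpr, fpr in roc_points(y_true, y_prob, thresholds):
    print(t, tpr, fpr)
print(optimum_threshold(y_true, y_prob, thresholds))
```

Plotting the TPR values against the FPR values from `roc_points` traces the ROC curve; when several thresholds tie on sensitivity + specificity, this sketch simply returns the first one in the list.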

Before we jump into the nitty-gritty, we will visualize the confusion matrix to understand the formulas that follow:

The ROC curve will look as follows:

Rank ordering: After sorting observations in descending order of predicted probability, deciles are created (10 equal bins, each with 10 percent of the total observations). Adding up the number of events in each decile gives the aggregated events per decile, and this number should decrease from the top decile to the bottom one; otherwise, the model seriously violates logistic regression methodology.

One way to see why rank ordering is important: it is very useful when we set the cut-off at the top three to four deciles and send marketing campaigns to those segments, which have a higher chance of responding. If rank order does not hold for the model, then even after selecting the top three to four deciles, a significant chunk of likely responders will be left below the cut-off point, which is dangerous.
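The decile check can be sketched as follows. The function names `events_per_decile` and `rank_order_holds` and the 20 observations are hypothetical, and the check treats equal counts in neighboring deciles as acceptable, which is an assumption rather than something the text specifies.

```python
def events_per_decile(y_true, y_prob, n_bins=10):
    """Sort by predicted probability (descending) and count events per bin."""
    order = sorted(range(len(y_prob)), key=lambda i: y_prob[i], reverse=True)
    n = len(order)
    counts = []
    for b in range(n_bins):
        start = b * n // n_bins        # integer bin boundaries give near-equal bins
        end = (b + 1) * n // n_bins
        counts.append(sum(y_true[i] for i in order[start:end]))
    return counts

def rank_order_holds(counts):
    """True when event counts never increase from one decile to the next."""
    return all(a >= b for a, b in zip(counts, counts[1:]))

# Hypothetical scores: 20 observations whose events sit at the high-probability end.
y_prob = [i / 20 for i in range(20)]
y_true = [1 if p >= 0.5 else 0 for p in y_prob]

counts = events_per_decile(y_true, y_prob)
print(counts, rank_order_holds(counts))
```

A count that rises in a lower decile signals that the model ranks some responders below non-responders, which is exactly the case where a top-decile campaign cut-off would miss them.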

Concordance/c-statistic: This is a measure of quality of fit for a binary outcome in a logistic regression model. It is the proportion of pairs (one actual event, one actual non-event) in which the predicted event probability is higher for the actual event than for the non-event.

Example: In the following table, both actual and predicted values are shown for a sample of seven rows. Actual is the true category (default or not), whereas predicted is the predicted probability from the logistic regression model. Calculate the concordance value.

To calculate concordance, we need to split the table into two (one table with actual values of 1 and the other with actual values of 0) and apply the Cartesian product of the rows of both tables to form pairs:

In the following table, the complete Cartesian product has been calculated, and each pair has been classified as a concordant pair whenever the predicted probability for the 1 category is higher than the predicted probability for the 0 category. If it is the other way around, the pair has been classified as a discordant pair. In the special case where both probabilities are the same, the pair is classified as tied.
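The pairing procedure can be sketched in Python as below. The function name `concordance` and the five-row sample are hypothetical (they are not the book's seven-row table), and concordance is computed here as concordant pairs divided by all pairs, with tied pairs reported separately.

```python
def concordance(actual, predicted):
    """Classify every (event, non-event) pair and return counts and concordance.

    Assumes actual contains at least one 1 and at least one 0.
    """
    event_probs = [p for a, p in zip(actual, predicted) if a == 1]
    non_event_probs = [p for a, p in zip(actual, predicted) if a == 0]

    concordant = discordant = tied = 0
    for pe in event_probs:              # Cartesian product of the two tables
        for pn in non_event_probs:
            if pe > pn:
                concordant += 1
            elif pe < pn:
                discordant += 1
            else:
                tied += 1

    total_pairs = len(event_probs) * len(non_event_probs)
    return {
        "concordant": concordant,
        "discordant": discordant,
        "tied": tied,
        "concordance": concordant / total_pairs,
    }

# Hypothetical sample, not the values from the book's table.
actual = [1, 0, 1, 0, 0]
predicted = [0.8, 0.3, 0.4, 0.4, 0.6]
result = concordance(actual, predicted)
print(result)
```

The nested loop costs one comparison per pair, which is fine for a small sample; large datasets usually rely on a rank-based computation instead.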

C-statistic: This is 0.83315, or 83.315 percent, and any value greater than 0.7, or 70 percent, is considered a good model to use for practical purposes.
Divergence: The distance between the average score of default accounts and the average score of non-default accounts. The greater the distance, the more effective the scoring system is at segregating good and bad observations.
K-S statistic: This is the maximum distance between the cumulative distributions of two populations. It helps with discriminating default accounts from non-default accounts.
Population stability index (PSI): This metric checks whether the current population on which the credit scoring model will be used has drifted from the population at development time:
- PSI <= 0.1: This states no change in the characteristics of the current population with respect to the development population
- 0.1 < PSI <= 0.25: This signifies that some change has taken place and warrants attention, but the model can still be used
- PSI > 0.25: This indicates a large shift in the score distribution of the current population compared with development time
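
The PSI bands above can be applied with a short sketch. The text does not state the PSI formula, so this assumes the commonly used form, PSI = sum over score bins of (current % - development %) * ln(current % / development %); the function names `psi` and `interpret_psi` and the bin counts are hypothetical.

```python
import math

def psi(development_counts, current_counts):
    """Population stability index from per-bin counts of the two populations.

    Assumes every bin is non-empty in both populations.
    """
    dev_total = sum(development_counts)
    cur_total = sum(current_counts)
    value = 0.0
    for d, c in zip(development_counts, current_counts):
        pct_dev = d / dev_total
        pct_cur = c / cur_total
        value += (pct_cur - pct_dev) * math.log(pct_cur / pct_dev)
    return value

def interpret_psi(value):
    """Map a PSI value to the three bands listed in the text."""
    if value <= 0.1:
        return "no change"
    if value <= 0.25:
        return "some change, model can still be used"
    return "large shift"

# Hypothetical counts per score bin at development time and today.
development = [100, 200, 300, 400]
current = [120, 210, 290, 380]
value = psi(development, current)
print(value, interpret_psi(value))
```

Because PSI works on bin shares rather than raw counts, the two populations do not need to be the same size.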
