
Discriminant analysis overview

Discriminant Analysis (DA), also known as Fisher Discriminant Analysis (FDA), is another popular classification technique. It can be an effective alternative to logistic regression when the classes are well-separated. In that situation, logistic regression can produce unstable estimates, which is to say that the confidence intervals are wide and the estimates themselves are likely to vary from one sample to another (James, 2013). DA does not suffer from this problem and, as a result, may outperform logistic regression and generalize better. Conversely, if the relationships between the features and the outcome variable are complex, DA may perform poorly on a classification task. For our breast cancer example, logistic regression performed well on the training and testing sets, and the classes were not well-separated. For the purpose of comparison with logistic regression, we will explore DA, both Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA).

DA utilizes Bayes' theorem to determine the probability of class membership for each observation. If you have two classes, for example, benign and malignant, then DA will calculate an observation's probability for each class and assign the observation to the class with the higher probability.

Bayes' theorem states that the probability of Y occurring, given that X has occurred, is equal to the probability of both Y and X occurring, divided by the probability of X occurring, which can be written as follows:

$$P(Y \mid X) = \frac{P(Y \cap X)}{P(X)}$$

The numerator in this expression is the likelihood that an observation is from that class level and has these feature values. The denominator is the likelihood of an observation with these feature values across all the class levels. Again, the classification rule says that if you have the joint distribution of X and Y, and X is given, the optimal decision is to assign an observation to the class with the larger probability (the posterior probability).
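As a quick illustration of this rule, the short sketch below plugs made-up prior and likelihood values for a hypothetical benign-versus-malignant problem into Bayes' theorem; all of the numbers are invented purely for demonstration.

```python
# Hypothetical two-class example of applying Bayes' theorem directly.
# Assume 30% of observations are malignant and 70% benign (priors), and
# a particular feature value x is seen in 80% of malignant cases but
# only 10% of benign cases (class-conditional likelihoods).
p_malignant, p_benign = 0.30, 0.70
p_x_given_malignant, p_x_given_benign = 0.80, 0.10

# P(X): total probability of observing x across both classes
p_x = p_malignant * p_x_given_malignant + p_benign * p_x_given_benign

# Posterior probabilities P(class | x) via Bayes' theorem
post_malignant = p_malignant * p_x_given_malignant / p_x
post_benign = p_benign * p_x_given_benign / p_x

# Roughly 0.77 versus 0.23, so the observation is classified as malignant
print(post_malignant, post_benign)
```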

The process of attaining posterior probabilities goes through the following steps:

  1. Collect data with a known class membership.
  2. Calculate the prior probabilities; this represents the proportion of the sample that belongs to each class.
  3. Calculate the mean for each feature by their class.
  4. Calculate the variance-covariance matrix of the features; for LDA, this is a pooled matrix across all the classes, giving us a linear classifier, whereas for QDA, a separate variance-covariance matrix is estimated for each class.
  5. Estimate the normal distribution (Gaussian densities) for each class.
  6. Compute the discriminant function, which is the rule for classifying a new observation.
  7. Assign an observation to a class based on the discriminant function.
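To make these steps concrete, here is a minimal NumPy sketch of the LDA version of the process; it is not the implementation used elsewhere in this book, and the function name lda_posteriors is made up for illustration.

```python
import numpy as np

def lda_posteriors(X, y, X_new):
    """Follow the steps above: priors, class means, a pooled covariance
    matrix, Gaussian densities, and posterior probabilities."""
    classes = np.unique(y)
    n, p = X.shape

    # Step 2: prior probabilities (class proportions in the sample)
    priors = np.array([np.mean(y == k) for k in classes])

    # Step 3: mean of each feature by class
    means = np.array([X[y == k].mean(axis=0) for k in classes])

    # Step 4: pooled variance-covariance matrix (the LDA assumption)
    pooled = np.zeros((p, p))
    for k, mu in zip(classes, means):
        diff = X[y == k] - mu
        pooled += diff.T @ diff
    pooled /= (n - len(classes))
    pooled_inv = np.linalg.inv(pooled)

    # Steps 5-6: Gaussian densities for the new observations under each class
    const = 1.0 / np.sqrt((2 * np.pi) ** p * np.linalg.det(pooled))
    densities = np.array([
        const * np.exp(-0.5 * np.sum((X_new - mu) @ pooled_inv * (X_new - mu), axis=1))
        for mu in means
    ]).T  # shape: (n_new, n_classes)

    # Bayes' theorem: posterior = prior * density / sum over all classes
    joint = densities * priors
    posteriors = joint / joint.sum(axis=1, keepdims=True)

    # Step 7: assign each observation to the class with the highest posterior
    predictions = classes[np.argmax(posteriors, axis=1)]
    return posteriors, predictions
```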

Putting these steps together gives an expanded notation for the posterior probability that an observation belongs to class k, as follows:

$$P(Y = k \mid X = x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)}$$

Here, $\pi_k$ is the prior probability for class k (step 2) and $f_k(x)$ is the estimated Gaussian density of the features for class k (steps 3 to 5).

Even though LDA is elegantly simple, it is limited by the assumption that the observations of each class follow a multivariate normal distribution and that there is a common covariance across the classes. QDA still assumes that the observations come from a normal distribution, but it also assumes that each class has its own covariance.
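For readers who want to see this distinction in practice, the following sketch fits both LDA and QDA using scikit-learn on its bundled breast cancer data; it is only an illustration of the differing covariance assumptions, not the analysis performed in this book, and your accuracy figures may differ.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.model_selection import train_test_split

# Load a breast cancer dataset (not necessarily the same data as the
# book's example) and split it into training and test sets.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# LDA pools one covariance matrix across classes (linear boundary);
# QDA estimates a separate covariance per class (quadratic boundary).
lda = LinearDiscriminantAnalysis().fit(X_train, y_train)
qda = QuadraticDiscriminantAnalysis().fit(X_train, y_train)

print("LDA test accuracy:", lda.score(X_test, y_test))
print("QDA test accuracy:", qda.score(X_test, y_test))
```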

Why does this matter? When you relax the common covariance assumption, you allow quadratic terms into the discriminant score calculations, which is not possible with LDA. The mathematics behind this can be a bit intimidating and is outside the scope of this book. The important part to remember is that QDA is a more flexible technique than logistic regression, but we must keep in mind our bias-variance trade-off. With a more flexible technique, you are likely to have a lower bias but potentially a higher variance. As with many flexible techniques, a robust set of training data is needed to mitigate high classifier variance.
