  • R Machine Learning Projects
  • Dr. Sunil Kumar Chinnamgari
  • 414 words
  • 2021-07-02 14:23:07

Class imbalance problem

Suppose one needs to build a classifier that identifies cat and dog images. The problem has two classes, namely cat and dog. Training a classification model requires training data. In this case, the training data consists of labeled images of dogs and cats, given as input so that a supervised learning model can learn the features that distinguish dogs from cats.

Suppose the training dataset contains 100 images, of which 95 are dog pictures and only five are cat pictures. This kind of unequal representation of classes in a training dataset is termed a class imbalance problem.

Most ML techniques work best when the number of examples in each class is roughly equal. Certain techniques can be employed to counter class imbalance in data. One technique, known as undersampling, is to reduce the number of majority class samples (images of dogs) until it equals the number of minority class samples (images of cats). This causes information loss, as many of the dog images go unused. Another option, known as oversampling, is to generate synthetic data similar to the minority class data (images of cats) until the number of minority samples equals that of the majority class. Synthetic minority over-sampling technique (SMOTE) is a very popular technique for generating such synthetic data.
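The two resampling options can be sketched in a few lines of plain Python. This is a minimal illustration, not the book's code: the function names `undersample` and `smote_like_oversample` are hypothetical, the two-dimensional feature vectors are made-up stand-ins for image features, and the oversampler is a simplification of SMOTE (real SMOTE interpolates toward one of the k nearest minority neighbours, whereas this sketch picks any other minority sample).

```python
import random


def undersample(majority, minority, seed=0):
    """Randomly drop majority samples until both classes are the same size."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)), list(minority)


def smote_like_oversample(minority, target_size, seed=0):
    """Add synthetic points by interpolating between random minority pairs.

    Simplification (assumption): the partner is any other minority sample,
    not one of the k nearest neighbours as in real SMOTE.
    """
    rng = random.Random(seed)
    synthetic = []
    while len(minority) + len(synthetic) < target_size:
        a, b = rng.sample(minority, 2)
        t = rng.random()  # position along the segment from a to b
        synthetic.append([x + t * (y - x) for x, y in zip(a, b)])
    return list(minority) + synthetic


# Toy two-dimensional feature vectors standing in for image features.
data_rng = random.Random(42)
dogs = [[data_rng.gauss(0.0, 1.0), data_rng.gauss(0.0, 1.0)] for _ in range(95)]
cats = [[data_rng.gauss(3.0, 1.0), data_rng.gauss(3.0, 1.0)] for _ in range(5)]

dogs_small, cats_same = undersample(dogs, cats)
cats_big = smote_like_oversample(cats, target_size=len(dogs))

print(len(dogs_small), len(cats_same))  # 5 5
print(len(dogs), len(cats_big))         # 95 95
```

Undersampling keeps only five of the 95 dog samples, which makes the information loss concrete. The oversampler keeps every original sample and adds 90 synthetic cat points that lie on line segments between existing cat points.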

Note that accuracy is not a good metric for evaluating models trained on class-imbalanced data. Consider a model built on such a dataset that predicts the majority class for every test sample. If roughly 95% of the images in the test dataset are dog images, this model achieves about 95% accuracy. That figure is misleading, because the model has no discriminative power: it simply predicts dog for every image. The high accuracy suggests a great model even though it never identifies a single cat.
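The arithmetic behind this can be checked directly. The sketch below assumes a test set of 95 dog images and five cat images, mirroring the training example, and a "model" that always answers dog. The per-class recall for cats is included only to show what the accuracy figure hides.

```python
# Assumed test set: 95 dog images and 5 cat images.
y_true = ["dog"] * 95 + ["cat"] * 5
# A "model" with no discriminative power: it always answers "dog".
y_pred = ["dog"] * len(y_true)

# Accuracy: fraction of all images labeled correctly.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall for the cat class: fraction of actual cats the model found.
cats_found = sum(t == "cat" and p == "cat" for t, p in zip(y_true, y_pred))
cat_recall = cats_found / y_true.count("cat")

print(accuracy)    # 0.95
print(cat_recall)  # 0.0
```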

Several other performance metrics are better suited to situations where class imbalance is a problem. The F1 score and the area under the receiver operating characteristic curve (AUC-ROC) are two of the popular ones.
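As a rough illustration of how these two metrics behave, the sketch below computes both from scratch, treating cat as the positive class. The helper names `f1_score` and `auc_roc` are hypothetical, and the second set of predictions (`better_pred`) is invented purely for comparison. AUC-ROC is computed through its pairwise interpretation: the fraction of (cat, dog) pairs in which the cat image receives the higher score, with ties counted as half.

```python
def f1_score(y_true, y_pred, positive):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # common convention when there are no true positives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc_roc(y_true, scores, positive):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = ["dog"] * 95 + ["cat"] * 5

# Model A: always predicts dog, with the same cat score for every image.
always_dog = ["dog"] * 100
constant_scores = [0.0] * 100

# Model B (invented for illustration): 2 false alarms, finds 4 of 5 cats.
better_pred = ["cat"] * 2 + ["dog"] * 93 + ["cat"] * 4 + ["dog"] * 1

print(f1_score(y_true, always_dog, "cat"))             # 0.0
print(auc_roc(y_true, constant_scores, "cat"))         # 0.5
print(round(f1_score(y_true, better_pred, "cat"), 3))  # 0.727
```

The always-dog model scores 95% on accuracy but 0.0 on F1 for the cat class and 0.5 on AUC-ROC, which is the value expected from a model that cannot rank cats above dogs at all.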
