
Underfitting and overfitting

The purpose of a machine learning model is to approximate an unknown function that associates input elements with output ones (for a classifier, we call them classes). However, a training set is normally only a finite representation of a global distribution: it cannot contain all possible elements, otherwise the problem could be solved with a one-to-one lookup. In the same way, we don't know the analytic expression of the underlying function; therefore, when training, it's necessary to fit the model while keeping it free to generalize when an unknown input is presented. Unfortunately, this ideal condition is not always easy to achieve, and it's important to consider two different dangers:

  • Underfitting: The model isn't able to capture the dynamics shown by the training set itself (probably because its capacity is too limited).
  • Overfitting: The model has an excessive capacity and is no longer able to generalize beyond the original dynamics provided by the training set. It can associate almost all the known samples with the corresponding output values, but when an unknown input is presented, the corresponding prediction error can be very high.

In the following picture, there are examples of interpolation with low capacity (underfitting), normal capacity (normal fitting), and excessive capacity (overfitting):

It's very important to avoid both underfitting and overfitting. Underfitting is easier to detect by looking at the prediction error on the training set, while overfitting may prove more difficult to discover, as it could initially be mistaken for a perfect fit.
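The three regimes can be reproduced with a quick sketch. The dataset below is hypothetical (noisy samples of a quadratic), and the polynomial degrees are illustrative choices: degree 1 lacks the capacity to bend with the data (underfitting), degree 2 matches the generating function, and degree 9 has far more capacity than the data justifies (overfitting):

```python
import numpy as np

# Hypothetical dataset: noisy samples of an unknown quadratic function
rng = np.random.default_rng(0)
x_train = np.linspace(-3.0, 3.0, 20)
y_train = 0.5 * x_train**2 - x_train + rng.normal(0.0, 0.5, x_train.shape)

# Fresh samples from the same distribution, never seen during fitting
x_test = np.linspace(-2.9, 2.9, 20)
y_test = 0.5 * x_test**2 - x_test + rng.normal(0.0, 0.5, x_test.shape)

# Degree 1 underfits, degree 2 has adequate capacity, degree 9 overfits:
# increasing the degree always lowers the training error, while the error
# on unseen samples tends to grow once the capacity becomes excessive
for degree in (1, 2, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree={degree}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```

Note that the training error alone is monotonically decreasing with capacity, which is exactly why overfitting can masquerade as a perfect fit.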

Cross-validation and other techniques that we're going to discuss in the next chapters can easily show how our model behaves with test samples never seen during the training phase. In this way, it's possible to assess the generalization ability in a broader context (remember that we're never working with all possible values, but always with a subset that should reflect the original distribution).
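As a minimal preview of this idea, the sketch below contrasts the training score of an unconstrained decision tree with its cross-validated score. The dataset and estimator are illustrative assumptions, not a fixed recipe:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical synthetic classification dataset
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# An unconstrained tree can memorize the training set almost perfectly...
train_score = DecisionTreeClassifier(random_state=0).fit(X, y).score(X, y)
print("Training accuracy:", train_score)

# ...but cross-validation evaluates each fold on samples that were
# never used for fitting, revealing the true generalization ability
cv_scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print("5-fold cross-validation accuracy:", cv_scores.mean())
```

The gap between the two scores is the signature of overfitting that a training-set evaluation alone would hide.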

However, a generic rule of thumb says that a residual error is always necessary to guarantee a good generalization ability: a model that shows an accuracy of 99.999... percent on the training samples is almost surely overfitted and will likely be unable to predict correctly when never-before-seen input samples are provided.
