- Hands-On Machine Learning with Microsoft Excel 2019
- Julio Cesar Rodriguez Martino
- 2021-06-24 15:10:58
Comparing underfitting and overfitting
In the preceding list, step 4 implies an iterative process where we try models, parameters, and features until we get the best result that we can. Let's now think about a classification problem, where we want to separate squares from circles, as shown in the following diagram. At the beginning of the process, we will probably be in a situation that is similar to the first chart (on the left-hand side). The model fails to separate the two shapes effectively, and each side of the border contains a mixture of squares and circles. This is called underfitting and refers to a model that fails to represent the characteristics of the dataset:

As we continue tuning parameters and adjusting the model to the training dataset, we might find ourselves in a situation that is similar to the third chart (on the right-hand side). The model accurately splits the dataset, leaving only one shape on each side of the border line. Even if this seems correct, it completely lacks generalization. The result adjusts so well to the training data that it is likely to perform poorly when we test it against a different dataset. This problem is called overfitting.
To solve the problem of overfitting, we need to improve the model's ability to generalize, which usually means limiting how closely it can adapt to the training data. However, constraining it too much can push it back toward underfitting, which also makes it bad at predicting. The usual way to manage this trade-off is to use regularization techniques. Many such techniques can be found in specialized literature, but they are beyond the scope of this book.
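As a rough illustration of what a regularization technique does, the following minimal Python sketch uses L2 (ridge) regularization on a one-feature linear model with no intercept. The choice of ridge, the function name ridge_weight, and the data values are assumptions made for this example; the text does not name a specific technique. The sketch only shows that a larger penalty pulls the fitted weight toward zero, which limits how closely the model can follow the training data.

```python
def ridge_weight(xs, ys, penalty):
    """Closed-form weight for one-feature ridge regression without intercept.

    Minimizes sum((y - w * x) ** 2) + penalty * w ** 2 over w.
    (Hypothetical helper, written for this illustration only.)
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + penalty)


# Made-up illustrative data, roughly following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

penalties = (0.0, 10.0, 100.0)
weights = [ridge_weight(xs, ys, penalty) for penalty in penalties]

for penalty, w in zip(penalties, weights):
    print(f"penalty={penalty:6.1f}  weight={w:.3f}")
```

With a penalty of zero the result is the ordinary least-squares weight. As the penalty grows, the weight shrinks, so a very large penalty would move the model toward underfitting.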
The center chart shows a better-balanced model: it represents the dataset but is general enough to deal with new, previously unseen data. Finding this balance is often difficult and time-consuming, but it is necessary in order to build a good machine learning model.
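The three situations described above can be sketched in Python by comparing accuracy on the training data with accuracy on a separate test set. This is only an illustration under stated assumptions: the k-nearest-neighbors classifier, the synthetic squares-and-circles data, the 20% label noise, and the helper names (make_dataset, knn_predict, accuracy) are all choices made for this example and do not come from the book.

```python
import random


def make_dataset(n, rng, noise=0.2):
    """Generate n labeled 2-D points: 'square' left of x = 0, 'circle' right.

    A fraction `noise` of the labels is flipped so that the classes overlap.
    """
    points = []
    for _ in range(n):
        x, y = rng.uniform(-1, 1), rng.uniform(-1, 1)
        label = "circle" if x > 0 else "square"
        if rng.random() < noise:
            label = "square" if label == "circle" else "circle"
        points.append(((x, y), label))
    return points


def knn_predict(train, point, k):
    """Classify `point` by majority vote among its k nearest training points."""
    by_distance = sorted(
        train,
        key=lambda item: (item[0][0] - point[0]) ** 2 + (item[0][1] - point[1]) ** 2,
    )
    votes = [label for _, label in by_distance[:k]]
    # Sorting the candidate labels makes tie-breaking deterministic.
    return max(sorted(set(votes)), key=votes.count)


def accuracy(train, data, k):
    """Fraction of points in `data` that the k-NN model labels correctly."""
    hits = sum(knn_predict(train, p, k) == label for p, label in data)
    return hits / len(data)


rng = random.Random(0)
train = make_dataset(100, rng)
test = make_dataset(100, rng)

results = {}
for k in (1, 15, len(train)):
    results[k] = (accuracy(train, train, k), accuracy(train, test, k))
    print(f"k={k:3d}  train accuracy={results[k][0]:.2f}  test accuracy={results[k][1]:.2f}")
```

With k = 1 the model memorizes the training set, so its training accuracy is perfect by construction, yet it also memorizes the flipped labels, which is the overfitting case. With k equal to the size of the training set, every point receives the majority label, which corresponds to underfitting. An intermediate k would usually be expected to sit between the two, like the center chart, although the exact figures depend on the data and the random seed.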