
General machine learning rule of thumb

The general machine learning rule of thumb is that the more data there is, the better the predictive model. More features, however, do not help in the same way: with too many of them, performance can degrade drastically, especially when the dataset is high-dimensional. The learning process requires input data that can be split into three sets (or that is already provided as such):

  • A training set is the knowledge base, drawn from historical or live data, that is used to fit the parameters of the ML algorithm. During the training phase, the ML model uses the training set to find the weights (of a network, for example) that minimize the objective function, that is, the training error. Backpropagation or another optimization algorithm is used to train the model, but all hyperparameters must be set before the learning process starts.
  • A validation set is a set of examples used to tune the hyperparameters of an ML model. It is used to check that the model is trained well and generalizes beyond the training data, which helps to avoid overfitting. Some ML practitioners also refer to it as a development set or dev set.
  • A test set is used for evaluating the performance of the trained model on unseen data. This step is also referred to as model inferencing. After assessing the final model on the test set (that is, when we're fully satisfied with the model's performance), we should not tune the model any further; the trained model can then be deployed in a production-ready environment.
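The roles of the three sets can be illustrated with a minimal sketch. It assumes a one-dimensional ridge regression as a stand-in for a real model (no network or backpropagation is involved), and the toy data, the candidate penalties, and the names `fit_weight` and `mean_squared_error` are hypothetical, chosen only for illustration. Parameters are fitted on the training set, the hyperparameter is chosen on the validation set, and the test set is touched exactly once at the end.

```python
def fit_weight(train_rows, l2_penalty):
    """Closed-form fit of y ~ w * x with an L2 penalty (1-D ridge regression)."""
    sum_xy = sum(x * y for x, y in train_rows)
    sum_xx = sum(x * x for x, _ in train_rows)
    return sum_xy / (sum_xx + l2_penalty)


def mean_squared_error(weight, rows):
    """Average squared prediction error of y ~ weight * x on the given rows."""
    return sum((y - weight * x) ** 2 for x, y in rows) / len(rows)


# Hypothetical toy data in which y is roughly 2 * x.
train_set = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]
validation_set = [(1.5, 3.0), (3.5, 7.1)]
test_set = [(2.5, 5.0), (4.5, 9.1)]

# Hyperparameter candidates are fixed before any training starts.
candidate_penalties = [0.0, 0.1, 1.0, 10.0]

# Fit on the training set, compare candidates on the validation set.
best_penalty = min(
    candidate_penalties,
    key=lambda penalty: mean_squared_error(
        fit_weight(train_set, penalty), validation_set
    ),
)

# Evaluate the chosen model once on the test set; no tuning after this point.
final_weight = fit_weight(train_set, best_penalty)
test_error = mean_squared_error(final_weight, test_set)
print(best_penalty, final_weight, test_error)
```

Because the test set plays no part in choosing `best_penalty`, `test_error` remains an estimate of performance on unseen data.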

A common practice is splitting the input data (after the necessary pre-processing and feature engineering) into 60% for training, 20% for validation, and 20% for testing, but the right proportions really depend on the use case. Sometimes, we also need to up-sample or down-sample the data, depending on the availability and quality of the datasets.
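A split and a simple form of up-sampling might be sketched as follows, using only the Python standard library. The function names `split_dataset` and `upsample_minority`, the default fractions, and the fixed seed are assumptions made for illustration; random over-sampling by duplication is only one of several resampling options, and in this sketch it is applied to the training portion only so that duplicated rows do not leak into the validation or test sets.

```python
import random


def split_dataset(rows, train_frac=0.6, val_frac=0.2, seed=42):
    """Shuffle rows and split them into train, validation, and test lists.

    Whatever remains after the train and validation shares becomes the test set.
    The default fractions are illustrative, not prescriptive.
    """
    if train_frac <= 0 or val_frac <= 0 or train_frac + val_frac >= 1:
        raise ValueError("fractions must be positive and leave room for a test set")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # shuffle a copy, leave the input intact
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


def upsample_minority(rows, label_of, seed=42):
    """Random over-sampling: duplicate rows of smaller classes until every
    class has as many rows as the largest one."""
    if not rows:
        return []
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(label_of(row), []).append(row)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Draw the missing rows with replacement from the same class.
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


def is_rare(value):
    """Hypothetical labeling rule: multiples of 10 form the rare class."""
    return value % 10 == 0


data = list(range(100))
train, val, test = split_dataset(data)
balanced_train = upsample_minority(train, label_of=is_rare)
print(len(train), len(val), len(test), len(balanced_train))
```

Fixing the seed makes the split reproducible, which helps when comparing models trained at different times on the same data.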

How the data is split and used for learning can differ across machine learning tasks, as we will cover in the next section. Before that, let's take a quick look at a few common phenomena in machine learning.
