官术网_书友最值得收藏!

Train and test sets

To estimate the generalization error, we split our data into two parts: training data and testing data. A general rule of thumb is to split them by the training: testing ratio, that is, 70:30. We first train the predictor on the training data, then predict the values for the test data, and finally, compute the error, that is, the difference between the predicted and the true values. This gives us an estimate of the true generalization error.

The estimation is based on the two following assumptions: first, we assume that the test set is an unbiased sample from our dataset; and second, we assume that the actual new data will reassemble the distribution as our training and testing examples. The first assumption can be mitigated by cross-validation and stratification. Also, if it is scarce, one can't afford to leave out a considerable amount of data for a separate test set, as learning algorithms do not perform well if they don't receive enough data. In such cases, cross-validation is used instead.

主站蜘蛛池模板: 苍溪县| 金门县| 屯昌县| 大荔县| 罗甸县| 宁南县| 湖口县| 张家川| 红安县| 敦化市| 南昌县| 镇安县| 康保县| 汤阴县| 巩留县| 青阳县| 时尚| 德清县| 武山县| 漳州市| 临沂市| 安岳县| 出国| 农安县| 牡丹江市| 清流县| 梓潼县| 六盘水市| 霍城县| 稷山县| 勃利县| 元氏县| 宁蒗| 内江市| 金门县| 唐海县| 额尔古纳市| 张家界市| 潜江市| 当阳市| 汝南县|