Holdout sample

When working with a training dataset, a small portion of the data is set aside for testing the performance of the models. Because this portion is unseen data (it is not used in training), the measurements obtained on it can be relied on. These measurements can be used to tune the parameters of the model, or simply to report the model's performance and so set expectations about the level of performance the model can deliver.
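The idea can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not a prescribed implementation: the function name `holdout_split`, its parameters, and the toy dataset are all assumptions made for the example.

```python
import random


def holdout_split(rows, holdout_fraction=0.2, seed=0):
    """Shuffle a copy of `rows` and set aside a fraction as the holdout sample.

    Hypothetical helper for illustration; returns (train, holdout).
    """
    if not 0.0 < holdout_fraction < 1.0:
        raise ValueError("holdout_fraction must be strictly between 0 and 1")
    shuffled = list(rows)                      # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)      # seeded for a reproducible split
    n_holdout = max(1, round(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]


# Toy dataset of 100 records, assumed purely for the example.
data = list(range(100))
train, holdout = holdout_split(data, holdout_fraction=0.2, seed=42)
print(len(train), len(holdout))  # 80 20
```

The model is fitted on `train` only; `holdout` is touched only when measuring performance.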

Note that a performance measurement based on a holdout sample is not as robust as a k-fold cross-validation estimate, which averages over several splits rather than relying on one. A single random split of the holdout set from the original dataset may introduce unknown biases. There is also no guarantee that the holdout dataset contains every class present in the training dataset. If all classes must be represented in the holdout dataset, a technique called stratified holdout sampling needs to be applied: the split is made within each class, so every class appears in the holdout dataset. A performance measurement obtained from a stratified holdout sample is therefore generally a better estimate than one obtained from a nonstratified holdout sample.
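A stratified holdout can be sketched by splitting each class separately and then combining the pieces. The code below is a minimal standard-library illustration under stated assumptions: `stratified_holdout_split` is a hypothetical helper, and the imbalanced label list is made up for the example.

```python
import random
from collections import Counter, defaultdict


def stratified_holdout_split(labels, holdout_fraction=0.2, seed=0):
    """Return (train_idx, holdout_idx) with every class present in the holdout.

    Hypothetical helper for illustration. Indices are grouped by class and
    each group is split separately. A class with a single record ends up
    entirely in the holdout set, because each class contributes at least one.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)                  # seeded for a reproducible split
    train_idx, holdout_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_holdout = max(1, round(len(indices) * holdout_fraction))
        holdout_idx.extend(indices[:n_holdout])
        train_idx.extend(indices[n_holdout:])
    return train_idx, holdout_idx


# Imbalanced toy labels, assumed for the example: 90 of class "a", 10 of "b".
labels = ["a"] * 90 + ["b"] * 10
train_idx, holdout_idx = stratified_holdout_split(labels, holdout_fraction=0.2, seed=7)
holdout_counts = Counter(labels[i] for i in holdout_idx)
print(holdout_counts["a"], holdout_counts["b"])  # 18 2
```

With a plain random split of the same data, the rare class "b" could by chance be missing from the holdout set; splitting per class rules that out and keeps the class proportions close to those of the full dataset.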

The training data-holdout data splits generally seen in ML projects are 70%-30%, 80%-20%, and 90%-10%.