
Cross-validation

Cross-validation (which you may hear some data scientists refer to as rotation estimation, or simply as a general technique for assessing models) is another method for assessing a model's performance (or its accuracy).

Used mainly in predictive modeling to estimate how accurately a model might perform in practice, cross-validation checks how well a model will potentially generalize; in other words, how the model will apply what it infers from samples to an entire population (or dataset).

With cross-validation, you identify a (known) dataset as your training dataset, on which the model is trained, along with a dataset of unknown (or first-seen) data against which the model will be tested (this is known as your testing or validation dataset). The objective is to ensure that problems such as overfitting (allowing peculiarities of the training sample to unduly influence results) are controlled, as well as to provide insight into how the model will generalize to a real problem or a real data file.

This process consists of separating the data into subsets of similar size, performing the analysis on one subset (called the training set), and validating the analysis on the other subset (called the validation set or testing set):

Separation → Analysis → Validation
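The separation step above can be sketched in plain Python. This is a minimal illustration, not a library implementation: the record list, the 80/20 split ratio, and the fixed seed are all arbitrary choices made here for demonstration.

```python
import random

def split_data(records, train_fraction=0.8, seed=42):
    """Shuffle the records and separate them into a training subset
    and a validation (testing) subset."""
    rng = random.Random(seed)       # fixed seed keeps the split repeatable
    shuffled = records[:]           # copy so the original order is preserved
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]   # (training set, validation set)

# Example: 100 numbered records split 80/20
train, validate = split_data(list(range(100)))
print(len(train), len(validate))
```

In practice a library routine (for example, scikit-learn's `train_test_split`) would be used instead, but the idea is the same: the model never sees the held-out validation subset during training.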

To reduce variability, multiple iterations (also called folds or rounds) of cross-validation are performed using different partitions, and the validation results are averaged over the rounds. Typically, a data scientist will use a model's stability to determine the actual number of rounds of cross-validation that should be performed.
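The multiple-round (k-fold) idea described above can be sketched as follows. The scoring function shown is a toy stand-in invented for this example; in a real workflow it would train and evaluate an actual model on each round's partition.

```python
import random

def k_fold_scores(records, k, score_fn, seed=42):
    """Run k rounds (folds) of cross-validation: each round holds out
    one subset for validation, uses the rest for training, and the
    per-round scores are averaged at the end."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]   # k similar-sized subsets
    scores = []
    for i in range(k):
        validation = folds[i]
        training = [r for j, fold in enumerate(folds) if j != i for r in fold]
        scores.append(score_fn(training, validation))
    return sum(scores) / k          # average the results over the rounds

# Toy score function (illustration only): fraction of validation
# items that fall below the training mean
def toy_score(training, validation):
    mean = sum(training) / len(training)
    return sum(1 for v in validation if v < mean) / len(validation)

print(k_fold_scores(list(range(100)), k=5, score_fn=toy_score))
```

Because every record serves as validation data in exactly one round, the averaged score uses all of the data while still never scoring a record with a model that was trained on it.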

Again, the cross-validation method can perhaps be better understood by thinking about selecting a subset of data and manually calculating the results. Once you know the correct results, they can be compared to the model-produced results (using a separate subset of data). This is one round. Multiple rounds would be performed and the compared results averaged and reviewed, eventually providing a fair estimate of a model's prediction performance.

Suppose a university provides data on its student body over time. The students are described by various characteristics, such as whether their high-school GPA is above or below 3.0, whether a family member graduated from the school, whether the student was active in non-program activities, was a resident (lived on campus), was a student athlete, and so on. Our predictive model aims to identify the characteristics of students who graduate early.
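A sketch of how such student records might be evaluated round by round is shown below. Everything here is hypothetical: the data is randomly generated, the "model" is a trivial rule (predict early graduation when the GPA flag is set), and the synthetic link between GPA and the outcome exists only so the example produces non-trivial accuracies.

```python
import random

rng = random.Random(7)

def make_student():
    """One hypothetical student record: binary characteristics plus a
    graduated-early label loosely tied to GPA (synthetic data)."""
    gpa_over_3 = rng.randint(0, 1)
    features = {
        "gpa_over_3": gpa_over_3,
        "family_graduate": rng.randint(0, 1),
        "resident": rng.randint(0, 1),
        "athlete": rng.randint(0, 1),
    }
    graduated_early = 1 if (gpa_over_3 and rng.random() < 0.8) or rng.random() < 0.15 else 0
    return features, graduated_early

def predict(features):
    """Trivial rule-based model: predict early graduation on GPA alone."""
    return features["gpa_over_3"]

# Five rounds: score the model on each held-out fifth of the data,
# then average the per-round accuracies
students = [make_student() for _ in range(500)]
folds = [students[i::5] for i in range(5)]
round_accuracies = []
for fold in folds:
    correct = sum(1 for f, label in fold if predict(f) == label)
    round_accuracies.append(correct / len(fold))

print(round_accuracies)
print("expected accuracy:", sum(round_accuracies) / 5)
```

The averaged figure plays the same role as the expected-accuracy result in the table that follows.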

The following table is a representation of the results of using a five-round cross-validation process to predict our model's expected accuracy:

[Table: Cross-validation — per-round accuracy results from the five-round process]

Given the preceding figures, I'd say our predictive model is expected to be very accurate!

In summary, cross-validation combines (averages) measures of fit (prediction error) to derive a more accurate estimate of model prediction performance. This method is typically used in cases where there is not enough data available to test without losing significant modeling or testing quality.
