
Cross-validation

Cross-validation (which you may hear some data scientists refer to as rotation estimation) is another general method for assessing a model's performance (or its accuracy).

Mainly used in predictive modeling to estimate how accurately a model will perform in practice, cross-validation is used to check how well a model generalizes; in other words, how well the model applies what it has inferred from samples to an entire population (or dataset).

With cross-validation, you identify a (known) dataset on which training is run (your training dataset), along with a dataset of unknown (or first-seen) data against which the model will be tested (known as your testing dataset). The objective is to ensure that problems such as overfitting (allowing noise in the training data to influence the results) are kept under control, as well as to provide insight into how the model will generalize to a real problem or a real data file.

This process consists of separating the data into complementary subsets, performing the analysis on one subset (called the training set), and validating the analysis on the other subset (called the validation set or testing set):

Separation → Analysis → Validation
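As a minimal sketch of one round of this separate-analyze-validate flow, the following Python snippet trains on one subset and validates on the held-out subset. The synthetic dataset, the logistic regression model, and the 25% holdout size are illustrative assumptions, not prescriptions from the text:

```python
# A minimal, hypothetical sketch of one round: separate the data, analyze
# (fit) on the training subset, and validate on the held-out subset.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(seed=0)
X = rng.normal(size=(200, 4))                # 200 illustrative observations, 4 features
y = (X[:, 0] + X[:, 1] > 0).astype(int)      # illustrative binary outcome

# Separation: hold out 25% of the data as the validation (testing) set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Analysis: fit the model on the training set only
model = LogisticRegression().fit(X_train, y_train)

# Validation: score the model on data it has never seen
print("Validation accuracy:", accuracy_score(y_test, model.predict(X_test)))
```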

To reduce variability, multiple iterations (also called folds or rounds) of cross-validation are performed using different partitions, and the validation results are averaged over the rounds. Typically, a data scientist will use a model's stability to determine the actual number of rounds of cross-validation that should be performed.
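As a sketch of how multiple rounds might be run and their results averaged, the snippet below uses scikit-learn's KFold and cross_val_score; the synthetic data and the choice of five folds are assumptions made purely for illustration:

```python
# A hypothetical sketch of multiple rounds (folds) with the results averaged.
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(seed=0)
X = rng.normal(size=(200, 4))                # illustrative data, as before
y = (X[:, 0] + X[:, 1] > 0).astype(int)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)   # 5 different partitions
scores = cross_val_score(LogisticRegression(), X, y, cv=kfold)

for round_number, score in enumerate(scores, start=1):
    print(f"Round {round_number}: accuracy = {score:.3f}")
print(f"Averaged estimate of prediction performance: {scores.mean():.3f}")
```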

Again, the cross-validation method can perhaps be better understood by thinking about selecting a subset of the data and calculating the results manually. Once you know the correct results, they can be compared to the results the model produces (using a separate subset of the data). This is one round. Multiple rounds are performed, and the compared results are averaged and reviewed, eventually providing a fair estimate of the model's prediction performance.

Suppose a university provides data on its student body over time. The students are described by various characteristics, such as whether their high school GPA is above or below 3.0, whether they have a family member who graduated from the school, whether they were active in non-program activities, whether they were residents (lived on campus), whether they were student athletes, and so on. Our predictive model aims to predict which characteristics are shared by students who graduate early.
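A hypothetical sketch of how such a model could be cross-validated over five rounds follows; the feature names, the simulated values, and the decision-tree classifier are illustrative assumptions, not the university's actual data or a prescribed model:

```python
# Hypothetical student data and a five-round cross-validation of a simple model.
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(seed=1)
n_students = 500
students = pd.DataFrame({
    "gpa_above_3": rng.integers(0, 2, n_students),        # high school GPA > 3.0
    "legacy_family": rng.integers(0, 2, n_students),      # family member graduated here
    "non_program_activities": rng.integers(0, 2, n_students),
    "resident": rng.integers(0, 2, n_students),           # lived on campus
    "athlete": rng.integers(0, 2, n_students),
})
# Illustrative target: did the student graduate early? (simulated relationship)
graduated_early = ((students["gpa_above_3"] & students["resident"])
                   | (rng.random(n_students) < 0.1)).astype(int)

scores = cross_val_score(DecisionTreeClassifier(max_depth=3),
                         students, graduated_early, cv=5)
print("Accuracy per round:", np.round(scores, 3))
print("Averaged expected accuracy:", round(scores.mean(), 3))
```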

The following table represents the results of a five-round cross-validation process used to estimate our model's expected accuracy:

[Table: Cross-validation]

Given the preceding figures, I'd say our predictive model is expected to be very accurate!

In summary, cross-validation combines (averages) measures of fit (prediction error) to derive a more accurate estimate of model prediction performance. This method is typically used in cases where there is not enough data available to hold out a separate test set without losing significant modeling or testing quality.
