官术网_书友最值得收藏!

How it works...

We start by reading in our dataset, consisting of historical and continuing missile experiments in North Korea. We aim to predict the type of missile based on remaining features, such as facility and time of launch. This concludes step 1. In step 2, we apply scikit-learn's train_test_split method to subdivide X and y into a training set, X_train and y_train, and also a testing set, X_test and y_test. The test_size = 0.2 parameter means that the testing set consists of 20% of the original data, while the remainder is placed in the training set. The random_state parameter allows us to reproduce the same randomly generated split. Next, concerning step 3, it is important to note that, in applications, we often want to compare several different models. The danger of using the testing set to select the best model is that we may end up overfitting the testing set. This is similar to the statistical sin of data fishing. In order to combat this danger, we create an additional dataset, called the validation set. We train our models on the training set, use the validation set to compare them, and finally use the testing set to obtain an accurate indicator of the performance of the model we have chosen. So, in step 3, we choose our parameters so that, mathematically speaking, the end result consists of a training set of 60% of the original dataset, a validation set of 20%, and a testing set of 20%. Finally, we double-check our assumptions by employing the len function to compute the length of the arrays (step 4).

主站蜘蛛池模板: 甘南县| 河间市| 德江县| 开远市| 宁德市| 普洱| 永修县| 三门峡市| 江川县| 天祝| 吴桥县| 武宁县| 隆回县| 双流县| 牟定县| 丹棱县| 册亨县| 嘉禾县| 邵阳市| 壤塘县| 铜鼓县| 新乡市| 弋阳县| 贡觉县| 渝中区| 潍坊市| 淮安市| 苗栗市| 古丈县| 池州市| 托克托县| 五华县| 九龙城区| 聂荣县| 阿城市| 广东省| 沙河市| 蛟河市| 门头沟区| 广水市| 得荣县|