
How it works...

We start by reading in our dataset, which records historical and ongoing missile tests in North Korea. Our aim is to predict the type of missile from the remaining features, such as the facility and the time of launch. This concludes step 1. In step 2, we apply scikit-learn's train_test_split function to split X and y into a training set (X_train and y_train) and a testing set (X_test and y_test). The test_size = 0.2 parameter places 20% of the original data in the testing set and the remainder in the training set. The random_state parameter fixes the seed of the random shuffle, which allows us to reproduce the same split.

Next, concerning step 3, it is important to note that, in practice, we often want to compare several different models. The danger of using the testing set to select the best model is that we may end up overfitting to the testing set. This is similar to the statistical sin of data fishing (also known as data dredging). To combat this danger, we create an additional dataset, called the validation set. We train our models on the training set, use the validation set to compare them, and finally use the testing set to obtain an unbiased estimate of the performance of the model we have chosen.

So, in step 3, we choose our parameters so that the end result consists of a training set of 60% of the original dataset, a validation set of 20%, and a testing set of 20%. For instance, splitting off 25% of the 80% that remains after step 2 yields 0.25 × 0.8 = 20% of the original data for validation and leaves 60% for training. Finally, we double-check our assumptions by employing the len function to compute the length of the arrays (step 4).
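The two-stage split can be sketched as follows. This is a minimal illustration, not the recipe's exact code: the synthetic X and y stand in for the real dataset, and the second test_size of 0.25 and the random_state value of 31 are assumptions chosen so that the proportions come out to 60/20/20.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the recipe's features and labels (hypothetical data).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = rng.integers(0, 3, size=100)

# Step 2: hold out 20% of the data as the testing set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=31  # seed value is an assumption
)

# Step 3: hold out 25% of the remaining 80% as the validation set,
# which is 0.25 * 0.8 = 20% of the original data.
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=31
)

# Step 4: check the sizes of the three subsets.
print(len(X_train), len(X_val), len(X_test))
```

With 100 samples, the three subsets contain 60, 20, and 20 rows. Only the training and validation sets should be touched while comparing models; the testing set is kept aside for the final evaluation.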
