
Data splitting

Training the parameters of a prediction function and testing it on the same data is a methodological mistake. A model that simply memorized the labels of the samples it was trained on would score perfectly on those samples, but would not be able to predict anything useful on data it has never seen. This situation is called overfitting. To avoid it, common practice in machine learning experiments is to split the available data, providing one portion as a training set and another as a test set.

Data splitting is an operation that divides the available data into two sets, generally for cross-validation purposes: one dataset is used to train a predictive model, and the other to test the model's performance. Training and testing the model forms the basis for its further use for prediction in predictive analytics. For example, given a dataset of 100 rows that includes the predictor and response variables, we split the dataset in a convenient ratio (say 70:30), allocating 70 rows for training and 30 rows for testing. The rows are selected randomly to reduce bias. Once the training data is available, it is fed to the neural network so that it can approximate the underlying input-output function: the training data determines the weights, biases, and activation functions to be used, so that the network can map inputs to outputs.
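The 70:30 random split described above can be sketched by hand before reaching for a library. The following is a minimal illustration on a hypothetical 100-row DataFrame (the column names and data are invented for the example):

from numpy.random import default_rng
import pandas as pd

# Hypothetical 100-row dataset: one predictor column and one response column.
rng = default_rng(0)
data = pd.DataFrame({"x": rng.normal(size=100), "y": rng.normal(size=100)})

# Shuffle the rows so the selection is random, then take the first 70 for
# training and the remaining 30 for testing.
shuffled = data.sample(frac=1, random_state=0)
train = shuffled.iloc[:70]
test = shuffled.iloc[70:]

print(train.shape, test.shape)  # (70, 2) (30, 2)

Because every row lands in exactly one of the two sets, no test sample is ever seen during training, which is precisely what guards against the overfitting scenario above.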

Once sufficient convergence is achieved, the model is stored in memory and the next step is to test it. We pass in the 30 held-out rows and check how closely the model's predictions match the actual outputs. This evaluation produces various metrics that validate the model. If the accuracy is too low, the model has to be rebuilt with changes to the training data and to the other parameters passed to the neural network builder.
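For a regression target such as the one in this chapter, a typical evaluation metric on the held-out rows is the mean squared error between actual and predicted values. A minimal sketch, using invented actual/predicted arrays in place of real model output:

import numpy as np

# Hypothetical actual vs. predicted values for a few held-out rows.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# Mean squared error: average of the squared prediction errors.
mse = np.mean((y_true - y_pred) ** 2)
print(mse)  # 0.375

The closer the MSE is to zero, the better the model reproduces the test outputs; a large value signals that the model should be rebuilt with different parameters.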

To split the data, we use the scikit-learn library; more specifically, the sklearn.model_selection.train_test_split() function, which quickly computes a random split into training and test sets.

Let's start by importing the function:

from sklearn.model_selection import train_test_split

At this point, to make work easier for us, we will divide the starting DataFrame into two: predictors (X) and target (Y).

To do this, the pandas.DataFrame.drop() function will be used:

X = DataScaled.drop('medv', axis = 1)
print(X.describe())
Y = DataScaled['medv']
print(Y.describe())
print('X shape = ', X.shape)
print('Y shape = ', Y.shape)

The pandas.DataFrame.drop() function drops specified labels from rows or columns. Rows or columns can be removed by specifying label names and the corresponding axis, or by specifying index or column names directly. When using a MultiIndex, labels on different levels can be removed by specifying the level. To extract X, we removed the target column (medv) from the starting DataScaled DataFrame; passing axis = 1 tells drop() to look for medv among the columns rather than the rows.
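The predictors/target separation can be seen on a toy DataFrame standing in for DataScaled (the column names here are purely illustrative):

import pandas as pd

# Toy DataFrame: two predictor columns and one target column.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "target": [5, 6]})

X = df.drop("target", axis=1)   # axis=1: drop a column, not a row
y = df["target"]

print(list(X.columns))  # ['a', 'b']
print(y.tolist())       # [5, 6]

Note that drop() returns a new DataFrame; the original df still contains the target column afterwards.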

The following results are returned:

X shape = (506, 13)
Y shape = (506,)

So, X has 13 columns (predictors) and Y has only one column (target). Now, we have to split the two DataFrames:

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.30, random_state = 5)
print('X train shape = ', X_train.shape)
print('X test shape = ', X_test.shape)
print('Y train shape = ', Y_train.shape)
print('Y test shape = ', Y_test.shape)

In the train_test_split() function, four arguments are passed, namely X, Y, test_size, and random_state. X and Y are the predictor and target DataFrames. The test_size parameter accepts a float, an int, or None (default=0.25): a float between 0.0 and 1.0 represents the proportion of the dataset to include in the test split; an int represents the absolute number of test samples; None sets the test size to the complement of the train size. In our case, we set test_size = 0.30, which means that 30% of the data is set aside as test data. Finally, the random_state parameter sets the seed used by the random number generator, which guarantees that repeating the operation reproduces the same split.
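The float and int forms of test_size can be compared directly on a small invented array; with 10 samples, test_size=0.3 and test_size=3 hold out the same number of rows:

from sklearn.model_selection import train_test_split
import numpy as np

# Toy data: 10 samples, 2 features each.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Float: proportion of the data held out for testing.
_, X_test_f, _, _ = train_test_split(X, y, test_size=0.3, random_state=5)
# Int: absolute number of test samples.
_, X_test_i, _, _ = train_test_split(X, y, test_size=3, random_state=5)

print(X_test_f.shape, X_test_i.shape)  # (3, 2) (3, 2)

Because random_state is fixed to the same seed in both calls, the two splits select exactly the same rows.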

The following results are returned:

X train shape = (354, 13)
X test shape = (152, 13)
Y train shape = (354,)
Y test shape = (152,)

So, the starting DataFrame is split into two datasets that have 354 rows (X_train) and 152 rows (X_test). A similar subdivision was made for Y.
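A quick sanity check worth running after any split is that the train and test index sets are disjoint and together cover every row of the original DataFrame. A sketch on toy data (the column names are illustrative):

from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd

# Toy stand-in for DataScaled: 10 rows, two predictors and one target.
df = pd.DataFrame(np.arange(30).reshape(10, 3), columns=["a", "b", "c"])
X, y = df[["a", "b"]], df["c"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=5)

# The two index sets partition the original rows: no overlap, no gaps.
assert set(X_train.index).isdisjoint(X_test.index)
assert set(X_train.index) | set(X_test.index) == set(df.index)
print(X_train.shape, X_test.shape)  # (7, 2) (3, 2)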
