
Splitting the data

Finally, we split our data into training and test sets. We will train our classifier only on the training set, so it never sees the test set until we evaluate its performance. This step is very important because, as we will see later, the quality of predictions on the test set can differ dramatically from the quality measured on the training set. Data splitting is an operation specific to machine learning tasks, so we will import scikit-learn (a machine learning package) and use its train_test_split function:

In []: 
from sklearn.model_selection import train_test_split 
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.3, random_state=42) 
X_train.shape, y_train.shape, X_test.shape, y_test.shape 
Out[]: 
 ((700, 6), (700,), (300, 6), (300,)) 

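Now we have 700 training samples with 6 features each, and 300 test samples with the same number of features. The test_size=0.3 argument holds out 30% of the 1,000 samples for testing, and random_state=42 fixes the shuffle so that the same split is produced on every run.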
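
To make the gap between training and test quality concrete, here is a minimal sketch. It does not use the book's dataset: the `features` and `labels` arrays below are hypothetical stand-ins, filled with random numbers so that there is nothing real to learn, and the decision tree is just one convenient choice of classifier for the illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-ins for `features` and `labels` (assumption, not the
# book's data): 1,000 samples, 6 features, labels drawn at random.
rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 6))
labels = rng.integers(0, 2, size=1000)

# Same call as in the text: hold out 30% of the samples for testing.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.3, random_state=42
)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

# A decision tree with no depth limit can memorize its training samples.
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)

# Accuracy on data the model has seen vs. data it has never seen.
train_acc = clf.score(X_train, y_train)
test_acc = clf.score(X_test, y_test)
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

Because the labels are random, the tree scores perfectly on the training set simply by memorizing it, while its accuracy on the held-out test set stays close to chance. Only the test score reflects how the model behaves on unseen data.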
