
Splitting the data

Finally, we want to split our data into training and test sets. We will train our classifier only on the training set, so it never sees the test set until we evaluate its performance. This step is very important because, as we will see later, the quality of predictions on the test set can differ dramatically from the quality measured on the training set. Data splitting is specific to machine learning tasks, so we will import the train_test_split function from scikit-learn (a machine learning package) and use it to hold out 30% of the data for testing:

In []: 
from sklearn.model_selection import train_test_split 
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.3, random_state=42) 
X_train.shape, y_train.shape, X_test.shape, y_test.shape 
Out[]: 
 ((700, 6), (700,), (300, 6), (300,)) 

Now we have 700 training samples with 6 features each, and 300 test samples with the same number of features.
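
To make the behavior of the split easier to inspect in isolation, here is a minimal sketch that runs the same call on synthetic data. The random `features` and `labels` arrays are stand-ins invented for this illustration (the chapter's real data is built earlier), and the `stratify` argument is an optional extra not used in the text above. The sketch checks two things: that fixing `random_state` makes the split reproducible, and that stratifying keeps the class proportions similar in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the chapter's data (an assumption for illustration):
# 1000 samples with 6 features each and a binary label per sample.
rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 6))
labels = rng.integers(0, 2, size=1000)

# The same call as in the text: 30% of the rows go to the test set.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.3, random_state=42
)

# Repeating the call with the same random_state yields the same split.
X_train_again, _, _, _ = train_test_split(
    features, labels, test_size=0.3, random_state=42
)
same_split = np.array_equal(X_train, X_train_again)

# Optional: stratify=labels keeps the share of each class roughly equal
# in the training and test sets.
_, _, ys_train, ys_test = train_test_split(
    features, labels, test_size=0.3, random_state=42, stratify=labels
)

print(X_train.shape, X_test.shape, same_split)
print(ys_train.mean(), ys_test.mean())
```

Stratification matters mostly when one class is rare: a purely random split could then leave the test set with too few examples of that class to evaluate the classifier meaningfully.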
