
Splitting the data

Finally, we split the data into training and test sets. The classifier is trained only on the training set and never sees the test set until we evaluate its performance. This step is important because, as we will see later, the quality of predictions on the test set can differ dramatically from the quality measured on the training set. Data splitting is specific to machine learning tasks, so we import the function for it from scikit-learn (a machine learning package):

In []: 
from sklearn.model_selection import train_test_split 
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.3, random_state=42) 
X_train.shape, y_train.shape, X_test.shape, y_test.shape 
Out[]: 
 ((700, 6), (700,), (300, 6), (300,)) 

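Now we have 700 training samples with 6 features each, and 300 test samples with the same number of features. The test_size=0.3 argument holds out 30% of the samples for testing, and random_state=42 fixes the shuffle so that the split is reproducible.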
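To illustrate why the held-out test set matters, here is a minimal, self-contained sketch. It does not use the dataset from this chapter: the features and labels arrays are synthetic stand-ins (an assumption made only for illustration), with labels drawn at random so that there is nothing real to learn. The decision tree classifier is likewise just a convenient example model, not necessarily the one used later.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 1000 samples, 6 features,
# and binary labels that are pure noise.
rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 6))
labels = rng.integers(0, 2, size=1000)

# Same split as above: 30% held out, fixed seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.3, random_state=42
)
print(X_train.shape, X_test.shape)  # (700, 6) (300, 6)

# An unrestricted tree can memorize the training set completely.
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
train_acc = clf.score(X_train, y_train)
test_acc = clf.score(X_test, y_test)
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

Because the labels are random, the tree scores perfectly on the samples it has memorized but should do no better than chance on the held-out ones. Measuring quality only on the training set would hide this completely.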
