- Machine Learning Algorithms
- Giuseppe Bonaccorso
- 2021-07-02 18:53:29
Creating training and test sets
When a dataset is large enough, it's good practice to split it into training and test sets; the former is used to train the model and the latter to test its performance. In the following figure, there's a schematic representation of this process:

There are two main rules in performing such an operation:
- Both datasets must reflect the original distribution
- The original dataset must be randomly shuffled before the split phase in order to avoid correlation between consecutive elements
With scikit-learn, this can be achieved using the train_test_split() function:
>>> from sklearn.model_selection import train_test_split
>>> X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=1000)
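The two rules above correspond to the shuffle and stratify parameters of the same function. The following is a minimal sketch, not the author's own example: the imbalanced dataset is hypothetical, and using stratify is only one possible way to make both subsets reflect the original class distribution.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced dataset: 80 samples of class 0 followed by
# 20 samples of class 1, stored in sorted order (so an unshuffled
# split would put only class 1 in the tail of the data).
X = np.arange(200).reshape(100, 2)
Y = np.array([0] * 80 + [1] * 20)

# shuffle=True is the default and is written out only for clarity.
# stratify=Y asks for the class proportions of Y to be preserved
# (as closely as possible) in both the training and the test set.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.25, shuffle=True, stratify=Y, random_state=1000)

# Fraction of class-1 samples in each subset
train_ratio = Y_train.mean()
test_ratio = Y_test.mean()
print(train_ratio, test_ratio)
```

Both printed ratios should stay close to the 20 percent of class 1 found in the full dataset, which is what the first rule asks for.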
The test_size parameter (like train_size) specifies the fraction of elements to put into the test (or training) set as a float between 0 and 1; an integer value is interpreted as an absolute number of samples. In this case, the ratio is 75 percent for training and 25 percent for testing. Another important parameter is random_state, which accepts either a NumPy RandomState generator or an integer seed. In many cases, experiments must be reproducible, so it's necessary to avoid using different seeds and, consequently, different random splits:
>>> from sklearn.utils import check_random_state
>>> rs = check_random_state(1000)
>>> rs
<mtrand.RandomState at 0x12214708>
>>> X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=rs)
In this way, as long as the seed is kept the same, all experiments lead to the same results and can be easily reproduced in different environments by other scientists.
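The effect of test_size and random_state can be checked directly. The sketch below is only illustrative: the dataset is hypothetical (1000 random samples), and the names split_a, split_b, and split_c are introduced here and do not appear in the original text.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical dataset: 1000 samples with 4 features each.
# Y simply stores the sample index so that splits are easy to compare.
X = np.random.RandomState(0).uniform(size=(1000, 4))
Y = np.arange(1000)

# Two independent calls with the same integer seed
split_a = train_test_split(X, Y, test_size=0.25, random_state=1000)
split_b = train_test_split(X, Y, test_size=0.25, random_state=1000)

# A third call with a different seed
split_c = train_test_split(X, Y, test_size=0.25, random_state=2000)

# test_size=0.25 leaves 75 percent of the samples for training
X_train, X_test, Y_train, Y_test = split_a
sizes = (X_train.shape[0], X_test.shape[0])

# Same seed: every returned array is identical
same_seed_equal = all(
    np.array_equal(a, b) for a, b in zip(split_a, split_b))

# Different seed: the test set contains different samples
diff_seed_equal = np.array_equal(split_a[3], split_c[3])

print(sizes, same_seed_equal, diff_seed_equal)
```

One detail worth keeping in mind: an integer seed produces the same split on every call, while a single RandomState instance passed to several consecutive calls advances its internal state, so those calls return different splits within the same run. For a split that must be repeated exactly, an integer seed is usually the simpler choice.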