
Creating training and test sets

When a dataset is large enough, it's good practice to split it into training and test sets: the former is used to train the model and the latter to evaluate its performance. The following figure shows a schematic representation of this process:

There are two main rules in performing such an operation:

  • Both datasets must reflect the original distribution
  • The original dataset must be randomly shuffled before the split phase in order to avoid a correlation between consecutive elements
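
As a rough illustration of these two rules, the following sketch shuffles and splits a dataset by hand with plain NumPy. The toy arrays X and Y, the seed, and the 75/25 ratio are assumptions made only for this example:

```python
import numpy as np

# Hypothetical toy dataset: 100 samples, 3 features, labels stored in order
rng = np.random.RandomState(1000)
X = rng.normal(size=(100, 3))
Y = np.repeat([0, 1], 50)  # the first 50 samples are class 0, the last 50 class 1

# Shuffle the sample indices so that the original ordering is destroyed
indices = rng.permutation(X.shape[0])

# Keep 75 percent of the samples for training and the rest for testing
n_train = int(0.75 * X.shape[0])
train_idx, test_idx = indices[:n_train], indices[n_train:]

X_train, Y_train = X[train_idx], Y[train_idx]
X_test, Y_test = X[test_idx], Y[test_idx]
```

Without the permutation step, slicing off the last 25 samples of this ordered dataset would produce a test set containing only class 1, which would not reflect the original distribution.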

With scikit-learn, this can be achieved using the train_test_split() function:

from sklearn.model_selection import train_test_split

>>> X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=1000)
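
The call above assumes that X and Y already exist. A self-contained sketch, using a hypothetical imbalanced dataset, might look like the following. The optional stratify argument is not used in the original call; it is added here as one way to make both subsets keep the class proportions of the full dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced dataset: 80 samples of class 0 and 20 of class 1
rs = np.random.RandomState(1000)
X = rs.normal(size=(100, 2))
Y = np.array([0] * 80 + [1] * 20)

# Samples are shuffled by default; stratify=Y asks for the same class
# proportions in both the training and the test set
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.25, random_state=1000, stratify=Y)

# Compare the sizes and the fraction of class 1 in each subset
print(X_train.shape, X_test.shape)
print(Y_train.mean(), Y_test.mean())
```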

The test_size parameter (as well as train_size) specifies the fraction of elements to put into the test (or training) set. In this case, the ratio is 75 percent for training and 25 percent for the test phase. Another important parameter is random_state, which accepts either an integer seed or a NumPy RandomState generator. In many cases, it's important to make experiments reproducible, so it's also necessary to avoid using different seeds and, consequently, different random splits:

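My suggestion is to always use the same number (it can also be 0), or to define a global RandomState that can be passed to all functions requiring it. Note that if random_state is omitted, NumPy's global random state is used, so the split changes from run to run unless that global state has been seeded.

from sklearn.utils import check_random_state

>>> rs = check_random_state(1000)
>>> rs
<mtrand.RandomState at 0x12214708>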

>>> X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=rs)

In this way, as long as the seed is kept the same, all experiments lead to the same results and can be easily reproduced in different environments by other scientists.
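
The reproducibility claim can be checked with a short sketch. The toy arrays below are assumptions made for illustration; the point is only that repeating the call with the same seed, or with a freshly created RandomState built from that seed, gives the same split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy dataset with 20 samples
X = np.arange(40).reshape(20, 2)
Y = np.arange(20)

# Two independent calls with the same integer seed
_, _, _, Y_test_a = train_test_split(X, Y, test_size=0.25, random_state=1000)
_, _, _, Y_test_b = train_test_split(X, Y, test_size=0.25, random_state=1000)
same_split = np.array_equal(Y_test_a, Y_test_b)

# A freshly created RandomState with the same seed
rs = np.random.RandomState(1000)
_, _, _, Y_test_c = train_test_split(X, Y, test_size=0.25, random_state=rs)
same_as_generator = np.array_equal(Y_test_a, Y_test_c)

print(same_split, same_as_generator)
```

A RandomState instance is stateful: each call that draws from it advances its internal state, so passing the same instance to a second call generally gives a different split. To reproduce a whole experiment, the same sequence of calls has to be replayed starting from a generator created with the same seed.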

For further information about NumPy random number generation, visit https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.RandomState.html.