官术网_书友最值得收藏!

Train, validation, and test data

Cross-validation is not popular in the statistical modeling world for many reasons; statistical models are linear in nature and robust, and do not have a high variance/overfitting problem. Hence, the model fit will remain the same either on train or test data, which does not hold true in the machine learning world. Also, in statistical modeling, lots of tests are performed at the individual parameter level apart from aggregated metrics, whereas in machine learning we do not have visibility at the individual parameter level:

In the following code, both the R and Python implementation has been provided. If none of the percentages are provided, the default parameters are 50 percent for train data, 25 percent for validation data, and 25 percent for the remaining test data.

Python implementation has only one train and test split functionality, hence we have used it twice and also used the number of observations to split rather than the percentage (as shown in the previous train and test split example). Hence, a customized function is needed to split into three datasets:

>>> import pandas as pd       
>>> from sklearn.model_selection import train_test_split               
                         
>>> original_data = pd.read_csv("mtcars.csv")                    
  
>>> def data_split(dat,trf = 0.5,vlf=0.25,tsf = 0.25): 
...      nrows = dat.shape[0]     
...      trnr = int(nrows*trf) 
...      vlnr = int(nrows*vlf)     

The following Python code splits the data into training and the remaining data. The remaining data will be further split into validation and test datasets:

...      tr_data,rmng = train_test_split(dat,train_size = trnr,random_state=42) 
...      vl_data, ts_data = train_test_split(rmng,train_size = vlnr,random_state=45)     
...      return (tr_data,vl_data,ts_data) 

Implementation of the split function on the original data to create three datasets (by 50 percent, 25 percent, and 25 percent splits) is as follows:

>>> train_data, validation_data, test_data = data_split (original_data ,trf=0.5, vlf=0.25,tsf=0.25) 

The R code for the train, validation, and test split is as follows:

# Train Validation & Test samples 
trvaltest <- function(dat,prop = c(0.5,0.25,0.25)){ 
  nrw = nrow(dat) 
  trnr = as.integer(nrw *prop[1]) 
  vlnr = as.integer(nrw*prop[2]) 
  set.seed(123) 
  trni = sample(1:nrow(dat),trnr) 
  trndata = dat[trni,] 
  rmng = dat[-trni,] 
  vlni = sample(1:nrow(rmng),vlnr) 
  valdata = rmng[vlni,] 
  tstdata = rmng[-vlni,] 
  mylist = list("trn" = trndata,"val"= valdata,"tst" = tstdata) 
  return(mylist) 
} 
outdata = trvaltest(mtcars,prop = c(0.5,0.25,0.25)) 
train_data = outdata$trn; valid_data = outdata$val; test_data = outdata$tst 
主站蜘蛛池模板: 丹江口市| 丰原市| 华安县| 钦州市| 香港| 岳普湖县| 鄂伦春自治旗| 西青区| 石狮市| 昌平区| 高雄县| 麦盖提县| 光山县| 开原市| 教育| 甘孜县| 张北县| 峨山| 宜州市| 乌拉特后旗| 柞水县| 荣昌县| 宁晋县| 信丰县| 和龙市| 常德市| 武川县| 乾安县| 同江市| 仁寿县| 贵阳市| 锡林郭勒盟| 杨浦区| 玉环县| 遂溪县| 全州县| 沾化县| 舞阳县| 蒙阴县| 西藏| 托克逊县|