官术网_书友最值得收藏!

Grid search

Grid search in machine learning is a popular way to tune the hyperparameters of the model in order to find the best combination for determining the best fit:

In the following code, implementation has been performed to determine whether a particular user will click an ad or not. Grid search has been implemented using a decision tree classifier for classification purposes. Tuning parameters are the depth of the tree, the minimum number of observations in terminal node, and the minimum number of observations required to perform the node split:

# Grid search 
>>> import pandas as pd 
>>> from sklearn.tree import DecisionTreeClassifier 
>>> from sklearn.model_selection import train_test_split 
>>> from sklearn.metrics import classification_report,confusion_matrix,accuracy_score 
>>> from sklearn.pipeline import Pipeline 
>>> from sklearn.grid_search import GridSearchCV 
 
>>> input_data = pd.read_csv("ad.csv",header=None)                        
 
>>> X_columns = set(input_data.columns.values) 
>>> y = input_data[len(input_data.columns.values)-1] 
>>> X_columns.remove(len(input_data.columns.values)-1) 
>>> X = input_data[list(X_columns)] 

Split the data into train and testing:

>>> X_train, X_test,y_train,y_test = train_test_split(X,y,train_size = 0.7,random_state=33) 

Create a pipeline to create combinations of variables for the grid search:

>>> pipeline = Pipeline([ 
...      ('clf', DecisionTreeClassifier(criterion='entropy')) ]) 

Combinations to explore are given as parameters in Python dictionary format:

>>> parameters = { 
...      'clf__max_depth': (50,100,150), 
...      'clf__min_samples_split': (2, 3), 
...      'clf__min_samples_leaf': (1, 2, 3)} 

The n_jobs field is for selecting the number of cores in a computer; -1 means it uses all the cores in the computer. The scoring methodology is accuracy, in which many other options can be chosen, such as precision, recall, and f1:

>>> grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='accuracy') 
>>> grid_search.fit(X_train, y_train)  

Predict using the best parameters of grid search:

>>> y_pred = grid_search.predict(X_test) 

The output is as follows:

>>> print ('\n Best score: \n', grid_search.best_score_) 
>>> print ('\n Best parameters set: \n')   
>>> best_parameters = grid_search.best_estimator_.get_params() 
>>> for param_name in sorted(parameters.keys()): 
>>>     print ('\t%s: %r' % (param_name, best_parameters[param_name])) 
>>> print ("\n Confusion Matrix on Test data \n",confusion_matrix(y_test,y_pred)) 
>>> print ("\n Test Accuracy \n",accuracy_score(y_test,y_pred)) 
>>> print ("\nPrecision Recall f1 table \n",classification_report(y_test, y_pred)) 

The R code for grid searches on decision trees is as follows:

# Grid Search on Decision Trees 
library(rpart) 
input_data = read.csv("ad.csv",header=FALSE) 
input_data$V1559 = as.factor(input_data$V1559) 
set.seed(123) 
numrow = nrow(input_data) 
trnind = sample(1:numrow,size = as.integer(0.7*numrow)) 
 
train_data = input_data[trnind,];test_data = input_data[-trnind,] 
minspset = c(2,3);minobset = c(1,2,3) 
initacc = 0 
 
for (minsp in minspset){ 
  for (minob in minobset){ 
    tr_fit = rpart(V1559 ~.,data = train_data,method = "class",minsplit = minsp, minbucket = minob) 
    tr_predt = predict(tr_fit,newdata = train_data,type = "class") 
    tble = table(tr_predt,train_data$V1559) 
    acc = (tble[1,1]+tble[2,2])/sum(tble) 
    acc 
    if (acc > initacc){ 
      tr_predtst = predict(tr_fit,newdata = test_data,type = "class") 
      tblet = table(test_data$V1559,tr_predtst) 
      acct = (tblet[1,1]+tblet[2,2])/sum(tblet) 
      acct 
      print(paste("Best Score")) 
      print( paste("Train Accuracy ",round(acc,3),"Test Accuracy",round(acct,3))) 
      print( paste(" Min split ",minsp," Min obs per node ",minob)) 
      print(paste("Confusion matrix on test data")) 
      print(tblet) 
      precsn_0 = (tblet[1,1])/(tblet[1,1]+tblet[2,1]) 
      precsn_1 = (tblet[2,2])/(tblet[1,2]+tblet[2,2]) 
      print(paste("Precision_0: ",round(precsn_0,3),"Precision_1: ",round(precsn_1,3))) 
      rcall_0 = (tblet[1,1])/(tblet[1,1]+tblet[1,2]) 
      rcall_1 = (tblet[2,2])/(tblet[2,1]+tblet[2,2]) 
      print(paste("Recall_0: ",round(rcall_0,3),"Recall_1: ",round(rcall_1,3))) 
      initacc = acc 
    } 
  } 
} 
主站蜘蛛池模板: 西安市| 大英县| 麻栗坡县| 广饶县| 庄河市| 阆中市| 光泽县| 藁城市| 三穗县| 丰镇市| 卓尼县| 济南市| 丹寨县| 岫岩| 红原县| 台山市| 汾阳市| 锡林浩特市| 玛曲县| 玉龙| 阜新市| 浦城县| 梨树县| 景德镇市| 登封市| 夹江县| 武穴市| 兴宁市| 桂阳县| 阳朔县| 玉环县| 固镇县| 扎兰屯市| 洞头县| 广平县| 星座| 历史| 育儿| 潼南县| 恩施市| 和林格尔县|