Example of random forest using German credit data

The same German credit data is used to illustrate the random forest model, so as to provide an apples-to-apples comparison with logistic regression. One significant difference anyone can observe is that the effort spent on data preprocessing decreases drastically. The following differences are worth mentioning:

  • In RF, we have not removed variables one by one from the analysis based on significance and VIF values, as significance tests are not applicable to ML models. However, five-fold cross-validation has been performed on the training data to ensure the model's robustness (a minimal sketch follows this list).
  • We removed one extra dummy variable per categorical variable in the logistic regression procedure, whereas in RF we have not removed the extra dummy variable from the analysis, as RF automatically takes care of multi-collinearity. In fact, the underlying base model on which the ensemble is built is a decision tree, for which multi-collinearity is not a problem at all. We will cover decision trees in depth in the next chapter.
  • Random forest requires much less human effort and intervention to train than logistic regression, which makes ML models a favorite of software engineers, as they can be deployed with much ease. Also, ML models can learn from data automatically without much hassle.
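
The following is a minimal sketch of that five-fold cross-validation (not from the original text), using scikit-learn's cross_val_score with the same hyperparameter values that are fitted later in this section; it assumes the x_train and y_train training split created further below:

>>> from sklearn.model_selection import cross_val_score

>>> # Five-fold cross-validation on the training data; cv_scores holds one
>>> # accuracy value per fold, and their mean indicates robustness
>>> cv_scores = cross_val_score(RandomForestClassifier(n_estimators=1000, criterion="gini", max_depth=100, min_samples_split=3, min_samples_leaf=2), x_train, y_train, cv=5)
>>> print ("Five-fold CV accuracies:", cv_scores)
>>> print ("Mean CV accuracy:", round(cv_scores.mean(), 3))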

Random forest applied on German credit data:

>>> import pandas as pd
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.metrics import accuracy_score

>>> credit_data = pd.read_csv("credit_data.csv")

>>> credit_data['class'] = credit_data['class']-1

The creation of dummy variables step is similar to the logistic regression preprocessing step:

>>> dummy_stseca = pd.get_dummies(credit_data['Status_of_existing_checking_account'], prefix='status_exs_accnt')
>>> dummy_ch = pd.get_dummies(credit_data['Credit_history'], prefix='cred_hist')
>>> dummy_purpose = pd.get_dummies(credit_data['Purpose'], prefix='purpose')
>>> dummy_savacc = pd.get_dummies(credit_data['Savings_Account'], prefix='sav_acc')
>>> dummy_presc = pd.get_dummies(credit_data['Present_Employment_since'], prefix='pre_emp_snc')
>>> dummy_perssx = pd.get_dummies(credit_data['Personal_status_and_sex'], prefix='per_stat_sx')
>>> dummy_othdts = pd.get_dummies(credit_data['Other_debtors'], prefix='oth_debtors')
>>> dummy_property = pd.get_dummies(credit_data['Property'], prefix='property')
>>> dummy_othinstpln = pd.get_dummies(credit_data['Other_installment_plans'], prefix='oth_inst_pln')
>>> dummy_housing = pd.get_dummies(credit_data['Housing'], prefix='housing')
>>> dummy_job = pd.get_dummies(credit_data['Job'], prefix='job')
>>> dummy_telephn = pd.get_dummies(credit_data['Telephone'], prefix='telephn')
>>> dummy_forgnwrkr = pd.get_dummies(credit_data['Foreign_worker'], prefix='forgn_wrkr')

>>> continuous_columns = ['Duration_in_month', 'Credit_amount', 'Installment_rate_in_percentage_of_disposable_income', 'Present_residence_since','Age_in_years','Number_of_existing_credits_at_this_bank',
'Number_of_People_being_liable_to_provide_maintenance_for']

>>> credit_continuous = credit_data[continuous_columns]

In the following variable combination step, we have not removed the extra dummy variable from any of the categorical variables. For example, all the dummy columns created for the Status_of_existing_checking_account variable are used in the random forest, including the one column that was dropped in logistic regression, because the tree-based model is not affected by the redundancy among dummy columns.
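
As a quick aside illustrating that difference (an assumed example, not part of the original preprocessing), pandas' get_dummies can drop the first level of a categorical variable via its drop_first argument, which mimics what the logistic regression pipeline did, whereas here all levels are kept:

>>> # All k dummy columns kept, as done here for the random forest
>>> pd.get_dummies(credit_data['Status_of_existing_checking_account'], prefix='status_exs_accnt').shape
>>> # k-1 dummy columns, mimicking the column dropped in logistic regression
>>> pd.get_dummies(credit_data['Status_of_existing_checking_account'], prefix='status_exs_accnt', drop_first=True).shape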

>>> credit_data_new = pd.concat([dummy_stseca, dummy_ch, dummy_purpose, dummy_savacc, dummy_presc, dummy_perssx, dummy_othdts, dummy_property, dummy_othinstpln, dummy_housing, dummy_job, dummy_telephn, dummy_forgnwrkr, credit_continuous, credit_data['class']], axis=1)

In the following example, the data has been split 70-30. The reason is that we will be performing five-fold cross-validation in grid search during training, which produces a similar effect to splitting the data into 50-25-25 train, validation, and test datasets respectively.

>>> x_train, x_test, y_train, y_test = train_test_split(credit_data_new.drop(['class'], axis=1), credit_data_new['class'], train_size=0.7, random_state=42)

The random forest ML model is applied with assumed hyperparameter values, as follows:

  • Number of trees is 1000
  • Criterion of splitting is gini
  • Maximum depth to which each decision tree can grow is 100
  • Minimum number of observations required at a node for it to be eligible for splitting is 3
  • Minimum number of observations required in a leaf node is 2

However, the optimum parameter values need to be tuned using grid search:

>>> rf_fit = RandomForestClassifier(n_estimators=1000, criterion="gini", max_depth=100, min_samples_split=3, min_samples_leaf=2)
>>> rf_fit.fit(x_train,y_train)

>>> print ("\nRandom Forest -Train Confusion Matrix\n\n", pd.crosstab(y_train, rf_fit.predict( x_train),rownames = ["Actuall"],colnames = ["Predicted"]))
>>> print ("\n Random Forest - Train accuracy",round(accuracy_score( y_train, rf_fit.predict(x_train)),3))

>>> print ("\nRandom Forest - Test Confusion Matrix\n\n",pd.crosstab(y_test, rf_fit.predict(x_test),rownames = ["Actuall"],colnames = ["Predicted"]))
>>> print ("\nRandom Forest - Test accuracy",round(accuracy_score(y_test, rf_fit.predict(x_test)),3))

From the above results, the test accuracy produced by random forest is 0.855, which is much higher than the test accuracy of 0.8053 from logistic regression, even after careful tuning and the removal of insignificant and multi-collinear variables in the latter. This phenomenon boils down to the core theme of the bias versus variance trade-off: linear models are very robust and do not have enough variance to fit the non-linearity in the data, whereas ensemble techniques minimize the variance error of a conventional decision tree, producing a result with low errors from both the bias and variance components.
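
To make the bias-variance point concrete, the following illustrative sketch (assumed, not from the original text) compares a single unpruned decision tree with the random forest fitted above, reusing the same train/test split and accuracy_score; the exact numbers will vary with the data and random seed:

>>> from sklearn.tree import DecisionTreeClassifier

>>> # A single fully grown tree tends to overfit: near-perfect train accuracy,
>>> # noticeably lower test accuracy (high variance)
>>> single_tree = DecisionTreeClassifier(criterion="gini", random_state=42)
>>> single_tree.fit(x_train, y_train)
>>> print ("Single Tree - Train accuracy", round(accuracy_score(y_train, single_tree.predict(x_train)), 3))
>>> print ("Single Tree - Test accuracy", round(accuracy_score(y_test, single_tree.predict(x_test)), 3))

>>> # Bagging many such trees in the random forest averages out that variance,
>>> # which is why rf_fit generalizes better on the test data
>>> print ("Random Forest - Test accuracy", round(accuracy_score(y_test, rf_fit.predict(x_test)), 3))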

The accuracy of the random forest can be further improved by using the grid search method to obtain optimum hyperparameters, whose accuracy could be much higher than that of the arbitrarily chosen hyperparameters used above. In the next section, we will cover the grid search method in detail.
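
As a minimal preview of the grid search covered in the next section (the parameter grid below is only an assumed example, not the values tuned there), scikit-learn's GridSearchCV could be applied roughly as follows:

>>> from sklearn.model_selection import GridSearchCV

>>> # Assumed example grid; five-fold CV picks the combination with the best
>>> # mean validation accuracy
>>> param_grid = {'n_estimators': [500, 1000], 'max_depth': [10, 50, 100], 'min_samples_split': [2, 3, 5]}
>>> rf_grid = GridSearchCV(RandomForestClassifier(criterion="gini"), param_grid, scoring='accuracy', cv=5)
>>> rf_grid.fit(x_train, y_train)
>>> print ("Best parameters:", rf_grid.best_params_)
>>> print ("Best cross-validated accuracy:", round(rf_grid.best_score_, 3))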
