Training a logistic regression algorithm
Follow these simple steps to train a logistic regression algorithm:
- The first step is to make sure the required packages are installed and to load the magrittr library into our environment:
> library(magrittr)
> install.packages("caret")
> install.packages("classifierplots")
> install.packages("earth")
> install.packages("Information")
> install.packages("InformationValue")
> install.packages("Metrics")
> install.packages("tidyverse")
- Here, we load the file, then check the dimensions and examine a table of the customer labels:
> santander <- read.csv("~/santander_prepd.csv")
> dim(santander)
[1] 76020 143
> table(santander$y)
0 1
73012 3008
We have 76,020 observations, but only 3,008 customers are labeled 1, meaning dissatisfied. Next, I'll use caret to create training and test sets with an 80/20 split.
- caret's createDataPartition() function automatically stratifies the sample based on the response, so we can rest assured that the class percentages are balanced between the train and test sets:
> set.seed(1966)
> trainIndex <- caret::createDataPartition(santander$y, p = 0.8, list = FALSE)
> train <- santander[trainIndex, ]
> test <- santander[-trainIndex, ]
- Let's see how the response is balanced between the two datasets:
> table(train$y)
0 1
58411 2405
> table(test$y)
0 1
14601 603
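We can confirm the exact class proportions, putting the magrittr pipe we loaded earlier to work (a quick sketch; the values follow directly from the counts above):
> train$y %>% table() %>% prop.table() %>% round(4)
     0      1 
0.9605 0.0395 
> test$y %>% table() %>% prop.table() %>% round(4)
     0      1 
0.9603 0.0397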
Roughly 4 percent of customers in each set are dissatisfied, so we can proceed. One interesting thing that can happen when you split the data is that a feature that was merely near zero variance in the full dataset becomes zero variance in your training set. When I prepared this data, I removed only the zero variance features.
- Some low variance features remain, so let's see whether the split created any new zero variance ones:
> train_zero <- caret::nearZeroVar(train, saveMetrics = TRUE)
> table(train_zero$zeroVar)
FALSE TRUE
142 1
- OK, one feature is now zero variance because of the split, and we can remove it:
> train <- train[, train_zero$zeroVar == 'FALSE']
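If you're curious which column was dropped, the data frame returned by nearZeroVar() with saveMetrics = TRUE carries the feature names as row names, so you can look it up directly (the specific name depends on your seed and split):
> rownames(train_zero)[train_zero$zeroVar]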
Our data frame now has 141 input features plus the column of customer labels. As we did with linear regression, for logistic regression to produce meaningful results, which is to say not to overfit, you need to reduce the number of input features. We could press forward with stepwise selection or the like, as we did in the previous chapter, or implement the feature regularization methods we'll discuss in the next chapter. However, I want to introduce a univariate feature reduction method using Weight of Evidence (WOE) and Information Value (IV), and show how to use it in a classification problem in conjunction with logistic regression.
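As a quick preview: the WOE of a feature bin is the log of the ratio of that bin's share of events (y = 1) to its share of non-events (y = 0), and IV sums the share difference times WOE across all bins, yielding a single predictive-strength score per feature. Here's a minimal sketch using the Information package we installed earlier (parallel = FALSE is my simplification for the example, not the chapter's final configuration):
> info_values <- Information::create_infotables(data = train, y = "y", parallel = FALSE)
> head(info_values$Summary) # features ranked by IV, highest first
By convention, features with IV below roughly 0.02 are considered unpredictive and are candidates for removal, while values above 0.3 indicate strong predictors.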