官术网_书友最值得收藏!

  • Deep Learning with R for Beginners
  • Mark Hodnett Joshua F. Wiley Yuxi (Hayden) Liu Pablo Maldonado
  • 977字
  • 2021-06-24 14:30:45

The binary classification model

The code from the previous section creates a new file called predict.csv in the dunnhumby folder. This dataset has a single row for each customer with a 0/1 field indicating whether they visited in the last two weeks and predictor variables based on sales data before those two weeks. Now we can proceed to build some machine learning models. The Chapter4/binary_predict.R file contains the code for our first prediction task, binary classification. The first part of the code loads the data and creates an array of  predictor variables by including all columns except the customer ID, the binary classification predictor variable, and the regression predictor variable. The feature columns are all numeric fields that are heavily right-skewed distributed, so we apply a log transformation to those fields. We add 0.01 first to avoid getting a non-numeric result from attempting to get a log of a zero value (log(0)= -Inf).

The following plot shows the data before transformation, on the left, and the data after transformation, on the right:

Figure 4.5: Distribution of a feature variable before and after transformation.

The large bar on the left in the second plot is where the original field was zero (log(0+0.01) = -4.6). The following code loads the data, performs the log transformation, and creates the previous plot:

set.seed(42)
fileName <- "../dunnhumby/predict.csv"
dfData <- read_csv(fileName,
col_types = cols(
.default = col_double(),
CUST_CODE = col_character(),
Y_categ = col_integer())
)
nobs <- nrow(dfData)
train <- sample(nobs, 0.9*nobs)
test <- setdiff(seq_len(nobs), train)
predictorCols <- colnames(dfData)[!(colnames(dfData) %in% c("CUST_CODE","Y_numeric","Y_categ"))]

# data is right-skewed, apply log transformation
qplot(dfData$Y_numeric, geom="histogram",binwidth=10,
main="Y value distribution",xlab="Spend")+theme(plot.title = element_text(hjust = 0.5))
dfData[, c("Y_numeric",predictorCols)] <- log(0.01+dfData[, c("Y_numeric",predictorCols)])
qplot(dfData$Y_numeric, geom="histogram",binwidth=0.5,
main="log(Y) value distribution",xlab="Spend")+theme(plot.title = element_text(hjust = 0.5))
trainData <- dfData[train, c(predictorCols)]
testData <- dfData[test, c(predictorCols)]
trainData$Y_categ <- dfData[train, "Y_categ"]$Y_categ
testData$Y_categ <- dfData[test, "Y_categ"]$Y_categ

Before we train a deep learning model, we train three machine learning models – a logistic regression model, a Random Forest model, and an XGBoost model – on the data as a benchmark. This code section contains the data load, transformation, and three models:

#Logistic Regression Model
logReg=glm(Y_categ ~ .,data=trainData,family=binomial(link="logit"))
pr <- as.vector(ifelse(predict(logReg, type="response",
testData) > 0.5, "1", "0"))
# Generate the confusion matrix showing counts.
t<-table(dfData[test, c(predictorCols, "Y_categ")]$"Y_categ", pr,
dnn=c("Actual", "Predicted"))
acc<-round(100.0*sum(diag(t))/length(test),2)
print(t)
Predicted
Actual 0 1
0 130 42
1 48 174
print(sprintf(" Logistic regression accuracy = %1.2f%%",acc))
[1] " Logistic regression accuracy = 77.16%"
rm(t,pr,acc)

rf <- randomForest::randomForest(as.factor(Y_categ) ~ .,
data=trainData,
na.action=randomForest::na.roughfix)
pr <- predict(rf, newdata=testData, type="class")
# Generate the confusion matrix showing counts.
t<-table(dfData[test, c(predictorCols, "Y_categ")]$Y_categ, pr,
dnn=c("Actual", "Predicted"))
acc<-round(100.0*sum(diag(t))/length(test),2)
print(t)
Predicted
Actual 0 1
0 124 48
1 30 192
print(sprintf(" Random Forest accuracy = %1.2f%%",acc))
[1] " Random Forest accuracy = 80.20%"
rm(t,pr,acc)

xgb <- xgboost(data=data.matrix(trainData[,predictorCols]), label=trainData[,"Y_categ"]$Y_categ,
nrounds=75, objective="binary:logistic")
pr <- as.vector(ifelse(
predict(xgb, data.matrix(testData[, predictorCols])) > 0.5, "1", "0"))
t<-table(dfData[test, c(predictorCols, "Y_categ")]$"Y_categ", pr,
dnn=c("Actual", "Predicted"))
acc<-round(100.0*sum(diag(t))/length(test),2)
print(t)
Predicted
Actual 0 1
0 125 47
1 44 178
print(sprintf(" XGBoost accuracy = %1.2f%%",acc))
[1] " XGBoost accuracy = 76.90%"
rm(t,pr,acc)

We create logistic regression, Random Forest, and XGBoost models for a number of reasons. Firstly, most of the work is already done in preparing the data, so it is trivial to do so. Secondly, it gives us a benchmark to compare our deep learning model to. Thirdly, if there were a problem in the data-preparation tasks, these machine learning algorithms would highlight these problems more rapidly because they will be quicker than training a deep learning model. In this case, we only have a few thousand records, so these machine learning algorithms will easily run on this data. If the data were too large for these algorithms, I would consider taking a smaller sample and running our benchmark tasks on that smaller sample. There are many machine learning algorithms to choose from, but I used these algorithms as benchmarks for the following reasons:

  • Logistic regression is a basic model and is always a good benchmark to use
  • Random Forest is known to train well using the default parameters and is robust to overfitting and correlated variables (which we have here)
  • XGBoost is consistently rated as the one of the best-performing machine learning algorithms

All three algorithms achieve a similar amount of accuracy, the highest accuracy was achieved by Random Forest with an 80.2% accuracy. We now know that this dataset is suitable for prediction tasks and we have a benchmark to compare against.

Now we will build a deep learning model using MXNet:

require(mxnet)

# MXNet expects matrices
train_X <- data.matrix(trainData[, predictorCols])
test_X <- data.matrix(testData[, predictorCols])
train_Y <- trainData$Y_categ

# hyper-parameters
num_hidden <- c(128,64,32)
drop_out <- c(0.2,0.2,0.2)
wd=0.00001
lr <- 0.03
num_epochs <- 40
activ <- "relu"

# create our model architecture
# using the hyper-parameters defined above
data <- mx.symbol.Variable("data")
fc1 <- mx.symbol.FullyConnected(data, name="fc1", num_hidden=num_hidden[1])
act1 <- mx.symbol.Activation(fc1, name="activ1", act_type=activ)

drop1 <- mx.symbol.Dropout(data=act1,p=drop_out[1])
fc2 <- mx.symbol.FullyConnected(drop1, name="fc2", num_hidden=num_hidden[2])
act2 <- mx.symbol.Activation(fc2, name="activ2", act_type=activ)

drop2 <- mx.symbol.Dropout(data=act2,p=drop_out[2])
fc3 <- mx.symbol.FullyConnected(drop2, name="fc3", num_hidden=num_hidden[3])
act3 <- mx.symbol.Activation(fc3, name="activ3", act_type=activ)

drop3 <- mx.symbol.Dropout(data=act3,p=drop_out[3])
fc4 <- mx.symbol.FullyConnected(drop3, name="fc4", num_hidden=2)
softmax <- mx.symbol.SoftmaxOutput(fc4, name="sm")

# run on cpu, change to 'devices <- mx.gpu()'
# if you have a suitable GPU card
devices <- mx.cpu()
mx.set.seed(0)
tic <- proc.time()
# This actually trains the model
model <- mx.model.FeedForward.create(softmax, X = train_X, y = train_Y,
ctx = devices,num.round = num_epochs,
learning.rate = lr, momentum = 0.9,
eval.metric = mx.metric.accuracy,
initializer = mx.init.uniform(0.1),
wd=wd,
epoch.end.callback = mx.callback.log.train.metric(1))
print(proc.time() - tic)
user system elapsed
9.23 4.65 4.37

pr <- predict(model, test_X)
pred.label <- max.col(t(pr)) - 1
t <- table(data.frame(cbind(testData[,"Y_categ"]$Y_categ,pred.label)),
dnn=c("Actual", "Predicted"))
acc<-round(100.0*sum(diag(t))/length(test),2)
print(t)
Predicted
Actual 0 1
0 136 36
1 54 168
print(sprintf(" Deep Learning Model accuracy = %1.2f%%",acc))
[1] " Deep Learning Model accuracy = 77.16%"
rm(t,pr,acc)
rm(data,fc1,act1,fc2,act2,fc3,act3,fc4,softmax,model)

The deep learning model achieved a 77.16% accuracy on the test data, which is only beaten by the Random Forest model. This shows that a deep learning model can be competitive against the best machine learning algorithms. It also shows that deep learning models on classification tasks do not always beat other machine learning algorithms. We used these models to provide a benchmark, so that we would know that our deep learning model was getting decent results; it gives us confidence that our deep learning model is competitive.

Our deep learning model uses 20% dropout in each layer and weight decay for regularization. Without dropout, the model overtrained significantly. This was probably because the features are highly correlated, as our columns are the spend in various departments. It figures that if one column is for a type of bread, and another column is for a type of milk, then these change together, namely someone who has more transactions and spends more is likely to buy both.

主站蜘蛛池模板: 观塘区| 永康市| 津南区| 海林市| 彭山县| 宁城县| 莲花县| 梁平县| 澄江县| 景谷| 广德县| 仙桃市| 安塞县| 永嘉县| 册亨县| 白玉县| 孝感市| 嘉义县| 昌黎县| 海伦市| 巨野县| 石嘴山市| 密山市| 京山县| 常宁市| 都江堰市| 高阳县| 突泉县| 哈尔滨市| 东台市| 泸州市| 盐津县| 临沭县| 宁德市| 西丰县| 子洲县| 金阳县| 绥芬河市| 科尔| 苍梧县| 鸡西市|