- Advanced Machine Learning with R
- Cory Lesmeister Dr. Sunil Kumar Chinnamgari
- 2021-06-24 14:24:39
Data creation
In this section, I'll discuss how I created the dataset used in this chapter and describe the features and the class labels we'll try to predict. The data is available on GitHub at https://github.com/PacktPublishing/Advanced-Machine-Learning-with-R/blob/master/Data/sim_df.csv.
- Let's get our libraries and data loaded:
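> # magrittr supplies the %>% pipe
> library(magrittr)
> # One-time installs; skip any package you already have
> install.packages("glmnet")
> install.packages("caret")
> install.packages("classifierplots")
> install.packages("DataExplorer")
> install.packages("InformationValue")
> install.packages("Metrics")
> install.packages("ROCR")
> install.packages("tidyverse")
> # Prefer fixed over scientific notation in printed output
> options(scipen=999)
> # Read the simulated data from the working directory
> sim_df <- readr::read_csv('sim_df.csv')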
The dataframe has 10,000 observations of 17 variables: 16 input features and 1 response. I created this dataset using the twoClassSim() function from the caret package. The full code, including seeds, is available in the online code, so you can make changes and create whatever data you'd like to explore. The function's help page explains all of the options for creating your own dataset.
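As a rough illustration of how twoClassSim() builds this kind of dataset, here is a minimal sketch. The seed and the intercept are assumptions chosen for illustration, not the values used to build sim_df.csv, and the sketch doesn't recreate the random1 feature or the Linear5/Linear6 correlation. Note that twoClassSim() returns its response as a factor column named Class.

```r
# Sketch only: the seed and intercept are illustrative assumptions,
# not the values used to build sim_df.csv.
library(caret)

set.seed(123)                # assumed seed, for reproducibility
toy_df <- twoClassSim(
  n = 10000,                 # number of observations
  intercept = -5,            # function default; controls class balance
  linearVars = 6,            # Linear1 ... Linear6
  noiseVars = 4              # Noise1 ... Noise4
)

dim(toy_df)                  # rows and columns
colnames(toy_df)             # response is the factor column 'Class'
table(toy_df$Class)          # class counts
```

Changing intercept moves the class balance, so it's the first argument to experiment with if you want more or less imbalance than the data used here.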
- Now, let me go over the column names and tell you what this is all about:
> colnames(sim_df)
[1] "TwoFactor1" "TwoFactor2" "Linear1" "Linear2" "Linear3" "Linear4"
[7] "Linear5" "Linear6" "Nonlinear1" "Nonlinear2" "Nonlinear3" "Noise1"
[13] "Noise2" "Noise3" "Noise4" "Class" "random1"
First of all, the TwoFactor features are correlated with each other and slightly predictive of the response, y. Five of the six linear features, the three nonlinear features, and the feature named random1 might have some predictive power. The four noise features should have no predictive power except by pure chance. Also, the Linear5 and Linear6 features are highly correlated; I created that relationship to show how the different methods handle it.
- The y labels are somewhat imbalanced, roughly 70/30:
> table(sim_df$y)
0 1
7072 2928
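To make the "roughly 70/30" split concrete, the following minimal sketch converts the counts shown above into proportions. It uses only base R and hardcodes the two counts from the table() output rather than reading the file; the variable names are illustrative.

```r
# Class counts as reported by table() in the text
class_counts <- c("0" = 7072, "1" = 2928)

# Share of each class, rounded to three decimals
class_props <- round(class_counts / sum(class_counts), 3)
print(class_props)

# Ratio of majority to minority class
imbalance_ratio <- max(class_counts) / min(class_counts)
print(round(imbalance_ratio, 2))
```

With the data loaded, prop.table(table(sim_df$y)) would give the same proportions directly, assuming the response column is named y as in the call above.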
- The data is narrow enough to include every column in a single correlation plot:
> DataExplorer::plot_correlation(sim_df)
The output of the preceding code is as follows:
[Figure: correlation plot of the sim_df features]
The plot confirms visually what I described previously: the highest correlation is between Linear5 and Linear6. One option is to eliminate one feature from each highly correlated pair, which I did in Chapter 2, Linear Regression. In this instance, we'll keep both and let the algorithms handle it.
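If you'd rather list highly correlated pairs numerically than read them off a plot, something like the following sketch may help. It runs on a small toy dataframe built here for illustration (the seed, the 0.3 noise level, and the 0.8 cutoff are all assumptions), not on sim_df itself, and uses only base R.

```r
set.seed(42)                             # assumed seed for the toy data
n <- 1000
x5 <- rnorm(n)
toy <- data.frame(
  Linear5 = x5,
  Linear6 = x5 + rnorm(n, sd = 0.3),     # built to track Linear5 closely
  Noise1  = rnorm(n)                     # independent of the others
)

cor_mat <- cor(toy)                      # Pearson correlation matrix
cutoff <- 0.8                            # hypothetical threshold

# Use the upper triangle so each pair is reported once
high <- which(abs(cor_mat) > cutoff & upper.tri(cor_mat), arr.ind = TRUE)
high_pairs <- data.frame(
  feature_a = rownames(cor_mat)[high[, "row"]],
  feature_b = colnames(cor_mat)[high[, "col"]],
  r = cor_mat[high]
)
print(high_pairs)
```

On the real data, the same steps could be applied to the numeric columns of sim_df. The caret package also offers findCorrelation(), which suggests columns to drop for a given cutoff, if you decide to remove one feature from each pair.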
This data is fully prepared for modeling in this chapter, so let's begin.