
Dataset creation

The data we use in this chapter can be downloaded from any source on the internet or from GitHub at this link: https://github.com/PacktPublishing/Advanced-Machine-Learning-with-R/tree/master/Chapter05.

I found this data on a website dedicated to providing datasets for support vector machine analysis. You can follow this link to find numerous datasets for testing your learning methods: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.

The authors have asked that their work be cited, which I will abide by:

Chih-Chung Chang and Chih-Jen Lin, LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1--27:27, 2011

The data we're using is named a5a and consists of training data with 6,414 observations. This dataset is large enough to facilitate learning without causing computational speed issues. Also, when doing KNN or SVM, you need to center/scale or normalize your data to a 0 to 1 range if the input features are on different scales. This data's input features take just two values, 0 or 1, so we can forgo any normalization efforts.
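If your features were on different scales, that normalization step would matter. Here's a minimal sketch of centering and scaling using base R's scale() function; the toy data frame below is made up purely for illustration:

```r
# Made-up data on very different scales, for illustration only
toy <- data.frame(age = c(25, 40, 60, 33),
                  income = c(30000, 85000, 120000, 52000))

# scale() centers each column to mean 0 and scales it to sd 1
toy_scaled <- as.data.frame(scale(toy))

round(colMeans(toy_scaled), 10)  # both column means are essentially 0
```

Alternatively, the caret package offers preProcess() for the same purpose, which is handy when you need to apply an identical transformation to both train and test sets.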

I'll show you how to load this data into R, and you can replicate that process on any data you desire to use.

While we're at it, we may as well install all of the packages needed for this chapter; only magrittr is loaded directly, and the rest can be called with the :: namespace operator as needed:

> library(magrittr)

> install.packages("ggthemes")

> install.packages("caret")

> install.packages("classifierplots")

> install.packages("DataExplorer")

> install.packages("e1071")

> install.packages("InformationValue")

> install.packages("kknn")

> install.packages("Matrix")

> install.packages("Metrics")

> install.packages("plm")

> install.packages("ROCR")

> install.packages("tidyverse")

> options(scipen=999)

It's a simple matter to access this data using R's download.file() function. You need to provide the link and give the file a name:


> download.file('https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a5a', 'chap5')

What's rather interesting now is that you can put this downloaded file into a usable format with a function from the e1071 library created explicitly for this sparse LIBSVM format:

> df <- e1071::read.matrix.csr("chap5")

The df object is now a list containing the input features as a sparse matrix, and the response labels structured as a factor with two levels (-1 and +1). This list is what is saved on GitHub as an R data file, like this:

> saveRDS(df, file = "chapter05")

Let's look at how to turn this list into something usable, starting by loading it into your environment:

> df <- readRDS("chapter05")

We'll create the classification labels in an object called y, and turn -1 into 0, and +1 into 1:

> y <- df$y

> y <- ifelse(y == "+1", 1, 0)

> table(y)
y
0 1
4845 1569
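To express those counts as proportions, prop.table() does the trick; the counts below are copied from the table output above:

```r
# Label counts taken from the table output above
y_counts <- c(`0` = 4845, `1` = 1569)

round(prop.table(y_counts), 3)  # about 75.5% non-events, 24.5% events
```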

The table shows us that just under 25% of the labels are considered an event. What event? It doesn't matter for our purposes, so we can move on and produce a dataframe of the predictors called x. I tried a number of ways to put the sparse matrix into a dataframe, and it seems that the following code is the easiest, using a function from the Matrix package:

> x <- Matrix::as.matrix(df$x)

> x <- as.data.frame(x)

> dim(x)
[1] 6414 122

We now have our dataframe of 6,414 observations and 122 input features. Next, we'll create train/test sets and explore the features.
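As a preview, one common way to create such a split is a simple random sample of row indices; the chapter's actual approach may differ, and the simulated x and y below are stand-ins for the real data so this sketch is self-contained:

```r
set.seed(1966)  # for reproducibility

# Simulated stand-ins for the real x (6,414 x 122 binary features) and y
x <- as.data.frame(matrix(rbinom(6414 * 122, 1, 0.1), nrow = 6414))
y <- rbinom(6414, 1, 0.25)

# Put roughly 70% of the observations in the training set
idx     <- sample(seq_len(nrow(x)), size = floor(0.7 * nrow(x)))
train_x <- x[idx, ];  train_y <- y[idx]
test_x  <- x[-idx, ]; test_y  <- y[-idx]

dim(train_x)  # 4489 rows, 122 columns
```

A stratified split (for example, with caret::createDataPartition()) would instead preserve the roughly 25% event rate in both sets, which matters more when the classes are heavily imbalanced.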
