
Hypothyroid

The hypothyroid dataset, Hypothyroid.csv, is available in the book's code bundle at /…/Chapter01/Data. The dataset has 26 variables, of which we will use only seven as covariates, and the number of observations is n = 3163. The data was downloaded from http://archive.ics.uci.edu/ml/datasets/thyroid+disease, where the filename is hypothyroid.data (http://archive.ics.uci.edu/ml/machine-learning-databases/thyroid-disease/hypothyroid.data). After some reordering and relabeling of certain values, the CSV file was made available in the book's code bundle. There are multiple variants of the dataset, and the reader can find the details at the following web page: http://archive.ics.uci.edu/ml/machine-learning-databases/thyroid-disease/HELLO.

The purpose of the study is to classify whether a patient has a thyroid problem based on the information provided by the other variables. The column representing the variable of interest is named Hypothyroid; it shows that 151 patients have thyroid problems, while the remaining 3012 tested negative. Clearly, this dataset is an example of unbalanced data, in which one class heavily outnumbers the other: for each thyroid case, we have about 20 negative cases. Such problems need to be handled differently, and we need to get into the subtleties of the algorithms to build meaningful models. The covariates that we will use while building the predictive models are Age, Gender, TSH, T3, TT4, T4U, and FTI. The data is first imported into an R session and then subset to the variables of interest as follows:

> HT <- read.csv("../Data/Hypothyroid.csv",header = TRUE,stringsAsFactors = FALSE)
> HT$Hypothyroid <- as.factor(HT$Hypothyroid)
> HT2 <- HT[,c("Hypothyroid","Age","Gender","TSH","T3","TT4","T4U","FTI")]

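The first line of code imports the data from the Hypothyroid.csv file using the read.csv function, the second converts the response Hypothyroid to a factor, and the third keeps only the response and the seven covariates. The variables contain a lot of missing values, as seen here: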

> sapply(HT2,function(x) sum(is.na(x)))
Hypothyroid         Age      Gender         TSH          T3         TT4 
          0         446          73         468         695         249 
        T4U         FTI 
        248         247 
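The per-column counts above do not say how many rows contain at least one missing value, because the missing entries of different columns can fall in the same rows. The following is a minimal sketch of that check on a small made-up data frame; the object `toy` and all of its values are illustrative assumptions, not the hypothyroid data.

```r
# Toy data frame standing in for HT2; all values are made up for illustration
toy <- data.frame(
  Hypothyroid = factor(c("negative", "hypothyroid", "negative",
                         "negative", "hypothyroid")),
  Age = c(45, NA, 60, 33, 71),
  TSH = c(1.2, 30.5, NA, 0.8, 44.0)
)

# Missing values per column, computed as in the sapply call above
na_per_column <- sapply(toy, function(x) sum(is.na(x)))

# Number of rows with no missing value in any column
n_complete <- sum(complete.cases(toy))

# na.omit drops every row that has at least one NA
toy_clean <- na.omit(toy)

# Class counts before and after dropping the incomplete rows
print(table(toy$Hypothyroid))
print(table(toy_clean$Hypothyroid))
```

With unbalanced data, it is worth comparing the class counts before and after dropping rows, since losing minority-class cases is more costly than losing majority-class ones.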

Consequently, we remove all the rows that have a missing value, and then split the data into training and testing datasets. We will also create a formula for the classification problem:

> HT2 <- na.omit(HT2)
> set.seed(12345)
> Train_Test <- sample(c("Train","Test"),nrow(HT2),replace=TRUE, prob=c(0.7,0.3))
> head(Train_Test)
[1] "Test"  "Test"  "Test"  "Test"  "Train" "Train"
> HT2_Train <- HT2[Train_Test=="Train",]
> HT2_TestX <- within(HT2[Train_Test=="Test",],rm(Hypothyroid))
> HT2_TestY <- HT2[Train_Test=="Test",c("Hypothyroid")]
> HT2_Formula <- as.formula("Hypothyroid~.")
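Because sample assigns each row to a part independently, the realized split is only approximately 70/30, and it is not stratified by class. The sketch below shows one way to inspect both aspects. It uses a toy response built from the class counts quoted earlier (151 and 3012) rather than the actual data after na.omit, and the level names and the object names `y` and `part` are assumptions made for illustration.

```r
# Toy response with the class counts quoted in the text (151 vs 3012);
# the level names are assumptions for illustration
y <- factor(rep(c("hypothyroid", "negative"), times = c(151, 3012)))

# Same kind of random assignment as in the split above
set.seed(12345)
part <- sample(c("Train", "Test"), length(y), replace = TRUE, prob = c(0.7, 0.3))

# Realized share of rows in each part (close to, but not exactly, 0.7 / 0.3)
part_share <- prop.table(table(part))

# Class counts within each part
class_by_part <- table(part, y)

# Row-wise proportions: share of each class within Test and within Train
class_share <- prop.table(class_by_part, margin = 1)

print(part_share)
print(class_by_part)
print(round(class_share, 3))
```

If the minority class turned out to be too thinly represented in one part, a stratified split, sampling within each class separately, would be an alternative to consider.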

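The set.seed function ensures that the results are reproducible each time we run the program. After removing the observations with missing values using the na.omit function, we split the hypothyroid data into training and testing parts, with each row assigned to the training part with probability 0.7. The former is used to build the model and the latter is used to validate it on data that was not used to build the model. The testing part is further separated into the covariates, HT2_TestX, and the response, HT2_TestY, and the formula Hypothyroid~. specifies that all the remaining variables are used as predictors. Quinlan, the inventor of the popular tree algorithm C4.5, used this dataset extensively.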
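The effect of set.seed can be seen in a minimal sketch that draws the Train/Test labels twice. The object names `first_draw` and `second_draw` and the sample size of 10 are chosen here purely for illustration.

```r
# Draw Train/Test labels twice, resetting the seed before each draw
set.seed(12345)
first_draw <- sample(c("Train", "Test"), 10, replace = TRUE, prob = c(0.7, 0.3))

set.seed(12345)
second_draw <- sample(c("Train", "Test"), 10, replace = TRUE, prob = c(0.7, 0.3))

# The same seed followed by the same call gives the same assignment
same_seed_matches <- identical(first_draw, second_draw)
print(same_seed_matches)
```

The seed has to be set immediately before the sampling call: any random draw made in between changes the generator's state and therefore the resulting split.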
