- Advanced Machine Learning with R
- Cory Lesmeister Dr. Sunil Kumar Chinnamgari
- 2021-06-24 14:24:33
Exploring categorical variables
There are many ways to get to know your categorical variables. The base R table() function gives the frequency of each level in a feature. If you just want to see how many distinct levels a feature has, dplyr works well. In this example, we examine type, which has three unique levels:
dplyr::count(gettysburg, dplyr::n_distinct(type))
The output of the preceding code is as follows; the first column is the number of distinct levels, and n is the number of rows:
# A tibble: 1 x 2
  `dplyr::n_distinct(type)`     n
                      <int> <int>
                          3   587
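For comparison, the base R route mentioned earlier can be sketched on a small made-up data frame. The `toy` data frame and its values are hypothetical stand-ins, not the gettysburg data:

```r
# Hypothetical stand-in for the gettysburg data; all values are invented.
toy <- data.frame(
  type = c("Infantry", "Cavalry", "Artillery", "Infantry", "Infantry"),
  stringsAsFactors = FALSE
)

# Frequency of each level, using base R
type_counts <- table(toy$type)
print(type_counts)

# Number of distinct levels, using base R
n_levels <- length(unique(toy$type))
print(n_levels)
```

table() answers "how many rows per level", while the distinct count answers "how many levels"; the two are often checked together.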
Let's now look at a way to explore all of the categorical features using tidyverse principles. This approach lets you save the results as a tibble and examine them in depth as needed. Here is a way of putting all categorical features into a separate tibble:
gettysburg_cat <-
gettysburg[, sapply(gettysburg, class) == 'character']
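One possible caveat: comparing the result of sapply(x, class) to a single string assumes every column has exactly one class, which does not hold for, say, a date-time column. A minimal sketch of a more defensive selection follows, assuming a hypothetical `toy` data frame with invented values:

```r
# Hypothetical data frame with mixed column types; all values are invented.
toy <- data.frame(
  type     = c("Infantry", "Cavalry"),
  army     = c("Union", "Confederate"),
  strength = c(100L, 250L),
  recorded = as.POSIXct(c("1863-07-01", "1863-07-02"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# is.character() returns exactly one TRUE/FALSE per column,
# even when a column carries several classes (as `recorded` does)
is_chr <- vapply(toy, is.character, logical(1))

# drop = FALSE keeps a data frame even if only one column matches
toy_cat <- toy[, is_chr, drop = FALSE]
print(names(toy_cat))
```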
Using dplyr, you can now summarize all of the features and the number of distinct levels in each:
gettysburg_cat %>%
  dplyr::summarise_all(dplyr::n_distinct)
The output of the preceding code is as follows:
# A tibble: 1 x 9
   type state regiment_or_battery brigade division corps  army july1_Commander Cdr_casualty
  <int> <int>               <int>   <int>    <int> <int> <int>           <int>        <int>
      3    30                 275     124       38    14     2             586            6
Notice that july1_Commander has 586 distinct values across 587 observations. This means that two of the unit commanders share the same rank and last name. We can also surmise that a feature that is nearly unique per row will be of no value to any further analysis, but we'll deal with that issue a couple of sections ahead.
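To see which value is repeated, one option is to count each value and keep those that occur more than once. This is a sketch on a hypothetical `toy` data frame with invented commander labels; it assumes dplyr is installed and attached:

```r
library(dplyr)

# Hypothetical commander labels; all values are invented for illustration.
toy <- data.frame(
  commander = c("Col. Smith", "Gen. Jones", "Col. Smith", "Maj. Brown"),
  stringsAsFactors = FALSE
)

# The gap between rows and distinct values is the number of "extra" repeats
n_extra <- nrow(toy) - n_distinct(toy$commander)

# Values that appear more than once, with their frequency in column n
repeated <- toy %>%
  count(commander) %>%
  filter(n > 1)

print(repeated)
```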
Suppose we're interested in the number of observations for each of the levels for the Cdr_casualty feature. Yes, we could use table(), but how about producing the output as a tibble as discussed before? Give this code a try:
gettysburg_cat %>%
  dplyr::group_by(Cdr_casualty) %>%
  dplyr::summarize(num_rows = dplyr::n())
The output of the preceding code is as follows:
# A tibble: 6 x 2
  Cdr_casualty     num_rows
  <chr>               <int>
1 captured                6
2 killed                 29
3 mortally wounded       24
4 no                    405
5 wounded               104
6 wounded-captured       19
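A compact variant of the same idea uses count(), which groups and tallies in one step, and can add each level's share of the total. This is a sketch on a hypothetical `toy` tibble with invented values, assuming dplyr is installed and attached; the `share` column name is an illustrative choice:

```r
library(dplyr)

# Hypothetical casualty labels; all values are invented for illustration.
toy <- tibble(
  Cdr_casualty = c("no", "no", "wounded", "killed", "no", "wounded")
)

casualty_summary <- toy %>%
  # count() is shorthand for group_by() followed by summarize(n = n());
  # sort = TRUE puts the most frequent level first
  count(Cdr_casualty, name = "num_rows", sort = TRUE) %>%
  # proportion of all rows that fall in each level
  mutate(share = num_rows / sum(num_rows))

print(casualty_summary)
```

Because the result is still a tibble, it can be saved, filtered, or joined like any other.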
Speaking of tables, let's look at a tibble-friendly way of producing a two-way table. This code cross-tabulates commander casualties by army:
gettysburg_cat %>%
janitor::tabyl(army, Cdr_casualty)
The output of the preceding code is as follows:
        army captured killed mortally wounded  no wounded wounded-captured
 Confederate        2     15               13 165      44               17
       Union        4     14               11 240      60                2
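Raw counts are hard to compare because the two armies have different numbers of units. As a rough sketch, the counts printed above can be typed into a base R matrix and converted to within-army proportions; the object names `casualty_counts` and `row_share` are illustrative choices:

```r
# Counts copied from the cross-tabulation printed above
casualty_counts <- matrix(
  c(2, 15, 13, 165, 44, 17,
    4, 14, 11, 240, 60,  2),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    army = c("Confederate", "Union"),
    Cdr_casualty = c("captured", "killed", "mortally wounded",
                     "no", "wounded", "wounded-captured")
  )
)

# margin = 1 divides each cell by its row total,
# so every army's row sums to 1
row_share <- prop.table(casualty_counts, margin = 1)
print(round(row_share, 3))
```

If you prefer to stay within janitor, adorn_percentages("row") applied to the tabyl should give a comparable view.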
Explore the data on your own and, once you're comfortable with the categorical variables, let's tackle the issue of missing values.