- Advanced Machine Learning with R
- Cory Lesmeister Dr. Sunil Kumar Chinnamgari
- 2021-06-24 14:24:33
Exploring categorical variables
There are many ways to get to know your categorical variables. The base R table() function gives a quick frequency count for a single feature. If you just want to see how many distinct levels a feature has, dplyr works well. In this example, we examine type, which has three unique levels:
dplyr::count(gettysburg, dplyr::n_distinct(type))
The output of the preceding code is as follows:
# A tibble: 1 x 2
  `dplyr::n_distinct(type)`     n
                      <int> <int>
1                         3   587
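As a quick side-by-side of the two approaches mentioned above, here is a minimal sketch on a toy data frame. The gettysburg data is not reproduced here, so the `toy` object and its values are made up for illustration only.

```r
# Toy stand-in for the gettysburg data; the values are made up for illustration
toy <- data.frame(
  type = c("Infantry", "Cavalry", "Artillery", "Infantry", "Infantry", "Cavalry"),
  stringsAsFactors = FALSE
)

# Base R: frequency of each level
type_counts <- table(toy$type)
print(type_counts)

# Number of distinct levels, in base R and with dplyr
n_levels_base <- length(unique(toy$type))
n_levels_dplyr <- dplyr::n_distinct(toy$type)
```

table() answers "how often does each level occur?", while n_distinct() only answers "how many levels are there?", so which one to reach for depends on the question.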
Let's now look at a way to explore all of the categorical features using tidyverse principles. This approach lets you save the result as a tibble and examine it in depth as needed. Here is a way of putting all categorical features into a separate tibble:
gettysburg_cat <-
gettysburg[, sapply(gettysburg, class) == 'character']
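One caveat with comparing `sapply(x, class)` to a string: if any column carries more than one class (a date-time column, for example), sapply() returns a list rather than a character vector, and the comparison no longer works as a simple per-column test. A slightly more defensive variant, sketched here on a made-up `toy` data frame, tests each column with is.character() instead:

```r
# Toy stand-in for the gettysburg data; column names and values are illustrative
toy <- data.frame(
  type = c("Infantry", "Cavalry", "Artillery"),
  army = c("Union", "Confederate", "Union"),
  strength = c(300L, 150L, 90L),
  stringsAsFactors = FALSE
)

# vapply() guarantees one TRUE/FALSE per column
is_chr <- vapply(toy, is.character, logical(1))

# drop = FALSE keeps a data frame even if only one column matches
toy_cat <- toy[, is_chr, drop = FALSE]
names(toy_cat)
```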
Using dplyr, you can now summarize all of the features and the number of distinct levels in each:
# Count the distinct values in every column
gettysburg_cat %>%
  dplyr::summarise_all(dplyr::n_distinct)
The output of the preceding code is as follows:
# A tibble: 1 x 9
   type state regiment_or_battery brigade division corps  army july1_Commander Cdr_casualty
  <int> <int>               <int>   <int>    <int> <int> <int>           <int>        <int>
1     3    30                 275     124       38    14     2             586            6
Notice that july1_Commander has 586 distinct values across the 587 observations, so exactly two units share a value. In other words, two of the unit commanders have the same rank and last name. We can also surmise that a feature this close to unique per row will be of no value to any further analysis, but we'll deal with that issue a couple of sections ahead.
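If you want to see which rows are behind such a repeat, one option is base R's duplicated(). The following is only a sketch on a made-up `toy` data frame; the unit and commander values are hypothetical.

```r
# Toy data: four units, one commander value appears twice (values are made up)
toy <- data.frame(
  unit = c("A", "B", "C", "D"),
  commander = c("Col. Smith", "Col. Jones", "Col. Smith", "Maj. Lee"),
  stringsAsFactors = FALSE
)

# Fewer distinct values than rows means at least one value repeats
nrow(toy)                         # 4
dplyr::n_distinct(toy$commander)  # 3

# Keep every row whose commander value occurs more than once
repeated_values <- toy$commander[duplicated(toy$commander)]
dupes <- toy[toy$commander %in% repeated_values, ]
dupes
```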
Suppose we're interested in the number of observations for each of the levels for the Cdr_casualty feature. Yes, we could use table(), but how about producing the output as a tibble as discussed before? Give this code a try:
gettysburg_cat %>%
  dplyr::group_by(Cdr_casualty) %>%
  dplyr::summarize(num_rows = dplyr::n())
The output of the preceding code is as follows:
# A tibble: 6 x 2
  Cdr_casualty     num_rows
  <chr>               <int>
1 captured                6
2 killed                 29
3 mortally wounded       24
4 no                    405
5 wounded               104
6 wounded-captured       19
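The group_by() plus summarize() pattern can also be written in one step with dplyr::count(), which additionally accepts `sort = TRUE` to put the most frequent level first. A small sketch on made-up data, reusing the Cdr_casualty column name from the text:

```r
# Toy stand-in for gettysburg_cat; the values are made up for illustration
toy <- data.frame(
  Cdr_casualty = c("no", "wounded", "no", "killed", "no", "wounded"),
  stringsAsFactors = FALSE
)

# count() groups, tallies, and ungroups; name sets the count column's name
casualty_counts <- dplyr::count(toy, Cdr_casualty, name = "num_rows", sort = TRUE)
casualty_counts
```

Because the result is an ordinary data frame (or tibble, if the input is one), it can be saved and filtered like any other table.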
Speaking of tables, let's look at a tibble-friendly way of producing a cross-tabulation of two features. This code compares commander casualties by army:
gettysburg_cat %>%
janitor::tabyl(army, Cdr_casualty)
The output of the preceding code is as follows:
        army captured killed mortally wounded  no wounded wounded-captured
 Confederate        2     15               13 165      44               17
       Union        4     14               11 240      60                2
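Raw counts are hard to compare when the two groups differ in size, so it can help to convert each row to proportions. The sketch below uses base R's table() and prop.table() rather than janitor, on a made-up `toy` data frame, purely to illustrate the idea:

```r
# Toy stand-in for gettysburg_cat; the values are made up for illustration
toy <- data.frame(
  army = c("Union", "Union", "Union", "Union", "Confederate", "Confederate"),
  Cdr_casualty = c("no", "no", "wounded", "killed", "no", "wounded"),
  stringsAsFactors = FALSE
)

# Cross-tabulate the two features, then divide each row by its total
counts <- table(toy$army, toy$Cdr_casualty)
row_share <- prop.table(counts, margin = 1)
round(row_share, 2)
```

With `margin = 1`, each row sums to 1, so the entries read as "share of this army's commanders in each casualty category".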
Explore the data on your own. Once you're comfortable with the categorical variables, we'll tackle the issue of missing values.