
  • R Programming By Example
  • Omar Trejo Navarro
  • 534 words
  • 2021-07-02 21:30:37

Factors

When analyzing data, it's quite common to encounter categorical values. R represents categorical values using factors, which are created with the factor() function and are integer vectors with an associated label for each integer. The different values that a factor can take are called levels. The levels() function shows all the levels of a factor, and the levels parameter of the factor() function can be used to define their order explicitly; when no order is given, the levels are sorted alphabetically.

Note that defining an explicit order can be important in linear modeling because the first level is used as the baseline level for functions like lm() (linear models), which we will use in Chapter 3, Predicting Votes with Linear Models.
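As a minimal sketch of the baseline effect described above, the following compares the coefficient names lm() produces for the same data under two level orders. The colors and value vectors are made-up data used only for illustration, and the example assumes R's default treatment contrasts:

```r
# Made-up data, for illustration only
colors <- c("Blue", "Red", "Black", "Blue", "Red", "Black")
value <- c(1.0, 2.0, 3.0, 1.5, 2.5, 3.5)

# Default (alphabetical) order: "Black" is the first level, so it is the baseline
default_order <- factor(colors)
# Explicit order: "Red" is the first level, so it is the baseline
explicit_order <- factor(colors, levels = c("Red", "Black", "Blue"))

default_names <- names(coef(lm(value ~ default_order)))
explicit_names <- names(coef(lm(value ~ explicit_order)))

default_names
#> [1] "(Intercept)" "default_orderBlue" "default_orderRed"
explicit_names
#> [1] "(Intercept)" "explicit_orderBlack" "explicit_orderBlue"
```

In each case the baseline level gets no coefficient of its own; it is absorbed into the intercept, and the remaining coefficients are measured relative to it.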

Furthermore, printing a factor shows slightly different information than printing a character vector. In particular, note that the quotes are not shown and that the levels are explicitly printed in order afterwards:

x <- c("Blue", "Red", "Black", "Blue")
y <- factor(c("Blue", "Red", "Black", "Blue"))
z <- factor(c("Blue", "Red", "Black", "Blue"),
            levels = c("Red", "Black", "Blue"))
x
#> [1] "Blue" "Red" "Black" "Blue"
y
#> [1] Blue Red Black Blue
#> Levels: Black Blue Red
z
#> [1] Blue Red Black Blue
#> Levels: Red Black Blue
levels(y)
#> [1] "Black" "Blue" "Red"
levels(z)
#> [1] "Red" "Black" "Blue"

Factors can sometimes be tricky to work with because their types are interpreted differently depending on which function is used to operate on them. Remember the class() and typeof() functions we used before? When used on factors, they may produce unexpected results. As you can see below, the class() function will identify x and y as being character and factor, respectively. However, the typeof() function will let us know that they are character and integer, respectively. Confusing, isn't it? This happens because, as we mentioned, factors are stored internally as integers, and use a mechanism similar to look-up tables to retrieve the actual string associated with each one.

Technically, the way factors store the strings associated with their integer values is through attributes, which is a topic we will touch on in Chapter 8, Object-Oriented System to Track Cryptocurrencies.

class(x)
#> [1] "character"
class(y)
#> [1] "factor"
typeof(x)
#> [1] "character"
typeof(y)
#> [1] "integer"
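The look-up mechanism described above can be inspected directly. The following sketch uses only base R functions, recreating the y factor from earlier, to show the integer codes, the attributes that hold the labels, and how indexing the levels by the codes recovers the original strings:

```r
y <- factor(c("Blue", "Red", "Black", "Blue"))

# The underlying integer codes, one per element
codes <- as.integer(y)
codes
#> [1] 2 3 1 2

# The labels and the class are stored as attributes of the integer vector
attribute_names <- names(attributes(y))
attribute_names
#> [1] "levels" "class"

# Indexing the levels by the codes acts like a look-up table
recovered <- levels(y)[codes]
recovered
#> [1] "Blue" "Red" "Black" "Blue"
```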

While factors look and often behave like character vectors, as we mentioned, they are actually integer vectors, so be careful when treating them like strings. Some string functions, like gsub() and grepl(), will coerce factors to characters, while others, like nchar(), will throw an error. Still others, like c(), used the underlying integer values in R versions before 4.1.0; from R 4.1.0 onward, c() applied to factors returns a factor instead. For this reason, it's usually best to explicitly convert factors to the data type you need:

gsub("Black", "White", x)
#> [1] "Blue" "Red" "White" "Blue"
gsub("Black", "White", y)
#> [1] "Blue" "Red" "White" "Blue"
nchar(x)
#> [1] 4 3 5 4
nchar(y)
#> Error in nchar(y): 'nchar()' requires a character vector
c(x)
#> [1] "Blue" "Red" "Black" "Blue"
c(y)
#> [1] 2 3 1 2
# Output from R versions before 4.1.0; from R 4.1.0 onward, c(y) returns a factor

If you did not notice, the nchar() function was applied to each of the elements in the x character vector. The "Blue", "Red", and "Black" strings have 4, 3, and 5 characters, respectively. This is another example of the vectorized operations we mentioned in the vectors section earlier.
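Returning to the advice about explicit conversion, here is a short sketch of what that can look like. The f factor of digit strings is a hypothetical example that does not appear in the text above; it is included because converting such a factor directly to a number returns the integer codes rather than the values the labels spell out:

```r
y <- factor(c("Blue", "Red", "Black", "Blue"))

# Convert to character first, then apply the string function
name_lengths <- nchar(as.character(y))
name_lengths
#> [1] 4 3 5 4

# Hypothetical factor whose labels look like numbers
f <- factor(c("10", "20", "10"))

# Direct conversion returns the internal codes, not the labels
f_codes <- as.integer(f)
f_codes
#> [1] 1 2 1

# Going through character recovers the numbers the labels represent
f_values <- as.numeric(as.character(f))
f_values
#> [1] 10 20 10
```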
