官术网_书友最值得收藏!

  • R Programming By Example
  • Omar Trejo Navarro
  • 561字
  • 2021-07-02 21:30:38

Data frames

Now we turn to data frames, which are a lot like spreadsheets or database tables. In scientific contexts, experiments consist of inpidual observations (rows), each of which involves several different variables (columns). Often, these variables contain different data types, which would not be possible to store in matrices since they must contain a single data type. A data frame is a natural way to represent such heterogeneous tabular data. Every element within a column must be of the same type, but different elements within a row may be of different types, that's why we say that a data frame is a heterogeneous data structure.

Technically, a data frame is a list whose elements are equal-length vectors, and that's why it permits heterogeneity.

Data frames are usually created by reading in a data using the read.table(), read.csv(), or other similar data-loading functions. However, they can also be created explicitly with the data.frame() function or they can be coerced from other types of objects such as lists. To create a data frame using the data.frame() function, note that we send a vector (which, as we know, must contain elements of a single type) to each of the column names we want our data frame to have, which are A, B, and C in this case. The data frame we create below has four rows (observations) and three variables, with numeric, character, and logical types, respectively. Finally, extract subsets of data using the matrix techniques we saw earlier, but you can also reference columns using the $ operator and then extract elements from them:

x <- data.frame(
    A = c(1, 2, 3, 4),
    B = c("D", "E", "F", "G"),
    C = c(TRUE, FALSE, NA, FALSE)
)
x[1, ]
#> A B C
#> 1 1 D TRUE
x[, 1]
#> [1] 1 2 3 4
x[1:2, 1:2]
#> A B
#> 1 1 D
#> 2 2 E
x$B
#> [1] D E F G
#> Levels: D E F G
x$B[2]
#> [1] E
#> Levels: D E F G

Depending on how the data is organized, the data frame is said to be in either wide or narrow formats (https://en.wikipedia.org/wiki/Wide_and_narrow_data). Finally, if you want to keep only observations for which you have complete cases, meaning only rows that don't contain any NA values for any of the variables, then you should use the complete.cases() function, which returns a logical vector of length equal to the number of rows, and which contains a TRUE value for those rows that don't have any NA values and FALSE for those that have at least one such value.

Note that when we created the x data frame, the C column contains an NA in its third value. If we use the complete.cases() function on x, then we will get a FALSE value for that row and a TRUE value for all others. We can then use this logical vector to subset the data frame just as we have done before with matrices. This can be very useful when analyzing data that may not be clean, and for which you only want to keep those observations for which you have full information:

x
#> A B C
#> 1 1 D TRUE
#> 2 2 E FALSE
#> 3 3 F NA
#> 4 4 G FALSE

complete.cases(x)
#> [1] TRUE TRUE FALSE TRUE
x[complete.cases(x), ]
#> A B C
#> 1 1 D TRUE
#> 2 2 E FALSE
#> 4 4 G FALSE
主站蜘蛛池模板: 全椒县| 中方县| 隆安县| 河西区| 敦煌市| 安顺市| 龙海市| 东至县| 政和县| 博客| 肇庆市| 营山县| 徐闻县| 马鞍山市| 周口市| 霍林郭勒市| 南宫市| 驻马店市| 鄂托克前旗| 新蔡县| 信丰县| 嵩明县| 云和县| 陆河县| 福安市| 锡林郭勒盟| 咸宁市| 惠安县| 布尔津县| 明星| 台州市| 庆安县| 吉林省| 山阴县| 日土县| 澄城县| 宁蒗| 明光市| 宁国市| 黎城县| 望都县|