官术网_书友最值得收藏!

Reshaping data

Reshaping data is similar to creating a contingency table, which enables the user to aggregate data of specific values. The reshape2 package is designed for this specific purpose. Here, we introduce how to use the reshape2 package to transform our dataset from long to wide format with the dcast function. We also cover how to transform it from wide format back to long format with the melt function.

Getting ready

Refer to the Merging data recipe and merge employees and salaries into employees_salary.

How to do it…

Perform the following steps to reshape data:

  1. First, we can use the dcast function to transform data from long to wide:
    > wide_salaries <- dcast(salaries, emp_no ~ year(ymd(from_date)), value.var="salary")
    > wide_salaries[1:3, 1:7]
     emp_no 1985 1986 1987 1988 1989 1990
    1 10001 NA 60117 62102 66074 66596 66961
    2 10002 NA NA NA NA NA NA
    3 10003 NA NA NA NA NA NA
    
  2. We can also transform the data by keeping emp_no and the formatted name string as two of the attributes. Then, we can shape the year of salary payment as the column name and shape salary as its value:
    > wide_employees_salary <- dcast(employees_salary, emp_no + paste(first_name, last_name) ~ year(ymd(from_date)), value.var="salary", variable.name="condition")
    > wide_employees_salary[1:3,1:7]
     emp_no paste(first_name, last_name) 1985 1986 1987 1988 1989
    1 10001 Georgi Facello NA 60117 62102 66074 66596
    2 10002 Bezalel Simmel NA NA NA NA NA
    3 10003 Parto Bamford NA NA NA NA NA
    
  3. On the other hand, we can transform the data from wide back to long format using the melt function:
    > long_salaries <- melt(wide_salaries, id.vars=c("emp_no"))
    > head(long_salaries)
     emp_no variable value
    1 10001 1985 NA
    2 10002 1985 NA
    3 10003 1985 NA
    4 10004 1985 NA
    5 10005 1985 NA
    6 10006 1985 NA
    
  4. To remove data with missing values in long_salaries, we can use na.omit to remove this data:
    > head(na.omit(long_salaries))
     emp_no variable value
    9 10009 1985 60929
    13 10013 1985 40000
    48 10048 1985 40000
    64 10064 1985 40000
    70 10070 1985 55999
    98 10098 1985 40000
    

How it works…

In this recipe, we demonstrated how to reshape data with the reshape2 package. First, we used the dcast function to transform data from the long to wide format. Using this function, we specified that we wanted the data to transform as salaries in the first parameter. Then, we specified the shaping formula by setting emp_no as row and the year of salary paid as the column. Finally, we set salary as presented value in wide format layout.

We can also shape the data with multiple columns; all that we need to do is add another column information on the left-hand side of the formula with a + operator. Thus, in our second case, we set the shaping formula as emp_no + paste(first_name, last_name) ~ year(ymd(from_date)) on the employees_salary merged data. Then, we saw that the output data was presented with emp_no and the formatted name on the left-hand side, and the year of salary paid on the upper side and the presented salary as the value.

Besides shaping the data from long format to wide format, we can also reshape the data back to long format using the melt function. Therefore, we can transform the wide_salaries data back to long format using emp_no as the basis. As there are many missing values (presented as NA), we can use the na.omit function to remove these records.

There's more…

In addition to the dcast and melt functions in the plyr package, we can use stack and unstack to ungroup and group values:

  1. We can group data by value using the unstack function:
    > un_salaries <- unstack(long_salaries[,c(3,1)])
    > head(un_salaries, 3)
    $`10001`
     [1] 60117 62102 66074 66596 66961 71046 74333 75286 75994 76884 80013
    [12] 81025 81097 84917 85112 85097 88958
    
    $`10002`
    [1] 65828 65909 67534 69366 71963 72527
    
    $`10003`
    [1] 40006 43616 43466 43636 43478 43699 43311
    
  2. In contrast, we can concatenate multiple data frames or lists using the stack function:
    > stack_salaries <- stack(un_salaries )
    > head(stack_salaries)
     values ind
    1 60117 10001
    2 62102 10001
    3 66074 10001
    4 66596 10001
    5 66961 10001
    6 71046 10001
    
主站蜘蛛池模板: 运城市| 长寿区| 金平| 乐业县| 阜新市| 哈巴河县| 长沙市| 塔城市| 衡东县| 林西县| 定州市| 科尔| 阿合奇县| 高州市| 阿克苏市| 富锦市| 福安市| 通海县| 车致| 嘉定区| 鄂尔多斯市| 彭阳县| 正宁县| 鹿泉市| 科技| 丹阳市| 新宁县| 潍坊市| 什邡市| 巫溪县| 萨嘎县| 宣城市| 南漳县| 将乐县| 墨江| 罗平县| 高陵县| 桓台县| 嘉荫县| 林芝县| 虹口区|