官术网_书友最值得收藏!

Merging data

Merging data enables us to understand how different data sources relate to each other. The merge operation in R is similar to the join operation in a database, which combines fields from two datasets using values that are common to each.

Getting ready

Refer to the Converting data types recipe and convert each attribute of imported data into the proper data type. Also, rename the columns of the employees and salaries datasets by following the steps from the Renaming the data variable recipe.

How to do it…

Perform the following steps to merge salaries and employees:

  1. As employees and salaries are common in emp_no, we can merge these two datasets using emp_no as the join key:
    > employees_salary <- merge(employees, salaries, by="emp_no")
    > head(employees_salary,3)
     emp_no birth_date first_name last_name salary from_date to_date
    1 10001 1953-09-02 Georgi Facello 60117 1986-06-26 1987-06-26
    2 10001 1953-09-02 Georgi Facello 62102 1987-06-26 1988-06-25
    3 10001 1953-09-02 Georgi Facello 66596 1989-06-25 1990-06-25
    
  2. Or, we can assign NULL to the attribute that we want to drop:
    > merge(employees, salaries, by="emp_no", all.x =TRUE)
    
  3. In addition to the merge function, we can install and load the plyr package to manipulate data:
    > install.packages("plyr")
    > library(plyr)
    
  4. Besides the standard merge function, we can use the join function in plyr to merge data:
    > join(employees, salaries, by="emp_no")
    

How it works…

Similarly to data tables in a database, we sometimes need to combine two datasets for the purpose of correlating data. In R, we can simply combine two different data frames with common values using the merge function.

In the merge function, we use both salaries and employees as our input data frame. For the by parameter, we can specify emp_no as the key to join these two tables. We will then see that the data with the same emp_no value has now merged into a new data frame. However, sometimes we want to perform either a left join or a right join for the purpose of preserving every data value from either employees or salaries. To perform the left join, we can set all.x to TRUE. Then, we can find every row from the employees dataset preserved in the merged dataset. On the other hand, if one wants to preserve all rows from the salaries dataset, we can set all.y to TRUE.

In addition to using the built-in merge function, we can install the plyr package to merge datasets. The usage of join is very similar to merge; we only have to specify the data to join and the columns with the common values within the by parameter.

There's more…

In the plyr package, we can use the join_all function to join recursive datasets within a list. Here, we can use join_all to join the employees and salaries datasets by emp_no:

> join_all(list(employees, salaries), "emp_no")
主站蜘蛛池模板: 普宁市| 剑川县| 柳州市| 施秉县| 新和县| 玉环县| 邻水| 钟山县| 吉水县| 商都县| 利津县| 德化县| 巴马| 南澳县| 贺州市| 和平区| 什邡市| 徐州市| 攀枝花市| 浦东新区| 洛南县| 临海市| 虹口区| 绥江县| 长岭县| 丽江市| 康马县| 新绛县| 望江县| 上林县| 吉首市| 舟曲县| 安平县| 抚顺县| 棋牌| 临汾市| 吉林市| 阿克陶县| 左权县| 韩城市| 米脂县|