Merging data enables us to understand how different data sources relate to each other. The merge operation in R is similar to the join operation in a database, which combines fields from two datasets using values that are common to each.
Getting ready
Refer to the Converting data types recipe and convert each attribute of imported data into the proper data type. Also, rename the columns of the employees and salaries datasets by following the steps from the Renaming the data variable recipe.
How to do it…
Perform the following steps to merge salaries and employees:
As employees and salaries are common in emp_no, we can merge these two datasets using emp_no as the join key:
In addition to the merge function, we can install and load the plyr package to manipulate data:
> install.packages("plyr")> library(plyr)
Besides the standard merge function, we can use the join function in plyr to merge data:
> join(employees, salaries, by="emp_no")
How it works…
Similarly to data tables in a database, we sometimes need to combine two datasets for the purpose of correlating data. In R, we can simply combine two different data frames with common values using the merge function.
In the merge function, we use both salaries and employees as our input data frame. For the by parameter, we can specify emp_no as the key to join these two tables. We will then see that the data with the same emp_no value has now merged into a new data frame. However, sometimes we want to perform either a left join or a right join for the purpose of preserving every data value from either employees or salaries. To perform the left join, we can set all.x to TRUE. Then, we can find every row from the employees dataset preserved in the merged dataset. On the other hand, if one wants to preserve all rows from the salaries dataset, we can set all.y to TRUE.
In addition to using the built-in merge function, we can install the plyr package to merge datasets. The usage of join is very similar to merge; we only have to specify the data to join and the columns with the common values within the by parameter.
There's more…
In the plyr package, we can use the join_all function to join recursive datasets within a list. Here, we can use join_all to join the employees and salaries datasets by emp_no: