官术网_书友最值得收藏!

Filtering data

Data filtering is the most common requirement for users who want to analyze partial data of interest rather than the whole dataset. In database operations, we can use a SQL command with a where clause to subset the data. In R, we can simply use the square bracket to perform filtering.

Getting ready

Refer to the Converting data types recipe and convert each attribute of imported data into the proper data type. Also, rename the columns of the employees and salaries datasets by following the steps from the Renaming the data variable recipe.

How to do it…

Perform the following steps to filter data:

  1. First, use head and tail to subset the first three rows and last three rows from the employees dataset:
    > head(employees, 3)
     emp_no birth_date first_name last_name gender hire_date
    1 10001 1953-09-02 Georgi Facello M 1986-06-26
    2 10002 1964-06-02 Bezalel Simmel F 1985-11-21
    3 10003 1959-12-03 Parto Bamford M 1986-08-28
    
    > tail(employees, 3)
     emp_no birth_date first_name last_name gender hire_date
    8 10008 1958-02-19 Saniya Kalloufi M 1994-09-15
    9 10009 1952-04-19 Sumant Peac F 1985-02-18
    10 10010 1963-06-01 Duangkaew Piveteau F 1989-08-24
    
  2. You can also use the square bracket to subset the first three rows of the data with a given sequence from 1 to 3:
    > employees[1:3,]
     emp_no birth_date first_name last_name gender hire_date
    1 10001 1953-09-02 Georgi Facello M 1986-06-26
    2 10002 1964-06-02 Bezalel Simmel F 1985-11-21
    3 10003 1959-12-03 Parto Bamford M 1986-08-28
    
  3. Then, you can also specify the sequence of columns that you should select:
    > employees[1:3, 2:4]
     birth_date first_name last_name
    1 1953-09-02 Georgi Facello
    2 1964-06-02 Bezalel Simmel
    3 1959-12-03 Parto Bamford
    
  4. Besides subsetting a sequence of columns and rows from the dataset, you can specify certain rows and columns to subset with index vectors:
    > employees[c(2,5), c(1,3)]
     emp_no first_name
    2 10002 Bezalel
    5 10005 Kyoichi
    
  5. If you know the name of the column, you can also select columns with a given name vector:
    > employees[1:3, c("first_name","last_name")]
     first_name last_name
    1 Georgi Facello
    2 Bezalel Simmel
    3 Parto Bamford
    
  6. On the other hand, you can exclude columns with negative index:
    > employees[1:3,-6]
     emp_no birth_date first_name last_name gender
    1 10001 1953-09-02 Georgi Facello M
    2 10002 1964-06-02 Bezalel Simmel F
    3 10003 1959-12-03 Parto Bamford M
    
  7. You can also exclude some attributes using the in and ! operators:
    > employees[1:3, !names(employees) %in% c("last_name", "first_name")]
     emp_no birth_date gender hire_date
    1 10001 1953-09-02 M 1986-06-26
    2 10002 1964-06-02 F 1985-11-21
    3 10003 1959-12-03 M 1986-08-28
    
  8. Furthermore, you can set the equal condition to subset data:
    > employees[employees$gender == 'M',]
     emp_no birth_date first_name last_name gender hire_date
    1 10001 1953-09-02 Georgi Facello M 1986-06-26
    3 10003 1959-12-03 Parto Bamford M 1986-08-28
    4 10004 1954-05-01 Chirstian Koblick M 1986-12-01
    5 10005 1955-01-21 Kyoichi Maliniak M 1989-09-12
    8 10008 1958-02-19 Saniya Kalloufi M 1994-09-15
    
  9. You can use a comparison operator to subset data:
    > salaries[salaries$salary >= 60000 & salaries$salary < 70000,]
    
  10. Additionally, the substr function can extract partial records:
    > employees[substr(employees$first_name,0,2)=="Ge",]
     emp_no birth_date first_name last_name gender hire_date
    1 10001 1953-09-02 Georgi Facello M 1986-06-26
    
  11. The regular expression is another useful and powerful tool for a user to subset data that they are interested in:
    > employees[grep('[aeiou]$', employees$first_name),]
     emp_no birth_date first_name last_name gender hire_date
    1 10001 1953-09-02 Georgi Facello M 1986-06-26
    3 10003 1959-12-03 Parto Bamford M 1986-08-28
    5 10005 1955-01-21 Kyoichi Maliniak M 1989-09-12
    6 10006 1953-04-20 Anneke Preusig F 1989-06-02
    8 10008 1958-02-19 Saniya Kalloufi M 1994-09-15
    

How it works…

In this recipe, we demonstrated how to filter data with R. In the first case, we used the head and tail functions to examine the first few rows. By default, the head and tail functions will return the first six rows and last six rows of the dataset. We can still specify the number of records to subset in the second input parameter of the function.

Besides using the head and tail functions, we can use a square bracket to subset data. Using the square bracket, the value on the left-hand side of the comma assigns the rows to subset, and the value on the right-hand side of the comma indicates the columns to select. In the second step, we demonstrated that we can subset the first three records from the dataset by assigning a sequence of 1 to 3 on the left-hand side of the comma. If we do not specify anything on the right-hand side of the comma, this means that we will select all variables in our subset. Otherwise, we can specify the columns to select on the right-hand side of the comma. Similarly to step 3, we can select the second to fourth column by specifying a sequence on the right-hand side of the comma; select columns with a given c(3,5) index vector. Furthermore, we can select data with a given c("first_name","last_name") attribute name vector.

In addition to selecting the variables that we require, we can exclude the columns that we do not need with a negative index. Thus, we can place -6 on the right-hand side of the comma to exclude the sixth column from the dataset. We can also use both the ! and in operators to exclude data with a certain column name. In the seventh case, we can exclude attributes with the name as first_name and last_name.

Furthermore, we can filter data similar to SQL operation with a given condition. Here, as we need to use conditions to filter records from the data, we should place the criteria on the left-hand side of the comma. Thus, in cases 8 to 10, we demonstrate that we can filter male employee data with an equality condition, extract salary data ranging between 60,000 and 70,000, and retrieve employees with the first two letters matching Ge with the substr function. Finally, we can also employ the grep function along with a regular expression to subset employees with a vowel as the final letter of their first name.

There's more…

Besides using square brackets, we can also use the subset function to subset data:

  1. We can select first_name and last_name of the first three rows of the employees data:
    > subset(employees, rownames(employees) %in% 1:3, select=c("first_name","last_name"))
     first_name last_name
    1 Georgi Facello
    2 Bezalel Simmel
    3 Parto Bamford
    
  2. We can also set the condition to filter employee data by gender:
    >subset(employees, employees$gender == 'M')
     emp_no birth_date first_name last_name gender hire_date
    1 10001 1953-09-02 Georgi Facello M 1986-06-26
    3 10003 1959-12-03 Parto Bamford M 1986-08-28
    4 10004 1954-05-01 Chirstian Koblick M 1986-12-01
    5 10005 1955-01-21 Kyoichi Maliniak M 1989-09-12
    8 10008 1958-02-19 Saniya Kalloufi M 1994-09-15
    
主站蜘蛛池模板: 枝江市| 宿迁市| 牡丹江市| 阜康市| 衡阳县| 裕民县| 桦川县| 当涂县| 丰城市| 禄劝| 西平县| 沅陵县| 益阳市| 荆州市| 新源县| 扎兰屯市| 铜陵市| 五台县| 高碑店市| 湘西| 东乡| 遂平县| 铜鼓县| 邯郸县| 会宁县| 电白县| 祁阳县| 龙州县| 丹阳市| 高台县| 廊坊市| 平遥县| 万宁市| 贺州市| 会同县| 信丰县| 宝兴县| 柯坪县| 尖扎县| 奉节县| 太和县|