- R for Data Science Cookbook
- Yu Wei Chiu (David Chiu)
- 1078字
- 2021-07-14 10:51:29
Filtering data
Data filtering is the most common requirement for users who want to analyze partial data of interest rather than the whole dataset. In database operations, we can use a SQL command with a where
clause to subset the data. In R, we can simply use the square bracket to perform filtering.
Getting ready
Refer to the Converting data types recipe and convert each attribute of imported data into the proper data type. Also, rename the columns of the employees
and salaries
datasets by following the steps from the Renaming the data variable recipe.
How to do it…
Perform the following steps to filter data:
- First, use
head
andtail
to subset the first three rows and last three rows from theemployees
dataset:> head(employees, 3) emp_no birth_date first_name last_name gender hire_date 1 10001 1953-09-02 Georgi Facello M 1986-06-26 2 10002 1964-06-02 Bezalel Simmel F 1985-11-21 3 10003 1959-12-03 Parto Bamford M 1986-08-28 > tail(employees, 3) emp_no birth_date first_name last_name gender hire_date 8 10008 1958-02-19 Saniya Kalloufi M 1994-09-15 9 10009 1952-04-19 Sumant Peac F 1985-02-18 10 10010 1963-06-01 Duangkaew Piveteau F 1989-08-24
- You can also use the square bracket to subset the first three rows of the data with a given sequence from
1
to3
:> employees[1:3,] emp_no birth_date first_name last_name gender hire_date 1 10001 1953-09-02 Georgi Facello M 1986-06-26 2 10002 1964-06-02 Bezalel Simmel F 1985-11-21 3 10003 1959-12-03 Parto Bamford M 1986-08-28
- Then, you can also specify the sequence of columns that you should select:
> employees[1:3, 2:4] birth_date first_name last_name 1 1953-09-02 Georgi Facello 2 1964-06-02 Bezalel Simmel 3 1959-12-03 Parto Bamford
- Besides subsetting a sequence of columns and rows from the dataset, you can specify certain rows and columns to subset with index vectors:
> employees[c(2,5), c(1,3)] emp_no first_name 2 10002 Bezalel 5 10005 Kyoichi
- If you know the name of the column, you can also select columns with a given name vector:
> employees[1:3, c("first_name","last_name")] first_name last_name 1 Georgi Facello 2 Bezalel Simmel 3 Parto Bamford
- On the other hand, you can exclude columns with negative index:
> employees[1:3,-6] emp_no birth_date first_name last_name gender 1 10001 1953-09-02 Georgi Facello M 2 10002 1964-06-02 Bezalel Simmel F 3 10003 1959-12-03 Parto Bamford M
- You can also exclude some attributes using the
in
and!
operators:> employees[1:3, !names(employees) %in% c("last_name", "first_name")] emp_no birth_date gender hire_date 1 10001 1953-09-02 M 1986-06-26 2 10002 1964-06-02 F 1985-11-21 3 10003 1959-12-03 M 1986-08-28
- Furthermore, you can set the equal condition to subset data:
> employees[employees$gender == 'M',] emp_no birth_date first_name last_name gender hire_date 1 10001 1953-09-02 Georgi Facello M 1986-06-26 3 10003 1959-12-03 Parto Bamford M 1986-08-28 4 10004 1954-05-01 Chirstian Koblick M 1986-12-01 5 10005 1955-01-21 Kyoichi Maliniak M 1989-09-12 8 10008 1958-02-19 Saniya Kalloufi M 1994-09-15
- You can use a comparison operator to subset data:
> salaries[salaries$salary >= 60000 & salaries$salary < 70000,]
- Additionally, the
substr
function can extract partial records:> employees[substr(employees$first_name,0,2)=="Ge",] emp_no birth_date first_name last_name gender hire_date 1 10001 1953-09-02 Georgi Facello M 1986-06-26
- The regular expression is another useful and powerful tool for a user to subset data that they are interested in:
> employees[grep('[aeiou]$', employees$first_name),] emp_no birth_date first_name last_name gender hire_date 1 10001 1953-09-02 Georgi Facello M 1986-06-26 3 10003 1959-12-03 Parto Bamford M 1986-08-28 5 10005 1955-01-21 Kyoichi Maliniak M 1989-09-12 6 10006 1953-04-20 Anneke Preusig F 1989-06-02 8 10008 1958-02-19 Saniya Kalloufi M 1994-09-15
How it works…
In this recipe, we demonstrated how to filter data with R. In the first case, we used the head
and tail
functions to examine the first few rows. By default, the head
and tail
functions will return the first six rows and last six rows of the dataset. We can still specify the number of records to subset in the second input parameter of the function.
Besides using the head
and tail
functions, we can use a square bracket to subset data. Using the square bracket, the value on the left-hand side of the comma assigns the rows to subset, and the value on the right-hand side of the comma indicates the columns to select. In the second step, we demonstrated that we can subset the first three records from the dataset by assigning a sequence of 1 to 3 on the left-hand side of the comma. If we do not specify anything on the right-hand side of the comma, this means that we will select all variables in our subset. Otherwise, we can specify the columns to select on the right-hand side of the comma. Similarly to step 3, we can select the second to fourth column by specifying a sequence on the right-hand side of the comma; select columns with a given c(3,5)
index vector. Furthermore, we can select data with a given c("first_name","last_name")
attribute name vector.
In addition to selecting the variables that we require, we can exclude the columns that we do not need with a negative index. Thus, we can place -6
on the right-hand side of the comma to exclude the sixth column from the dataset. We can also use both the !
and in
operators to exclude data with a certain column name. In the seventh case, we can exclude attributes with the name as first_name
and last_name
.
Furthermore, we can filter data similar to SQL operation with a given condition. Here, as we need to use conditions to filter records from the data, we should place the criteria on the left-hand side of the comma. Thus, in cases 8 to 10, we demonstrate that we can filter male employee data with an equality condition, extract salary data ranging between 60,000 and 70,000, and retrieve employees with the first two letters matching Ge with the substr
function. Finally, we can also employ the grep
function along with a regular expression to subset employees with a vowel as the final letter of their first name.
There's more…
Besides using square brackets, we can also use the subset
function to subset data:
- We can select
first_name
andlast_name
of the first three rows of theemployees
data:> subset(employees, rownames(employees) %in% 1:3, select=c("first_name","last_name")) first_name last_name 1 Georgi Facello 2 Bezalel Simmel 3 Parto Bamford
- We can also set the condition to filter employee data by gender:
>subset(employees, employees$gender == 'M') emp_no birth_date first_name last_name gender hire_date 1 10001 1953-09-02 Georgi Facello M 1986-06-26 3 10003 1959-12-03 Parto Bamford M 1986-08-28 4 10004 1954-05-01 Chirstian Koblick M 1986-12-01 5 10005 1955-01-21 Kyoichi Maliniak M 1989-09-12 8 10008 1958-02-19 Saniya Kalloufi M 1994-09-15
- 新編Visual Basic程序設計上機實驗教程
- Unity 2020 Mobile Game Development
- Mastering Natural Language Processing with Python
- Mastering Python Scripting for System Administrators
- Android 7編程入門經典:使用Android Studio 2(第4版)
- Python王者歸來
- The Data Visualization Workshop
- C++ 從入門到項目實踐(超值版)
- Linux命令行與shell腳本編程大全(第4版)
- 用戶體驗增長:數字化·智能化·綠色化
- Java Web開發詳解
- Learning Continuous Integration with TeamCity
- Web性能實戰
- 創意UI:Photoshop玩轉APP設計
- Illustrator CS6設計與應用任務教程