- R for Data Science Cookbook
- Yu Wei Chiu (David Chiu)
- 2021-07-14 10:51:28
Converting data types
If we do not specify data types during the import phase, R automatically assigns a type to each column of the imported dataset. However, if an assigned type differs from the intended one, further data manipulation becomes difficult. Data type conversion is therefore an essential step in the preprocessing phase.
Getting ready
Complete the previous recipe and import both employees.csv and salaries.csv into an R session. You must also specify column names for these two datasets to be able to perform the following steps.
How to do it…
Perform the following steps to convert the data type:
- First, examine the data type of each attribute using the class function:
    > class(employees$birth_date)
    [1] "factor"
- You can also examine the types of all attributes using the str function:
    > str(employees)
    'data.frame': 10 obs. of 6 variables:
     $ emp_no    : int 10001 10002 10003 10004 10005 10006 10007 10008 10009 10010
     $ birth_date: Factor w/ 10 levels "1952-04-19","1953-04-20",..: 3 10 8 4 5 2 6 7 1 9
     $ first_name: Factor w/ 10 levels "Anneke","Bezalel",..: 5 2 7 3 6 1 10 8 9 4
     $ last_name : Factor w/ 10 levels "Bamford","Facello",..: 2 9 1 4 5 8 10 3 6 7
     $ gender    : Factor w/ 2 levels "F","M": 2 1 2 2 2 1 1 2 1 1
     $ hire_date : Factor w/ 10 levels "1985-02-18","1985-11-21",..: 3 2 4 5 9 7 6 10 1 8
- Then, you need to convert both birth_date and hire_date to the date format:
    > employees$birth_date <- as.Date(employees$birth_date)
    > employees$hire_date <- as.Date(employees$hire_date)
- You also need to convert both first_name and last_name into character type:
    > employees$first_name <- as.character(employees$first_name)
    > employees$last_name <- as.character(employees$last_name)
- Again, you can use str to examine the dataset:
    > str(employees)
    'data.frame': 10 obs. of 6 variables:
     $ emp_no    : int 10001 10002 10003 10004 10005 10006 10007 10008 10009 10010
     $ birth_date: Date, format: "1953-09-02" ...
     $ first_name: chr "Georgi" "Bezalel" "Parto" "Chirstian" ...
     $ last_name : chr "Facello" "Simmel" "Bamford" "Koblick" ...
     $ gender    : Factor w/ 2 levels "F","M": 2 1 2 2 2 1 1 2 1 1
     $ hire_date : Date, format: "1986-06-26" ...
- Furthermore, you can convert from_date and to_date within salaries to date type:
    > salaries$from_date <- as.Date(salaries$from_date)
    > salaries$to_date <- as.Date(salaries$to_date)
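The conversion steps above can be tried without the CSV files. The following is a minimal sketch that assumes a two-row stand-in for the employees data frame with illustrative values. The names date_cols and name_cols are hypothetical helpers, and stringsAsFactors = TRUE is set explicitly because, since R 4.0.0, data.frame() and read.csv() no longer create factors by default.

```r
# Illustrative two-row stand-in for the employees data frame (not the real file)
employees <- data.frame(
  emp_no     = c(10001L, 10002L),
  birth_date = c("1953-09-02", "1964-06-02"),
  first_name = c("Georgi", "Bezalel"),
  last_name  = c("Facello", "Simmel"),
  gender     = c("M", "F"),
  hire_date  = c("1986-06-26", "1985-11-21"),
  stringsAsFactors = TRUE  # reproduce the factor columns shown in the recipe
)

# Hypothetical helper vectors naming the columns to convert
date_cols <- c("birth_date", "hire_date")
name_cols <- c("first_name", "last_name")

# Convert several columns at once instead of one assignment per column
employees[date_cols] <- lapply(employees[date_cols], as.Date)
employees[name_cols] <- lapply(employees[name_cols], as.character)

# Inspect the result: Date, chr, and an untouched factor for gender
str(employees)
```

Using lapply over a column subset is only a convenience; it performs the same as.Date and as.character calls as the individual steps.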
How it works…
In this recipe, we demonstrated how to convert the data type of each attribute within the dataset. Before converting any attribute, you must first examine its current type. To identify the data type of a single selected attribute, you can use the class function. To inspect the types of all attributes at once, you can use the str function.
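As a rough illustration of the difference between the two inspection functions, the sketch below uses a small data frame with illustrative values rather than the real employees file. The sapply call is an addition not used in the recipe; it is one common way to get the class of every column in a single vector.

```r
# Illustrative stand-in data; stringsAsFactors = TRUE mimics the recipe's factors
employees <- data.frame(
  emp_no     = c(10001L, 10002L, 10003L),
  birth_date = c("1953-09-02", "1964-06-02", "1959-12-03"),
  gender     = c("M", "F", "M"),
  stringsAsFactors = TRUE
)

# class() reports the type of one selected attribute
single_type <- class(employees$birth_date)
print(single_type)

# sapply() applies class() to every column and returns a named vector
col_types <- sapply(employees, class)
print(col_types)

# str() shows each column's type together with a preview of its values
str(employees)
```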
From the output of applying the str function to the employees data frame, we can see that both birth_date and hire_date are of factor type. However, if we need to calculate a person's age from the birth_date attribute, we need to convert it to date format. Thus, we convert both birth_date and hire_date to date format using the as.Date function.
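To make the age calculation concrete, here is a small sketch. The birth dates are illustrative, and ref_date is a hypothetical fixed reference date chosen so that the result is reproducible; in practice Sys.Date() could be used instead. Dividing by 365.25 is an approximation, not an exact calendar-aware age.

```r
# Illustrative birth dates stored as a factor, as in the imported data
birth_date <- factor(c("1953-09-02", "1964-06-02"))

# Convert to Date so that date arithmetic becomes possible
birth_date <- as.Date(birth_date)

# Hypothetical fixed reference date (assumption, for reproducibility)
ref_date <- as.Date("2021-07-14")

# Difference in days, then an approximate age in whole years
age_days  <- as.numeric(difftime(ref_date, birth_date, units = "days"))
age_years <- floor(age_days / 365.25)
print(age_years)
```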
Also, because the factor type restricts an attribute to a fixed set of values (its levels), we cannot freely add a record to the dataset. A new employee's first and last name are unlikely to match names already in the dataset, so we need to convert last_name and first_name to the character type. We can then proceed to append a new record to the employees dataset in the next recipe. Finally, we should also convert from_date and to_date of the salaries dataset to date type, so that we can perform date calculations in the next recipe.
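The restriction that a factor imposes can be seen in a short experiment. This is a sketch with illustrative names; as_factor and as_chr are hypothetical variable names used only to compare the two types.

```r
# Illustrative last names stored as a factor with three levels
last_name <- factor(c("Facello", "Simmel", "Bamford"))

# Assigning a value that is not an existing level produces NA.
# R also emits an "invalid factor level" warning, suppressed here.
as_factor <- last_name
suppressWarnings({
  as_factor[4] <- "Koblick"
})
print(as_factor)

# After conversion to character, any new value can be stored
as_chr <- as.character(last_name)
as_chr[4] <- "Koblick"
print(as_chr)
```

An alternative would be to extend the factor's levels before assigning, but converting name columns to character, as the recipe does, avoids that bookkeeping.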
There's more…
Besides using an as.* function such as as.Date or as.character to convert data types, you can specify the data type during the data import phase. Using the read.csv function as an example, you can specify the data types in the colClasses argument. If you want R to select the data type of a column automatically (for example, to read emp_no as integer type), simply specify NA for that column within colClasses:
    > employees <- read.csv('~/Desktop/employees.csv',
    +     colClasses = c(NA, "Date", "character", "character", "factor", "Date"),
    +     header = FALSE)
    > str(employees)
    'data.frame': 10 obs. of 6 variables:
     $ V1: int 10001 10002 10003 10004 10005 10006 10007 10008 10009 10010
     $ V2: Date, format: "1953-09-02" ...
     $ V3: chr "Georgi" "Bezalel" "Parto" "Chirstian" ...
     $ V4: chr "Facello" "Simmel" "Bamford" "Koblick" ...
     $ V5: Factor w/ 2 levels "F","M": 2 1 2 2 2 1 1 2 1 1
     $ V6: Date, format: "1986-06-26" ...
By specifying the colClasses argument, emp_no, birth_date, first_name, last_name, gender, and hire_date will be read as integer, date, character, character, factor, and date types, respectively.
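The same idea can be tried without a file on disk. The sketch below assumes two illustrative header-less rows passed through the text argument of read.csv, and it additionally supplies column names via col.names (an addition to the call shown above) so that the columns are not left as V1 to V6.

```r
# Illustrative header-less CSV content (assumption: same column order as employees.csv)
csv_text <- "10001,1953-09-02,Georgi,Facello,M,1986-06-26
10002,1964-06-02,Bezalel,Simmel,F,1985-11-21"

employees <- read.csv(
  text       = csv_text,   # read from a string instead of a file path
  header     = FALSE,      # the data has no header row
  col.names  = c("emp_no", "birth_date", "first_name",
                 "last_name", "gender", "hire_date"),
  # NA lets R choose the type; the rest are set explicitly
  colClasses = c(NA, "Date", "character", "character", "factor", "Date")
)

str(employees)
```

Setting types at import time removes the need for the separate as.Date and as.character steps, at the cost of having to know the column order in advance.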