Converting data types

If we do not specify data types during the import phase, R automatically assigns a type to each column of the imported dataset. However, if an assigned type differs from the actual type of the data, further data manipulation becomes difficult. Thus, data type conversion is an essential step during the preprocessing phase.

Getting ready

Complete the previous recipe and import both employees.csv and salaries.csv into an R session. You must also specify column names for these two datasets to be able to perform the following steps.

How to do it…

Perform the following steps to convert the data type:

  1. First, examine the data type of each attribute using the class function:
    > class(employees$birth_date)
    [1] "factor"
    
  2. You can also examine the types of all attributes at once using the str function:
    > str(employees)
    
    'data.frame': 10 obs. of 6 variables:
     $ emp_no : int 10001 10002 10003 10004 10005 10006 10007 10008 10009 10010
     $ birth_date: Factor w/ 10 levels "1952-04-19","1953-04-20",..: 3 10 8 4 5 2 6 7 1 9
     $ first_name: Factor w/ 10 levels "Anneke","Bezalel",..: 5 2 7 3 6 1 10 8 9 4
     $ last_name : Factor w/ 10 levels "Bamford","Facello",..: 2 9 1 4 5 8 10 3 6 7
     $ gender : Factor w/ 2 levels "F","M": 2 1 2 2 2 1 1 2 1 1
     $ hire_date : Factor w/ 10 levels "1985-02-18","1985-11-21",..: 3 2 4 5 9 7 6 10 1 8
    
  3. Then, you need to convert both birth_date and hire_date to the date format:
    > employees$birth_date <- as.Date(employees$birth_date)
    > employees$hire_date <- as.Date(employees$hire_date)
    
  4. You also need to convert both first_name and last_name into character type:
    > employees$first_name <- as.character(employees$first_name)
    > employees$last_name <- as.character(employees$last_name)
    
  5. Again, you can use str to examine the dataset:
    > str(employees)
    
    'data.frame': 10 obs. of 6 variables:
     $ emp_no : int 10001 10002 10003 10004 10005 10006 10007 10008 10009 10010
     $ birth_date: Date, format: "1953-09-02" ...
     $ first_name: chr "Georgi" "Bezalel" "Parto" "Chirstian" ...
     $ last_name : chr "Facello" "Simmel" "Bamford" "Koblick" ...
     $ gender : Factor w/ 2 levels "F","M": 2 1 2 2 2 1 1 2 1 1
     $ hire_date : Date, format: "1986-06-26" ...
    
  6. Furthermore, you can convert the data type of from_date and to_date to date type within salaries:
    > salaries$from_date <- as.Date(salaries$from_date)
    > salaries$to_date <- as.Date(salaries$to_date)
    

How it works…

In this recipe, we demonstrated how to convert the data type of each attribute within the dataset. Before converting any attribute, you must first examine its current type. To identify the data type of a single selected attribute, you can use the class function. To inspect the data types of all attributes at once, you can use the str function.
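As a small sketch of this inspection step, the following builds a two-row stand-in for the employees data frame and checks its column types. The data frame `emp` and its values are illustrative assumptions, not the recipe's actual file, and `stringsAsFactors = TRUE` is set only to mimic the factor columns shown in this recipe. Using `sapply` with `class` is one possible way to collect the types into a vector; `str` remains the more readable choice for interactive use.

```r
# Hypothetical stand-in for the employees data frame (two rows only)
emp <- data.frame(
  emp_no     = c(10001L, 10002L),
  birth_date = c("1953-09-02", "1964-06-02"),
  gender     = c("M", "F"),
  stringsAsFactors = TRUE   # mimic the factor columns shown in this recipe
)

# class() reports the type of one selected column
class(emp$birth_date)

# sapply() applies class() to every column and returns a named character vector
col_types <- sapply(emp, class)
print(col_types)
```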

From the output of applying the str function to the employees data frame, we can see that both birth_date and hire_date are stored as factors. However, if we need to calculate an employee's age from the birth_date attribute, we must first convert it to the date type, because date arithmetic is not defined for factors. Thus, we change both birth_date and hire_date to the date type using the as.Date function.
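To illustrate why the date type matters, here is a minimal sketch that converts a factor column with as.Date and then derives an approximate age. The data frame `emp`, the reference date `ref_date`, and the division by 365.25 are assumptions made for this illustration; a fixed reference date is used instead of `Sys.Date()` so that the result does not change from day to day.

```r
# Illustrative values; stringsAsFactors = TRUE reproduces a factor column
emp <- data.frame(birth_date = c("1953-09-02", "1952-04-19"),
                  stringsAsFactors = TRUE)

# as.Date() on a factor converts via the level labels (ISO yyyy-mm-dd here)
emp$birth_date <- as.Date(emp$birth_date)
class(emp$birth_date)

# Subtracting dates yields a difftime; dividing the day count by 365.25
# gives an approximate age in years (hypothetical reference date)
ref_date  <- as.Date("2015-01-01")
age_years <- as.numeric(difftime(ref_date, emp$birth_date, units = "days")) / 365.25
floor(age_years)
```

The 365.25 divisor is only an approximation and can be off by one near a birthday; an exact age would compare the month and day components instead.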

Also, because a factor restricts an attribute to a fixed set of values (its levels), we cannot freely add a record to the dataset. A new employee's first name and last name are unlikely to match names already present in the dataset, so we need to convert last_name and first_name to the character type. We can then proceed to append a new record to the employees dataset in the next recipe. Finally, we should also convert from_date and to_date of the salaries dataset to the date type, so that we can perform date calculations in the next recipe.
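The restriction imposed by factor levels can be seen in a short sketch. The vector `first_name` and the new name "Alice" are hypothetical values chosen for illustration; the warning that R emits on the invalid assignment is suppressed so the example runs quietly.

```r
# A factor only accepts values that are already among its levels
first_name <- factor(c("Georgi", "Bezalel", "Parto"))

# Assigning an unseen name stores NA (R also emits a warning, suppressed here)
f <- first_name
suppressWarnings(f[4] <- "Alice")
is.na(f[4])

# After conversion to character, any new value can be stored
ch <- as.character(first_name)
ch[4] <- "Alice"
ch
```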

There's more…

Besides using the as functions (such as as.Date and as.character) to convert data types after import, you can specify the data types during the data import phase. Using the read.csv function as an example, you can specify the data type of each column in the colClasses argument. If you want R to select the data type of a column automatically (for example, to read emp_no as the integer type), simply specify NA at that position within colClasses:

> employees <- read.csv('~/Desktop/employees.csv', colClasses = c(NA, "Date", "character", "character", "factor", "Date"), header = FALSE)
> str(employees)
'data.frame': 10 obs. of 6 variables:
 $ V1: int 10001 10002 10003 10004 10005 10006 10007 10008 10009 10010
 $ V2: Date, format: "1953-09-02" ...
 $ V3: chr "Georgi" "Bezalel" "Parto" "Chirstian" ...
 $ V4: chr "Facello" "Simmel" "Bamford" "Koblick" ...
 $ V5: Factor w/ 2 levels "F","M": 2 1 2 2 2 1 1 2 1 1
 $ V6: Date, format: "1986-06-26" ...

By specifying the colClasses argument, the columns holding emp_no, birth_date, first_name, last_name, gender, and hire_date (V1 to V6 in the preceding output) are read as the integer, date, character, character, factor, and date types, respectively.
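The same idea can be tried without a file on disk. The following sketch passes two illustrative rows through the `text` argument of read.csv; the string `csv_text`, the name `employees_demo`, and the use of `col.names` to label the columns are assumptions for this example rather than part of the recipe.

```r
# Two illustrative rows standing in for employees.csv (no header line)
csv_text <- "10001,1953-09-02,Georgi,Facello,M,1986-06-26
10002,1964-06-02,Bezalel,Simmel,F,1985-11-21"

employees_demo <- read.csv(
  text = csv_text,
  header = FALSE,
  col.names = c("emp_no", "birth_date", "first_name",
                "last_name", "gender", "hire_date"),
  # NA lets R pick the type for emp_no; the rest are set explicitly
  colClasses = c(NA, "Date", "character", "character", "factor", "Date")
)

# Collect the resulting column types into a named character vector
demo_types <- sapply(employees_demo, class)
print(demo_types)
```

Supplying `col.names` together with `colClasses` avoids the generic V1 to V6 names and removes the need for a separate renaming step.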
