Scanning text files

In previous recipes, we introduced how to use read.table and read.csv to load data into an R session. However, read.table and read.csv are most convenient when every row has the same number of columns and the file is small enough to load comfortably as a data frame. For more flexible data processing, this recipe demonstrates how to use the scan function to read data from a file.

Getting ready

To follow this recipe, you need to have completed the previous recipes and have snp500.csv downloaded to the current directory.

How to do it…

Please perform the following steps to scan data from the CSV file:

  1. First, you can use the scan function to read data from snp500.csv:
    > stock_data3 <- scan('snp500.csv', sep=',', what=list(Date = '', Open = 0, High = 0, Low = 0, Close = 0, Volume = 0, Adj_Close = 0), skip=1, fill=TRUE)
    Read 16481 records
    
  2. You can then examine loaded data with mode and str:
    > mode(stock_data3)
    [1] "list"
    > str(stock_data3)
    List of 7
     $ Date : chr [1:16481] "2015-07-02" "2015-07-01" "2015-06-30" "2015-06-29" ...
     $ Open : num [1:16481] 2078 2067 2061 2099 2103 ...
     $ High : num [1:16481] 2085 2083 2074 2099 2109 ...
     $ Low : num [1:16481] 2071 2067 2056 2057 2095 ...
     $ Close : num [1:16481] 2077 2077 2063 2058 2102 ...
     $ Volume : num [1:16481] 3.00e+09 3.73e+09 4.08e+09 3.68e+09 5.03e+09 ...
     $ Adj_Close: num [1:16481] 2077 2077 2063 2058 2102 ...
    

How it works…

Compared with read.csv and read.table, the scan function is more flexible and can be more efficient at reading data. Here, we specify the name and data type of each field as a list in the what parameter. In this case, the first field is of character type and the remaining fields are numeric, so we supply an empty string (a pair of single or double quotes) for the Date column and 0 for the others. Because we need to skip the header row and pad any line that has fewer fields than the list implies with empty fields, we set skip to 1 and fill to TRUE.
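
The roles of what, skip, and fill can be seen on a much smaller file. The following is a minimal sketch, not the recipe's own code: the three-column layout, the values, and the names tmp and demo are made up for illustration, and the second data line is deliberately one field short.

```r
# Write a tiny, made-up CSV to a temporary file (illustrative values only).
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "Date,Open,Close",
  "2015-07-02,2078,2077",
  "2015-07-01,2067"          # short line: the Close field is missing
), tmp)

# what gives each field a name and a type: "" means character, 0 means numeric.
# skip = 1 drops the header row; fill = TRUE pads the short line with an
# empty field, which scan treats as a missing value for a numeric column.
demo <- scan(tmp, sep = ",",
             what = list(Date = "", Open = 0, Close = 0),
             skip = 1, fill = TRUE, quiet = TRUE)

demo$Date
demo$Close
```

Here quiet = TRUE only suppresses the "Read N records" message. The padded field on the short line shows up as NA in demo$Close.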

We can now examine the data with some built-in functions. Here, we use mode to obtain the storage mode of the object and str to display the structure of the data.
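
As a quick illustration of these inspection functions, the sketch below builds a small list by hand with the same shape as a scan result. The name prices and its values are hypothetical, and lengths is an extra base R function not used in the recipe itself.

```r
# A small hand-built list shaped like a scan() result (made-up values).
prices <- list(Date  = c("2015-07-02", "2015-07-01"),
               Close = c(2077, 2077))

mode(prices)     # storage mode of the whole object
str(prices)      # compact summary of each component: type, length, first values
lengths(prices)  # number of elements in each component
```

Checking that every component has the same length is a cheap sanity test before treating the list as tabular data.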

There's more…

On some occasions, the data is laid out in fixed-width columns rather than separated by a delimiter. To specify the width of each column, you can use the read.fwf function:

  1. First, you can use download.file to download weather.op from the author's GitHub page:
    > download.file("https://github.com/ywchiu/rcookbook/raw/master/chapter2/weather.op", "weather.op")
    
  2. You can then examine the data with the file editor:

    Figure 5: Using the file editor to examine the file

  3. Read the data by specifying the width of each column in widths and the column names in col.names, and skip the first row by setting skip to 1:
    > weather <- read.fwf("weather.op", widths = c(6,6,10,11,9,8), col.names = c("STN","WBAN","YEARMODA","TEMP","MAX","MIN"), skip=1)
    
  4. Lastly, you can examine the data using the head and names functions:
    > head(weather)
       STN  WBAN YEARMODA    TEMP    MAX   MIN
    1 8403 99999 20140101 85.8 24 102.7* 69.3*
    2 8403 99999 20140102 86.3 24 102.9* 71.1*
    3 8403 99999 20140103 85.9 24 101.1* 72.0*
    4 8403 99999 20140104 85.6 24 102.7* 70.5*
    5 8403 99999 20140105 84.8 23 102.0* 66.6*
    6 8403 99999 20140106 86.8 23 102.0* 70.9*
    
    > names(weather)
    [1] "STN" "WBAN" "YEARMODA" "TEMP" "MAX" 
    [6] "MIN" 
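
To experiment with read.fwf without downloading anything, the following sketch writes a small fixed-width file to a temporary location and reads it back. The three-column layout, the widths, and the names tmp and fw are assumptions made for this illustration; they do not describe the real layout of weather.op.

```r
# Write a tiny fixed-width file: columns occupy 6, 6, and 4 characters.
tmp <- tempfile()
writeLines(c(
  "STN   WBAN  TEMP",
  "8403  99999 85.8",
  "8403  99999 86.3"
), tmp)

# widths gives the number of characters per column, col.names supplies the
# column names, and skip = 1 drops the header line in the file.
fw <- read.fwf(tmp, widths = c(6, 6, 4),
               col.names = c("STN", "WBAN", "TEMP"),
               skip = 1)

names(fw)
str(fw)
```

If a column boundary is placed in the wrong position, neighbouring values end up merged into one column, so it is worth checking the result with str before using it.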
    