Book title: Hands-On Data Science with R
Authors: Vitor Bianchi Lanzetta, Nataraj Dasgupta, Ricardo Anjoleto Farias
Chapter length: 445 words
Last updated: 2021-06-10 19:12:34

Data types, formats, and sources

The three categorical characteristics of data are as follows:

Data types: The data type generally refers to the type of the values stored in a given column. R supports character, numeric (real or decimal), integer, logical, and complex types. When reading data from CSV files, R automatically tries to determine the type of each column in the file. This might not always work as desired. For instance, a column with the prices of products may include a sign or text indicating the currency (for example, USD/$ and GBP/£), in which case the column is read as text rather than as numbers. Columns with text data may contain Unicode characters (for instance, Cyrillic or Greek letters, or letters with accent marks). Reading data from an external structured data source such as a database may be slightly more precise because dedicated packages such as RODBC can interpret data types across heterogeneous data sources.
Data formats: Datasets can come in a range of different formats: text files such as CSV files and tab-delimited files; binary files such as Excel and SAS datasets; and external data sources such as databases, as explained earlier. Of these, CSV is one of the most portable cross-platform formats for storing data (it is simply the data with the columns separated by commas). Tab-delimited and pipe-delimited files are two other text formats that you may encounter at work. Binary files, such as Excel and SAS datasets, and external data sources represent the second and third types of data formats respectively. R also has its own binary formats, most notably RDS, with which the user can store a single R object natively in R's serialized format (using saveRDS and readRDS). Another option for storing R objects is .RData files, which are generally used to store a collection of objects (using save and load). More recently, newer binary formats have appeared. Feather is one such popular format that has shown impressive read/write I/O performance.
Data sources: Data sources refer to the source systems from which data is retrieved. In a commercial setting, datasets are generally stored either in the cloud or on in-house servers. The datasets can be accessed as web-based downloads or, more commonly, directly from the servers as a shared folder (for example, in Windows). Data vendors transmit data either via FTP or, in the case of sensitive data, on physical hard drives. Wherever the data may be, we need a means to access the dataset in order to use it with our R programs. R has connectors, some built in and others provided by packages, to extract data directly from web-based URLs, from Hadoop-based storage such as HDFS, from databases using database connectors, and much more.
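
To make the points about type detection and native formats concrete, here is a minimal sketch in base R. The CSV content, column names, and values are made up for illustration; they are assumptions, not data from the book. The sketch reads a small CSV whose price column carries a currency sign, shows that R treats that column as text, converts it to numeric, and then round-trips the cleaned data frame through an RDS file.

```r
# Hypothetical CSV content; column names and values are invented for illustration.
csv_text <- "product,price,in_stock
Widget,$10.50,TRUE
Gadget,$7.25,FALSE
Gizmo,$120.00,TRUE"

# Write the text to a temporary file so read.csv() has a real file to parse.
csv_path <- tempfile(fileext = ".csv")
writeLines(csv_text, csv_path)

# R guesses a type for each column while reading.
products <- read.csv(csv_path, stringsAsFactors = FALSE)

# Inspect the guessed types: the currency sign keeps price from being numeric.
print(sapply(products, class))

# Drop everything except digits and the decimal point, then convert.
# This simple pattern assumes plain values like "$10.50" (no thousands
# separators, no negative amounts).
products$price <- as.numeric(gsub("[^0-9.]", "", products$price))

# Store the cleaned data frame in R's native serialized format and read it back.
rds_path <- tempfile(fileext = ".rds")
saveRDS(products, rds_path)
restored <- readRDS(rds_path)

# Column types survive the round trip, unlike with a plain-text format.
print(sapply(restored, class))
```

The file argument of read.csv can also be a complete URL instead of a local path, which is one simple way to pull a web-based download straight into R. Database and HDFS access typically rely on additional packages, so they are not shown here.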
