Data types, formats, and sources

The three categorical characteristics of data are as follows:

  • Data types: A data type is the type of the values stored in a given column. R supports character, numeric (real or decimal), integer, logical, and complex types. When reading data from CSV files, R tries to infer the type of each column automatically, which does not always work as desired. For instance, a column of product prices may include a symbol or text naming the currency (for example, USD/$ or GBP/£), so the column is read as text rather than as numbers. Columns with text data may contain Unicode characters (for instance, Cyrillic or Greek) with accent marks. Reading data from an external structured data source such as a database may be slightly more precise, because dedicated packages such as RODBC can interpret data types across heterogeneous data sources.
  • Data formats: Datasets come in a range of formats: text files such as CSV and tab-delimited files; binary files such as Excel and SAS datasets; and external data sources such as databases, as explained earlier. Of these, CSV is one of the most portable cross-platform formats for storing data (each row is simply the column values separated by commas). Tab-delimited and pipe-delimited files are two other text formats that you may encounter at work. Binary files, such as Excel and SAS datasets, and external data sources represent the second and third kinds of data format respectively. R also has its own binary formats, most notably RDS, which stores a single R object natively in R's serialized format (using saveRDS and readRDS). Another option is .RData files, which are generally used to store a collection of objects (using save and load). More recently, newer binary formats have appeared. Feather is one such popular format that has shown impressive read/write I/O performance.
  • Data sources: A data source is the source system from which data is retrieved. In a commercial setting, datasets are generally stored either in the cloud or on in-house servers. They can be accessed as web-based downloads or, more commonly, directly from the servers through a shared folder (for example, in Windows). Data vendors transmit data either via FTP or, in the case of sensitive data, on physical hard drives. Wherever the data resides, we need a means of accessing it in order to use it in our R programs. R and its packages provide connectors to extract data directly from web-based URLs, from Hadoop-based storage such as HDFS, from databases using database connectors, and much more.
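
The points above about type inference and R's native RDS format can be illustrated with a minimal sketch in base R. The sample data, the column names (product, price, in_stock), and the temporary file paths are assumptions made up for this illustration, and stripping every character except digits and the decimal point is only one simple way to clean a currency column.

```r
# Hypothetical sample data: the price column carries a currency symbol.
csv_text <- c(
  "product,price,in_stock",
  "widget,$10.50,TRUE",
  "gadget,$3.25,FALSE"
)

# Write the sample to a temporary CSV file so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(csv_text, csv_path)

# Let R infer the column types while reading the file.
products <- read.csv(csv_path, stringsAsFactors = FALSE)

# Record the inferred class of each column; price comes back as character
# because of the "$" sign, while in_stock is recognized as logical.
guessed_types <- sapply(products, class)
print(guessed_types)

# Drop everything except digits and the decimal point, then convert to numeric.
products$price <- as.numeric(gsub("[^0-9.]", "", products$price))

# Store the cleaned data frame in R's native RDS format and read it back.
rds_path <- tempfile(fileext = ".rds")
saveRDS(products, rds_path)
restored <- readRDS(rds_path)

# The restored object keeps the column types chosen above.
str(restored)
```

Unlike a CSV file, the RDS file preserves the column types, so the price column does not need to be cleaned again after readRDS. For tab-delimited or pipe-delimited files, read.delim or read.table with a suitable sep argument plays the same role as read.csv.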
