- Learning Shiny
- Hernán G. Resnizky
Reading data
The formats and structures in which data comes can be varied. However, thanks to R's community contributions and extensibility, there is a package to load data into R for almost every data structure (at least the standard ones). To do this, it is always necessary to use functions whose arguments vary according to the nature of the data.
Delimited data
All the delimited formats in R use the same base function, read.table(). This function takes many arguments, but most of them have default values. The following is a list of the most important ones:
- header: If it is set to TRUE (or T), the first row is used to assign the column names of the data frame.
- nrows: The number of rows to be read. If it is set to -1, all rows are read.
- skip: The number of rows to skip before reading starts.
- encoding: The encoding to use in case the data source contains non-ASCII characters (for example, words in languages other than English).
Note
For information about the rest of the arguments, visit https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html.
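These arguments can be illustrated with a small self-contained sketch that writes a throwaway delimited file to a temporary path (the file contents and column names are made up for the example):

```r
# Write a small comma-delimited file to a temporary path (illustrative data)
tmp <- tempfile(fileext = ".csv")
writeLines(c("# a comment line to be skipped",
             "id,value",
             "1,10",
             "2,20",
             "3,30"), tmp)

# skip drops the first line, header takes the next line as column names,
# and nrows limits reading to the first two data rows
data <- read.table(tmp, sep = ",", header = TRUE, skip = 1, nrows = 2)
print(data)
```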
The only argument without a default value that this function and its derivatives (read.csv(), read.delim(), and so on) take is file, the path to the input data file. The path can be local or a URL. However, with read.table() it is usually useful (and safer) to specify the delimiter explicitly, as it uses whitespace by default:
#URL to Iris Dataset
path <- "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
#Load dataset with generic read.table()
data <- read.table(path, sep = ",")
The output of this function is a data.frame, as shown here:
> class(data)
[1] "data.frame"
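Note that the iris file has no header row, so the columns receive R's default names (V1 to V5) and can be renamed afterwards. A minimal sketch using a two-row local copy of the data (the tempfile path and the chosen column names are illustrative):

```r
# Two rows in the same comma-delimited layout as the iris file
tmp <- tempfile(fileext = ".data")
writeLines(c("5.1,3.5,1.4,0.2,Iris-setosa",
             "4.9,3.0,1.4,0.2,Iris-setosa"), tmp)

data <- read.table(tmp, sep = ",")
print(names(data))  # default names: "V1" "V2" "V3" "V4" "V5"

# Assign meaningful column names
names(data) <- c("Sepal.Length", "Sepal.Width",
                 "Petal.Length", "Petal.Width", "Species")
```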
Reading line by line
The function to read text line by line is readLines(). As with read.table(), the only required argument is the file path or a connection object (connection objects will not be covered here; for further information, visit https://stat.ethz.ch/R-manual/R-devel/library/base/html/connections.html). The readLines() function reads the file as a string and splits it at newline characters (\n), as follows:
#URL to Iris Dataset
path <- "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
#Load dataset with readLines()
data <- readLines(path)
The output of readLines() is a character vector whose elements correspond to the lines of the file read, as shown here:
> class(data)
[1] "character"
> length(data)
[1] 151
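The same behavior can be seen without a network connection by round-tripping a few lines through a temporary file (the file contents are made up):

```r
# Write three lines to a temporary file and read them back
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), tmp)

lines <- readLines(tmp)
print(length(lines))  # one element per line: 3
print(lines[2])       # "second line"
```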
Reading a character set
The function to read characters is readChar(). In this case, not only the file path or a connection object is needed, but also the number of characters to read (the nchars argument). If nchars is greater than the total number of characters in the file, reading stops at the end of the file, as follows:
#URL to Iris Dataset
path <- "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
#Load dataset with readChar()
data <- readChar(path, nchars = 1e5)
The output of readChar() is a single character object, that is, a character vector of length 1, as the following code shows:
> class(data)
[1] "character"
> length(data)
[1] 1
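A small offline sketch of both aspects — reading a fixed number of characters, and asking for more than the file holds (the string written to the temporary file is arbitrary):

```r
# Write a short string to a temporary file
tmp <- tempfile(fileext = ".txt")
writeLines("abcdef", tmp)  # writeLines appends a trailing newline

# Read only the first three characters
first3 <- readChar(tmp, nchars = 3)
print(first3)  # "abc"

# Asking for more characters than the file holds stops at the end
all.chars <- readChar(tmp, nchars = 1e5)
print(nchar(all.chars))  # 7: "abcdef" plus the trailing newline
```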
Reading JSON
JSON is an acronym that stands for JavaScript Object Notation and is a semi-structured data storage format, which will be discussed later in this book. As functions to read JSON do not come in the default packages, installing new packages is required. The packages commonly used for this purpose are RJSONIO and rjson. Although both packages return similar objects, the main difference between them is that the first one can load data from connections directly, while the second one needs an intermediate step to load the data into R.
Here's an example with RJSONIO:
#Load RJSONIO
library(RJSONIO)
#URL Public API Worldbank Data Catalog in JSON format
url <- "http://api.worldbank.org/v2/datacatalog?format=json"
#Read data directly from url
json <- fromJSON(url)
And the equivalent with rjson, which needs readChar() as an intermediate step:
#Load rjson
library(rjson)
#URL Public API Worldbank Data Catalog in JSON format
url <- "http://api.worldbank.org/v2/datacatalog?format=json"
#Read data with readChar
raw.json <- readChar(url, nchars = 1e6)
#Format into JSON
json <- fromJSON(raw.json)
As both packages share the same function names, the package loaded last overrides the functions of the other one. In this case, for instance, if rjson is loaded after RJSONIO, fromJSON() will work as defined in rjson and not in RJSONIO. In such cases, you will receive this message:
library(RJSONIO)
library(rjson)
##
## Attaching package: 'rjson'
##
## The following objects are masked from 'package:RJSONIO':
##
##     fromJSON, toJSON
The output in both cases is a list.
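Both functions also accept a JSON string directly, which allows a network-free illustration (the sample JSON below is made up for the example):

```r
library(rjson)  # assumes the rjson package is installed

# A made-up JSON string parsed into an R list
raw.json <- '{"name": "iris", "rows": 150, "tabular": true}'
obj <- fromJSON(raw.json)

print(class(obj))  # "list"
print(obj$rows)    # 150
```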
Reading XML
XML stands for Extensible Markup Language and is another semi-structured data storage format. Although it has lately been displaced by JSON, XML is still frequently found, for example, in feeds. To read XML files, the XML package is recommended. This package has a large number of functions. The following is an example of how to load XML data into R:
#Load XML library
library(XML)
#URL Public API Worldbank Data Catalog in XML format
url <- "http://api.worldbank.org/v2/datacatalog?format=xml"
#Load XML document
xml.obj <- xmlTreeParse(url)
The object returned is of the XMLDocument class:
> class(xml.obj)
[1] "XMLDocument"         "XMLAbstractDocument"
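xmlTreeParse() can also parse XML passed as a string via asText = TRUE, which makes for a self-contained sketch (the XML fragment below is made up):

```r
library(XML)  # assumes the XML package is installed

# Parse XML from a string instead of a URL
xml.txt <- "<catalog><dataset id='1'>iris</dataset></catalog>"
xml.obj <- xmlTreeParse(xml.txt, asText = TRUE)

# Navigate the parsed tree
root <- xmlRoot(xml.obj)
print(xmlName(root))        # "catalog"
print(xmlValue(root[[1]]))  # "iris"
```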
Reading databases – SQL
The packages used to interface with relational databases are RODBC for ODBC connectivity and RJDBC for JDBC connectivity. A concrete example cannot be given here, as it would require access to a running database. In order to use and understand the capabilities of these packages in depth, prior knowledge of ODBC/JDBC is required.
The documentation is available at http://cran.r-project.org/web/packages/RODBC/RODBC.pdf and http://cran.r-project.org/web/packages/RJDBC/RJDBC.pdf.
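As a rough, non-runnable sketch of the RODBC workflow (the DSN name, credentials, and table name below are placeholders, and the code assumes an existing ODBC data source is configured on the machine):

```r
library(RODBC)  # assumes RODBC is installed and an ODBC driver is configured

# "my_dsn", the credentials, and the table name are all placeholders
ch <- odbcConnect("my_dsn", uid = "user", pwd = "password")
# Run an SQL query against the connected database
data <- sqlQuery(ch, "SELECT * FROM some_table")
# Close the channel when done
odbcClose(ch)
```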
Reading data from external sources
For almost every tabular data file format, there is a package to import it into R. Going further into these is out of the scope of this book. The most important ones are xlsx (for Excel files), Hmisc (for SAS and SPSS portable files), and foreign (for SAS, SPSS, Stata, Octave, and Weka files, among others). However, when possible, it is always preferable to convert any of these files to a standard text format, such as .csv, to avoid unexpected (and sometimes very difficult to solve) problems.
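The round trip through .csv that this advice recommends can be sketched with base R alone, using the built-in iris data frame as a stand-in for a file imported from one of these formats:

```r
# Write a data frame out as .csv and read it back
tmp <- tempfile(fileext = ".csv")
write.csv(head(iris, 3), tmp, row.names = FALSE)

back <- read.csv(tmp)
print(nrow(back))   # 3
print(names(back))  # the original column names survive the round trip
```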