Data frames

The data frame is one of the main data structures in R. You can think of a data frame as a table of data, with rows and columns. Unlike a vector or a matrix, whose elements must all share one type, a data frame can hold a different type of data in each column. In R, we use the data.frame() command to create a data frame.

The data frame is extremely flexible for working with structured data, and it can ingest data from many different sources. There are two main ways to ingest data into data frames. The first is to use one of the many data connectors, which connect to data sources such as databases. The second is the read.table() command, which reads delimited text into a data frame.
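As a minimal sketch of the read.table() route, the following reads a small block of comma-separated text into a data frame. The inline text and the variable names raw_text and life are made up for illustration; in practice the first argument would usually be a file path.

```r
# Hypothetical inline data standing in for an external file.
raw_text <- "Year,Country,LifeExpectancy
2013,Arab World,71
2013,Central Europe,76"

# read.table() parses delimited text into a data frame.
# header = TRUE treats the first line as column names;
# sep = "," splits fields on commas.
life <- read.table(text = raw_text, header = TRUE, sep = ",",
                   stringsAsFactors = FALSE)

# Show the column names and the type R chose for each column.
str(life)
```

Setting stringsAsFactors = FALSE keeps the Country column as plain text regardless of the R version in use.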

Data Frame Structure

Here is an example of a populated data frame. There are three columns and two rows. The top of the data frame is the header. Each row holds one record, starting with the row name and followed by the data itself. Each data member of a row is called a cell.

Example Data Frame Structure

In R, we can create data frames by accessing external data, or we can create our own data frames by assigning data to a variable. Let's set up our own example data frame, populated with data:

# Build a small data frame; each argument becomes a named column
df <- data.frame(
  Year = c(2013, 2013, 2013),
  Country = c("Arab World", "Caribbean States", "Central Europe"),
  LifeExpectancy = c(71, 72, 76)
)

As always, we should print at least some of the data frame so we can double-check that it was created correctly. The data frame was assigned to the df variable, so we can print its contents by simply typing the variable name at the command prompt:

Variable printout to the R Console
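Besides typing the variable name, a few base R functions give a quick check of the shape and contents of a data frame. The following is a minimal sketch that rebuilds the example data frame so that it runs on its own:

```r
# Rebuild the example data frame so this block is self-contained.
df <- data.frame(
  Year = c(2013, 2013, 2013),
  Country = c("Arab World", "Caribbean States", "Central Europe"),
  LifeExpectancy = c(71, 72, 76)
)

print(df)        # equivalent to typing df at the prompt
dim(df)          # number of rows, then number of columns
names(df)        # the column names from the header
head(df, n = 2)  # only the first two rows
```

For a large data frame, head() is usually the more practical check, because printing the whole variable can flood the console.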

To obtain the data held in a cell, we enter the row and column coordinates of the cell, separated by a comma, inside square brackets ([]). In this example, if we wanted to obtain the value of the second cell in the second row, which is in the Country column, then we would use the following:

df[2, "Country"]
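The same cell can be reached in several equivalent ways, and leaving one coordinate empty selects a whole row or column. A minimal sketch, rebuilding the example data frame so that it runs on its own:

```r
# Rebuild the example data frame so this block is self-contained.
df <- data.frame(
  Year = c(2013, 2013, 2013),
  Country = c("Arab World", "Caribbean States", "Central Europe"),
  LifeExpectancy = c(71, 72, 76),
  stringsAsFactors = FALSE
)

df[2, "Country"]     # row 2, column selected by name
df[2, 2]             # the same cell, column selected by position
df$Country[2]        # extract the column first, then its second element
df[["Country"]][2]   # as above, using double brackets

df[2, ]                  # empty column index: the whole second row
df[, "LifeExpectancy"]   # empty row index: the whole column as a vector
```

Selecting columns by name is generally more robust than selecting by position, since it keeps working if the column order changes.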

We can also conduct summary statistics on our data frame. For example, if we use the following command:

summary(df)

Then we obtain the summary statistics of the data. The example output is as follows:

Summary Statistics printout to the R Console

You'll notice that the summary command has summarized each column differently. It has treated Year as a number, and produced the Min, Quartiles, Mean, and Max for the year, which is not meaningful for a year. The Country column has simply been listed, because it does not contain any numeric values. LifeExpectancy is summarized correctly.

We can change the Year column to a factor, using the following command:

df$Year <- as.factor(df$Year)

Then, we can rerun the summary command:

summary(df)

On this occasion, the summary command returns the results that we expect:

Variable printout to the R Console
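To see what the conversion actually changed, we can inspect the column before and after. A minimal sketch, rebuilding the example data frame so that it runs on its own; the variable year_numeric is introduced only for illustration:

```r
# Rebuild the example data frame so this block is self-contained.
df <- data.frame(
  Year = c(2013, 2013, 2013),
  Country = c("Arab World", "Caribbean States", "Central Europe"),
  LifeExpectancy = c(71, 72, 76)
)

class(df$Year)                 # "numeric" before the conversion
df$Year <- as.factor(df$Year)
class(df$Year)                 # "factor" after the conversion
levels(df$Year)                # the distinct categories
table(df$Year)                 # how many rows fall in each category

# Converting back goes through character, so that we recover the
# year values rather than the internal level codes.
year_numeric <- as.numeric(as.character(df$Year))
```

Once Year is a factor, summary(df) counts the rows per year instead of averaging the years.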

As we proceed throughout this book, we will be building on more useful features that will help us to analyze data using data structures, and visualize the data in interesting ways using R.

When we consume data from online data sources, it's worth double-checking the data types in the source data. The summary(df) command is a quick way to do this, because it shows how R has interpreted each column.
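Alongside summary(), base R offers more direct ways to list the type of each column. A minimal sketch, rebuilding the example data frame so that it runs on its own; the variable column_classes is introduced only for illustration:

```r
# Rebuild the example data frame so this block is self-contained.
df <- data.frame(
  Year = c(2013, 2013, 2013),
  Country = c("Arab World", "Caribbean States", "Central Europe"),
  LifeExpectancy = c(71, 72, 76),
  stringsAsFactors = FALSE
)

# Compact overview: one line per column with its type and first values.
str(df)

# The class of every column, as a named character vector.
column_classes <- sapply(df, class)
print(column_classes)

# Fail early if a column that must be numeric was read as something else.
stopifnot(is.numeric(df$LifeExpectancy))
```

A check such as the final stopifnot() line is one way to catch a numeric column that was silently read as text.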

We can retrieve data in Tableau, using commands that we have used so far in this chapter. Firstly, however, we need to make sure that Rserve is installed and running. Let's install the package first, with the command:

install.packages("Rserve")

Once the command has executed, we need to load the package so we can use it throughout the script:

library(Rserve)

Next, we can start the Rserve service with the following command:

Rserve()
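Note that install.packages() installs the package every time it is run. If we only want to check whether Rserve is already present, one possible approach is the base R function requireNamespace(). This is a minimal sketch; the variable rserve_installed is introduced only for illustration:

```r
# requireNamespace() returns TRUE if the package can be loaded and
# FALSE otherwise, without attaching it to the search path.
rserve_installed <- requireNamespace("Rserve", quietly = TRUE)

if (rserve_installed) {
  message("Rserve is installed; library(Rserve) and Rserve() can be run next.")
} else {
  message("Rserve is not installed; run install.packages(\"Rserve\") first.")
}
```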

In this example, however, we are simply going to work with the CSV file that contains the data. To do this, let's open up a new Tableau workbook, and we will choose Excel as our format.

Now, let's connect live to the Excel data source. When we connect to the data in Tableau, we can see the interface here:

[Figure: the Tableau data connection interface]

As a piece of terminology, note that R talks about variables, whereas in Tableau we talk about dimensions. When we use the dimension Year as a string, plus the Value, we get horizontal bars.

[Figure: horizontal bars of Value by Year]

We can start to add in the country, which appears as follows:

[Figure: the chart with Country added]

However, this doesn't really give a sense of the changes over time, which is our preferred end result. To achieve this objective, let's look at the box-and-whisker plot.

[Figure: box-and-whisker plot]

Here, it's easier to see that the fertility rate has been declining over time. Let's focus on just a few countries – Rwanda, Norway, and the United States.

[Figure: focusing on Rwanda, Norway, and the United States]

We can filter our selection down to the countries that we are most interested in. Now, we can see patterns in the data more clearly.

[Figure: box-and-whisker plot filtered to the selected countries]

A few simple changes have helped to illuminate the data:

We can see that the USA and Norway track one another very closely. Rwanda, on the other hand, has the highest birth rate, which falls over the years. The tops of the box-and-whisker plots have been changed to show a line, in order to emphasise how this metric has changed over time.

What do the box-and-whisker plot lines actually mean? Each line tells us something about how the values are spread between the minimum and the maximum. Here is an example diagram:

[Figure: example box-and-whisker diagram]

Rwanda is at the upper whisker, meaning that it is the maximum. The first and third quartiles are given, along with the median.
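The five values that a box-and-whisker plot draws can be computed directly in R. This is a minimal sketch using made-up numbers, not values from the dataset used in this chapter; the names rates and five_numbers are introduced only for illustration. It assumes that the whiskers run to the minimum and maximum, as described above; depending on how a plot is configured, whiskers may instead stop short of extreme values.

```r
# Hypothetical values, for illustration only.
rates <- c(1.8, 1.9, 2.0, 2.1, 2.5, 3.9, 4.6, 5.8)

# Minimum, first quartile, median, third quartile, and maximum.
five_numbers <- quantile(rates, probs = c(0, 0.25, 0.5, 0.75, 1))
print(five_numbers)

# The box spans the first to third quartile; its height is the
# interquartile range.
IQR(rates)
```

The box itself covers the middle half of the values, so a tall box indicates that the values in that group vary widely.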

The tooltip gives the viewer additional details. It is provided 'on demand', when the user hovers over that part of the chart.

To summarise, we have seen how R and Tableau can be used together in order to display data better. Generally speaking, it is better to change the data closer to the source rather than leaving it until the front end. The reason for this is that you change the data only once, and the change then propagates through to every data source and worksheet that uses it. You do not need to repeat the change each time.
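As a minimal sketch of changing the data close to the source, the following tidies a text column once in R and writes the result to a CSV file that a front-end tool could then read. The untidy values, the temporary file, and the names raw, out_file, and cleaned are all assumptions made for illustration.

```r
# Hypothetical raw data with stray spaces around the country names.
raw <- data.frame(
  Year = c(2013, 2013),
  Country = c(" Arab World", "Central Europe "),
  LifeExpectancy = c(71, 76),
  stringsAsFactors = FALSE
)

# Fix the problem once, at the source.
raw$Country <- trimws(raw$Country)

# Write the cleaned data to a file; a temporary file is used here.
out_file <- tempfile(fileext = ".csv")
write.csv(raw, out_file, row.names = FALSE)

# Read it back to confirm that the cleaned values were saved.
cleaned <- read.csv(out_file, stringsAsFactors = FALSE)
print(cleaned)
```

Every worksheet that connects to the cleaned file then sees the corrected country names, without the fix being repeated in each one.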

Now that we have seen a simple example of how R and Tableau can work together, let's look at more complex R programming constructs.
