- Predictive Analytics Using Rattle and Qlik Sense
- Ferran Garcia Pagans
- 2021-07-16 13:40:18
Datasets, observations, and variables
A dataset is a collection of data that we use to create predictions. There are different kinds of datasets, but for predictive analytics we can think of a dataset as a table with rows and columns.
In a real-life problem, our dataset would be related to the problem we want to solve. If we want to predict which customer is most likely to buy a product, our dataset would probably contain customer and historic sales data. When we're learning, we need to find a dataset that suits our learning purposes. You can find many example datasets on the Internet; in this chapter and in the following one, we're going to use the Titanic passenger list, a dataset taken from Kaggle.
Note
Kaggle is the world's largest community of data scientists. On this website, you can even find data science competitions. We're not going to use the term data science in this book, because there are a lot of new terms around analytics and we want to focus on just a few to avoid noise. Currently, this term is used to refer to an engineering area dedicated to collecting, cleaning, and manipulating data to discover new knowledge. On www.kaggle.com, you can find different types of competitions; there are introductory competitions for beginners and competitions with monetary prizes. You can access a competition, download the data and the problem description, and create your own solutions. An example of an introductory Kaggle competition is Titanic: Machine Learning from Disaster. You can download this dataset at https://www.kaggle.com/c/titanic-gettingStarted. We're going to use this dataset in this chapter and in Chapter 3, Exploring and Understanding Your Data.
A dataset is a matrix where each row is an observation or member of the dataset. In the Titanic passenger list, each observation contains the data related to a passenger. In a dataset, each column is a particular variable. In the passenger list, the column Sex is a variable. You can see a part of the Titanic passenger list in the following screenshot:

Before we start, we need to understand our dataset. When we download a dataset from the Web, it usually has a variable description document.
The following is the variable description for our dataset:
- Survived: If the passenger survived, the value of this variable is set to 1, and if the passenger did not survive, it is set to 0.
- Pclass: This stands for the class the passenger was travelling in. This variable can have three possible values: 1, 2, and 3 (1 = first class; 2 = second class; 3 = third class).
- Name: This variable holds the name of the passenger.
- Sex: This variable has two possible values: male or female.
- Age: This variable holds the age of the passenger.
- SibSp: This holds the number of siblings/spouses aboard.
- Parch: This holds the number of parents/children aboard.
- Ticket: This holds the ticket number.
- Fare: This variable holds the passenger's fare.
- Cabin: This variable holds the cabin number.
- Embarked: This is the port of embarkation. This variable has three possible values: C, Q, and S (C = Cherbourg; Q = Queenstown; S = Southampton).
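To make the idea of observations and variables concrete, here is a minimal sketch in Python that reads a small table laid out like the variable description above. The three rows are made-up placeholders, not real passenger records, and the column order is an assumption based on the list above; the actual Kaggle file may contain additional columns.

```python
import csv
import io

# Hypothetical sample in the column layout described above.
# The rows are invented placeholders, not real passenger records.
SAMPLE_CSV = """Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,3,Passenger A,male,30,0,0,T-0001,8.05,,S
1,1,Passenger B,female,41,1,0,T-0002,80.00,C10,C
1,2,Passenger C,female,25,0,1,T-0003,21.00,,Q
"""

reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
variables = reader.fieldnames   # each column name is a variable
observations = list(reader)     # each row (a dict) is one observation

print(len(observations), "observations,", len(variables), "variables")
print(observations[0]["Sex"])   # value of the variable Sex for the first observation
```

To try the same thing on a downloaded file, replace the in-memory sample with something like open("train.csv", newline=""), assuming the file sits in the working directory.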
For predictive purposes, there are two kinds of variables:
- Output variables or target variables: These are the variables we want to predict. In the passenger list, the variable Survived is an output variable. This means that we want to predict if a passenger will survive the sinking.
- Input variables: These are the variables we'll use to create a prediction. In the passenger list, the variable Sex is an input variable.
Rattle refers to output variables as target variables. To avoid confusion, we're going to use the term target variable throughout this book. In this dataset, we have ten input variables (Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, and Embarked) that we want to use to predict whether a passenger survived. So in this example, our target variable is Survived.
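The split between input variables and the target variable can be sketched in a few lines of Python. This is only an illustration of the roles described above, not how Rattle handles them internally; the helper split_observation and the sample row are hypothetical.

```python
TARGET = "Survived"
INPUTS = ["Pclass", "Name", "Sex", "Age", "SibSp",
          "Parch", "Ticket", "Fare", "Cabin", "Embarked"]

def split_observation(row):
    """Return (inputs, target) for one observation given as a dict."""
    inputs = {name: row[name] for name in INPUTS}
    target = row.get(TARGET)  # None when the target value is unknown
    return inputs, target

# Hypothetical observation with made-up values.
row = {"Survived": "1", "Pclass": "1", "Name": "Passenger B", "Sex": "female",
       "Age": "41", "SibSp": "1", "Parch": "0", "Ticket": "T-0002",
       "Fare": "80.00", "Cabin": "C10", "Embarked": "C"}

inputs, target = split_observation(row)
print(len(inputs), "input variables; target =", target)
```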
In Titanic: Machine Learning from Disaster, the passenger list is divided into two CSV files: train.csv and test.csv. The file train.csv contains 891 observations or passengers; for each observation, we have a value for the variable Survived. It means that we know if the passenger survived or not. The second file, test.csv, contains only 418 passengers, but in this file, we don't have the variable Survived. This means that we don't know if the passenger survived or not. The objective of the competition is to use the training file to create a model that predicts the value of the Survived variable in the test file. For this reason, the variable Survived is the target variable.
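One quick way to check this difference between the two files is to count the observations and look for the target column in the header. The following sketch assumes a hypothetical helper named summarize and uses tiny in-memory samples with made-up values in place of the real files.

```python
import csv
import io

def summarize(csv_file, target="Survived"):
    """Return (number of observations, whether the target column is present)."""
    reader = csv.DictReader(csv_file)
    fieldnames = reader.fieldnames or []   # empty list if the file has no header
    n_rows = sum(1 for _ in reader)
    return n_rows, target in fieldnames

# Made-up miniature stand-ins for train.csv and test.csv.
TRAIN_SAMPLE = "Survived,Pclass,Sex,Age\n0,3,male,30\n1,1,female,41\n"
TEST_SAMPLE = "Pclass,Sex,Age\n2,female,25\n"

train_summary = summarize(io.StringIO(TRAIN_SAMPLE))
test_summary = summarize(io.StringIO(TEST_SAMPLE))
print(train_summary)  # (2, True)
print(test_summary)   # (1, False)
```

Run against the downloaded files, for example with summarize(open("train.csv", newline="")), we would expect the counts quoted above, with the target present only in the training file.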
Rattle distinguishes two types of variables—numeric and categorical. A numeric variable describes a numerically measured value. In this dataset, Age, SibSp, Parch, and Fare are numeric variables.
A categorical variable is a variable whose values fall into a limited set of categories. There are two types of categorical variables: ordinal and nominal. In an ordinal categorical variable, the categories have a natural order and are often represented by a number. In our dataset, Pclass is an ordinal categorical variable with three different categories or possible values: 1, 2, and 3.
In a nominal categorical variable, each category is represented by a word label and the categories have no inherent order. In this dataset, Sex is an example of this type. This variable has only two possible values, and the values are the labels themselves, in this case, male and female.
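The distinction between numeric, ordinal, and nominal variables affects how raw values should be interpreted. The sketch below is one possible way to record these types by hand in Python; the names VARIABLE_TYPES, ORDINAL_LEVELS, and parse_value are hypothetical and do not reflect how Rattle stores or detects variable types.

```python
# Types for the variables classified in the text above.
VARIABLE_TYPES = {
    "Age": "numeric", "SibSp": "numeric", "Parch": "numeric", "Fare": "numeric",
    "Pclass": "ordinal",
    "Sex": "nominal",
}

# Ordered category levels for each ordinal variable.
ORDINAL_LEVELS = {"Pclass": ["1", "2", "3"]}

def parse_value(name, raw):
    """Interpret a raw string according to the declared variable type."""
    kind = VARIABLE_TYPES[name]
    if kind == "numeric":
        return float(raw)                       # a measured quantity
    if kind == "ordinal":
        return ORDINAL_LEVELS[name].index(raw)  # rank within the ordered levels
    return raw                                  # nominal: keep the label as is

print(parse_value("Age", "22"))
print(parse_value("Pclass", "3"))
print(parse_value("Sex", "female"))
```

Declaring Pclass as ordinal rather than numeric keeps the order of the classes without implying that the difference between first and second class is a measured quantity.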