- Predictive Analytics Using Rattle and Qlik Sense
- Ferran Garcia Pagans
- 910 words
- 2021-07-16 13:40:18
Datasets, observations, and variables
A dataset is a collection of data that we're going to use to create new predictions. There are different kinds of datasets. When we use a dataset for predictive analytics, we can consider a dataset like a table with columns and rows.
In a real-life problem, our dataset would be related to the problem we want to solve. If we want to predict which customer is most likely to buy a product, our dataset would probably contain customer and historic sales data. When we're learning, we need to find an appropriate dataset for our learning purposes. You can find a lot of example datasets on the Internet; in this chapter, and in the following one, we're going to use the Titanic passenger list as a dataset that has been taken from Kaggle.
Note
Kaggle is the world's largest community of data scientists. On this website, you can even find data science competitions. We're not going to use the term data science in this book, because there are a lot of new terms around analytics and we want to focus on just a few to avoid noise. Currently, this term refers to an engineering area dedicated to collecting, cleaning, and manipulating data to discover new knowledge. On www.kaggle.com, you can find different types of competitions; there are introductory competitions for beginners and competitions with monetary prizes. You can access a competition, download the data and the problem description, and create your own solutions. An example of an introductory Kaggle competition is Titanic: Machine Learning from Disaster. You can download this dataset at https://www.kaggle.com/c/titanic-gettingStarted. We're going to use this dataset in this chapter and in Chapter 3, Exploring and Understanding Your Data.
A dataset is a matrix where each row is an observation or member of the dataset. In the Titanic passenger list, each observation contains the data related to a passenger. In a dataset, each column is a particular variable. In the passenger list, the column Sex is a variable. You can see a part of the Titanic passenger list in the following screenshot:

Before we start, we need to understand our dataset. When we download a dataset from the Web, it usually has a variable description document.
The following is the variable description for our dataset:
- Survived: If the passenger survived, the value of this variable is set to 1, and if the passenger did not survive, it is set to 0.
- Pclass: This stands for the class the passenger was travelling by. This variable can have three possible values: 1, 2, and 3 (1 = first class; 2 = second class; 3 = third class).
- Name: This variable holds the name of the passenger.
- Sex: This variable has two possible values: male or female.
- Age: This variable holds the age of the passenger.
- SibSp: This holds the number of siblings/spouses aboard.
- Parch: This holds the number of parents/children aboard.
- Ticket: This holds the ticket number.
- Fare: This variable holds the passenger's fare.
- Cabin: This variable holds the cabin number.
- Embarked: This is the port of embarkation. This variable has three possible values: C, Q, and S (C = Cherbourg; Q = Queenstown; S = Southampton).
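As a rough illustration of this layout, the following Python sketch reads a tiny CSV sample with the column names listed above and shows that each row is an observation and each column is a variable. The two sample rows and the name SAMPLE_CSV are invented placeholders, not records from the Kaggle file.

```python
import csv
import io

# Invented sample rows that follow the variable description above;
# they are placeholders, not real records from train.csv.
SAMPLE_CSV = """Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,3,Passenger A,male,30,0,0,T-0001,8.05,,S
1,1,Passenger B,female,42,1,0,T-0002,71.00,C10,C
"""

# Each row becomes one observation: a dict mapping variable name -> value.
observations = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# The header row gives the variables (columns) of the dataset.
variables = list(observations[0].keys())

# A single variable is the same column read across every observation.
sex_column = [row["Sex"] for row in observations]

print(len(observations), "observations,", len(variables), "variables")
print("Sex column:", sex_column)
```

Note that a missing value, such as the empty Cabin in the first sample row, is read as an empty string.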
For predictive purposes, there are two kinds of variables:
- Output variables or target variables: These are the variables we want to predict. In the passenger list, the variable Survived is an output variable. This means that we want to predict if a passenger will survive the sinking.
- Input variables: These are the variables we'll use to create a prediction. In the passenger list, the variable Sex is an input variable.
Rattle refers to output variables as target variables. To avoid confusion, we're going to use the term target variable throughout this book. In this dataset, we have ten input variables (Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, and Embarked) that we want to use to predict whether a passenger survived. So in this example, our target variable is Survived.
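The split between the target variable and the input variables can be sketched in a few lines of Python. This is only an illustration of the idea, not how Rattle does it internally; the helper split_target and the sample observation are hypothetical.

```python
TARGET = "Survived"

# One invented observation using the column names from the variable description.
observation = {
    "Survived": "1", "Pclass": "1", "Name": "Passenger B", "Sex": "female",
    "Age": "42", "SibSp": "1", "Parch": "0", "Ticket": "T-0002",
    "Fare": "71.00", "Cabin": "C10", "Embarked": "C",
}


def split_target(row, target=TARGET):
    """Return (inputs, target_value) for one observation."""
    inputs = {name: value for name, value in row.items() if name != target}
    return inputs, row[target]


inputs, survived = split_target(observation)
print(sorted(inputs))  # the ten input variable names
print(survived)        # the value we would like to predict
```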
In Titanic: Machine Learning from Disaster, the passenger list is divided into two CSV files: train.csv and test.csv. The file train.csv contains 891 observations or passengers; for each observation, we have a value for the variable Survived, which means that we know whether the passenger survived. The second file, test.csv, contains only 418 passengers, and in this file we don't have the variable Survived, so we don't know whether those passengers survived. The objective of the competition is to use the training file to create a model that predicts the value of the Survived variable in the test file. For this reason, the variable Survived is the target variable.
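To make the train/test workflow concrete, here is a minimal Python sketch. It assumes a deliberately naive "model" that always predicts the most frequent Survived value seen in the training rows. The rows are invented miniature stand-ins for train.csv and test.csv, and the function names are hypothetical; a real model would use the input variables.

```python
from collections import Counter

# Invented miniature stand-ins for train.csv and test.csv.
train_rows = [
    {"Sex": "male", "Pclass": "3", "Survived": "0"},
    {"Sex": "female", "Pclass": "1", "Survived": "1"},
    {"Sex": "male", "Pclass": "2", "Survived": "0"},
]
test_rows = [
    {"Sex": "female", "Pclass": "3"},  # no Survived column here
    {"Sex": "male", "Pclass": "1"},
]


def fit_majority_baseline(rows, target="Survived"):
    """'Train' by remembering the most frequent target value."""
    counts = Counter(row[target] for row in rows)
    return counts.most_common(1)[0][0]


def predict(model, rows):
    """Predict the remembered value for every observation."""
    return [model for _ in rows]


model = fit_majority_baseline(train_rows)
predictions = predict(model, test_rows)
print(predictions)
```

The point of the sketch is the data flow: the target values are available only in the training rows, and the model fills them in for the test rows.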
Rattle distinguishes two types of variables—numeric and categorical. A numeric variable describes a numerically measured value. In this dataset, Age, SibSp, Parch, and Fare are numeric variables.
A categorical variable is a variable whose values can be grouped into different categories. There are two types of categorical variables: ordinal and nominal. In an ordinal categorical variable, the categories have a natural order and are represented by a number. In our dataset, Pclass is an ordinal categorical variable with three different categories or possible values: 1, 2, and 3.
In a nominal categorical variable, each category is represented by a word label. In this dataset, Sex is an example of this type. This variable has only two possible values, and the values are the labels themselves, in this case, male and female.
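The distinction between numeric and categorical variables can be sketched in Python as follows. This is a simple heuristic written for illustration, not Rattle's own detection logic; the sample values, the helper names, and the OVERRIDES table are assumptions. It also shows why a variable such as Pclass needs care: its values look like numbers, but they name categories.

```python
def is_number(text):
    """Return True if the text can be parsed as a floating-point number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def guess_type(values):
    """Guess 'numeric' if every non-empty value parses as a number."""
    present = [v for v in values if v != ""]
    if present and all(is_number(v) for v in present):
        return "numeric"
    return "categorical"


# Invented sample values for four of the variables described above.
columns = {
    "Age": ["30", "42", ""],
    "Fare": ["8.05", "71.00", "13.00"],
    "Sex": ["male", "female", "male"],
    "Pclass": ["3", "1", "2"],
}

# Columns whose values are digits but really name categories.
OVERRIDES = {"Pclass": "categorical"}

types = {name: OVERRIDES.get(name, guess_type(vals)) for name, vals in columns.items()}
print(types)
```

Without the override, the heuristic would label Pclass as numeric, which is why the variable description document matters when we assign types.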