Datasets, observations, and variables

A dataset is a collection of data that we're going to use to create new predictions. There are different kinds of datasets. When we use a dataset for predictive analytics, we can consider a dataset like a table with columns and rows.

In a real-life problem, our dataset would be related to the problem we want to solve. If we want to predict which customer is most likely to buy a product, our dataset would probably contain customer and historic sales data. When we're learning, we need to find an appropriate dataset for our learning purposes. You can find a lot of example datasets on the Internet; in this chapter, and in the following one, we're going to use the Titanic passenger list as a dataset that has been taken from Kaggle.

Note

Kaggle is the world's largest community of data scientists. On this website, you can even find data science competitions. We're not going to use the term data science in this book, because there are a lot of new terms around analytics and we want to focus on just a few to avoid noise. Currently, we use this term to refer to an engineering area dedicated to collecting, cleaning, and manipulating data to discover new knowledge. On www.kaggle.com, you can find different types of competitions; there are introductory competitions for beginners and competitions with monetary prizes. You can access a competition, download the data and the problem description, and create your own solutions. An example of an introductory Kaggle competition is Titanic: Machine Learning from Disaster. You can download this dataset at https://www.kaggle.com/c/titanic-gettingStarted. We're going to use this dataset in this chapter and in Chapter 3, Exploring and Understanding Your Data.

A dataset is a matrix where each row is an observation or member of the dataset. In the Titanic passenger list, each observation contains the data related to a passenger. In a dataset, each column is a particular variable. In the passenger list, the column Sex is a variable. You can see a part of the Titanic passenger list in the following screenshot:

Before we start, we need to understand our dataset. When we download a dataset from the Web, it usually has a variable description document.

The following is the variable description for our dataset:

  • Survived: If the passenger survived, the value of this variable is set to 1, and if the passenger did not survive, it is set to 0.
  • Pclass: This stands for the class the passenger was travelling by. This variable can have three possible values: 1, 2, and 3 (1 = first class; 2 = second class; 3 = third class).
  • Name: This variable holds the name of the passenger.
  • Sex: This variable has two possible values: male or female.
  • Age: This variable holds the age of the passenger.
  • SibSp: This holds the number of siblings/spouses aboard.
  • Parch: This holds the number of parents/children aboard.
  • Ticket: This holds the ticket number.
  • Fare: This variable holds the passenger's fare.
  • Cabin: This variable holds the cabin number.
  • Embarked: This is the port of embarkation. This variable has three possible values: C, Q, and S (C = Cherbourg; Q = Queenstown; S = Southampton).
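The table view of a dataset can be sketched in a few lines of Python. This is only an illustration: the three rows below are invented values laid out in the column order of the variable description, not real passenger records, and the downloaded file may contain additional columns. The helper name load_observations is hypothetical.

```python
import csv
import io

# Invented sample rows that follow the variable description above.
# The values are illustrative only and are not real passenger records.
SAMPLE_CSV = """Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,3,"Example, Mr. A",male,30,1,0,T-0001,8.05,,S
1,1,"Example, Mrs. B",female,41,1,0,T-0002,70.00,C10,C
1,2,"Example, Miss. C",female,,0,0,T-0003,13.00,,Q
"""


def load_observations(text):
    """Return the rows of a CSV string as a list of dicts, one per observation."""
    return list(csv.DictReader(io.StringIO(text)))


observations = load_observations(SAMPLE_CSV)
variables = list(observations[0].keys())

# Each row is an observation (a passenger); each column is a variable.
print(len(observations), "observations")
print(len(variables), "variables:", variables)

# Reading one variable means reading one column across all observations.
sex_column = [row["Sex"] for row in observations]
print(sex_column)
```

Note that the third invented row has an empty Age; a downloaded dataset can contain blank values like this, and they are kept here as empty strings.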

For predictive purposes, there are two kinds of variables:

  • Output variables or target variables: These are the variables we want to predict. In the passenger list, the variable Survived is an output variable. This means that we want to predict if a passenger will survive the sinking.
  • Input variables: These are the variables we'll use to create a prediction. In the passenger list, the variable Sex is an input variable.

Rattle refers to output variables as target variables. To avoid confusion, we're going to use the term target variable throughout this book. In this dataset, we have ten input variables (Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, and Embarked) that we want to use to predict whether a passenger survived. So in this example, our target variable is Survived.
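As a rough sketch of this distinction, the following Python snippet separates one observation into its input variables and its target value. The function name split_target and the passenger record are hypothetical; the values are invented for illustration.

```python
TARGET = "Survived"


def split_target(observation, target=TARGET):
    """Split one observation (a dict) into its input variables and its target value."""
    inputs = {name: value for name, value in observation.items() if name != target}
    return inputs, observation[target]


# Invented observation; the values are illustrative only.
passenger = {
    "Survived": "1", "Pclass": "1", "Name": "Example, Mrs. B", "Sex": "female",
    "Age": "41", "SibSp": "1", "Parch": "0", "Ticket": "T-0002",
    "Fare": "70.00", "Cabin": "C10", "Embarked": "C",
}

inputs, target = split_target(passenger)
print(len(inputs), "input variables:", list(inputs))
print("target value:", target)
```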

In Titanic: Machine Learning from Disaster, the passenger list is divided into two CSV files: train.csv and test.csv. The file train.csv contains 891 observations or passengers; for each observation, we have a value for the variable Survived, which means that we know whether the passenger survived. The second file, test.csv, contains only 418 passengers, but in this file, we don't have the variable Survived, so we don't know whether the passenger survived. The objective of the competition is to use the training file to create a model that predicts the value of the Survived variable in the test file. For this reason, the variable Survived is the target variable.
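One way to see the difference between the two files is to compare their column names. The sketch below does this in Python on two header lines written out by hand from the variable description in this section; the actual downloaded files may contain additional columns, and the function names are hypothetical.

```python
import csv
import io

# Header lines assumed from the variable description; the real files may differ.
TRAIN_CSV = "Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\n"
TEST_CSV = "Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\n"


def column_names(csv_text):
    """Return the column names from the first line of a CSV string."""
    return next(csv.reader(io.StringIO(csv_text)))


def columns_to_predict(train_text, test_text):
    """Return the columns present in the training data but absent from the test data."""
    test_columns = set(column_names(test_text))
    return [name for name in column_names(train_text) if name not in test_columns]


missing = columns_to_predict(TRAIN_CSV, TEST_CSV)
print(missing)  # prints ['Survived']
```

The column that appears only in the training data is the one the model has to predict for the test data.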

Rattle distinguishes two types of variables—numeric and categorical. A numeric variable describes a numerically measured value. In this dataset, Age, SibSp, Parch, and Fare are numeric variables.

A categorical variable is a variable that can be grouped into different categories. There are two types of categorical variables: ordinal and nominal. In an ordinal categorical variable, the categories have a natural order and are often represented by numbers. In our dataset, Pclass is an ordinal categorical variable with three different categories or possible values: 1, 2, and 3.

In a nominal categorical variable, the categories have no inherent order, and each one is represented by a word label. In this dataset, Sex is an example of this type. This variable has only two possible values, and the values are the labels themselves, in this case, male and female.
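The following Python sketch illustrates why the type of a variable cannot always be read from its raw values. It is not how Rattle decides variable types; it is a simple guess based on whether every non-blank value parses as a number, plus an explicit override table. All names and sample values are invented for illustration.

```python
def is_number(text):
    """Return True if the string can be parsed as a floating-point number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def guess_kind(values):
    """Guess 'numeric' or 'categorical' from raw string values, ignoring blanks."""
    present = [value for value in values if value != ""]
    if present and all(is_number(value) for value in present):
        return "numeric"
    return "categorical"


# Invented column samples; the values are illustrative only.
columns = {
    "Age": ["30", "41", ""],
    "Fare": ["8.05", "70.00", "13.00"],
    "Sex": ["male", "female", "female"],
    "Pclass": ["3", "1", "2"],
}

# Pclass is stored as numbers, but the numbers are category codes,
# so the naive guess has to be overridden by hand.
OVERRIDES = {"Pclass": "categorical (ordinal)", "Sex": "categorical (nominal)"}

kinds = {name: OVERRIDES.get(name, guess_kind(values)) for name, values in columns.items()}
for name, kind in kinds.items():
    print(name, "->", kind)
```

Without the override, Pclass would be guessed as numeric, which is why a variable description document is worth reading before modeling.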
