
Loading data

In Rattle, you have to explicitly declare the role of each variable. A variable can have five different roles:

  • Input: The prediction process will use input variables to predict the value of the target variable.
  • Target: The target variable is the output of our model.
  • Risk: The risk variable is a measure of the size of the risk associated with the target variable.
  • Ident or Identifier: An identifier is a variable that identifies a unique occurrence of an object. In our preceding example, the variable Person is an identifier that identifies a unique person.
  • Ignore: A variable marked Ignore will be ignored by the model. We'll come back to this role later; some variables can create noise and decrease the performance of your predictive model.

Rattle can load data from many data sources. Here are some options:

  • Use the Spreadsheet option to load data from a Comma-Separated Values (CSV) file.
  • Open Database Connectivity (ODBC) is a standard interface for connecting to databases. Using this standard, you can load data from most common databases, including ERP, CRM, and data warehouse systems.
  • Use Attribute-Relation File Format (ARFF) to load data from Weka files. Weka is machine learning software written in Java.
  • You can also load R Datasets; these are tables loaded in memory using R. Currently, Rattle supports R data frames.
  • The RData file option allows you to load an R Dataset that has been saved in a file, usually with the .Rdata extension.
  • With the Library option, Rattle can load sample datasets provided by R packages.
  • The Corpus option allows loading and processing a folder of documents.
  • In the following screenshot, you can see a Script option, but this option is not implemented. It will be available in a future version.

In this book, we're going to load data from CSV files to explain Rattle's functionalities. CSV is widely used to exchange data, and many example datasets on the Internet are available as CSV files.

Loading a CSV File

As we've seen before, we'll use a CSV file from Kaggle to learn how to load a dataset into Rattle. Download the file train.csv from the competition page at http://www.kaggle.com/c/titanic-gettingStarted.

The steps to load the train.csv file are as follows:

  1. Open Rattle and go to the Data tab.
  2. Select Spreadsheet as the data source and click on the Filename folder icon.
  3. Select the file train.csv and click on Open.
  4. Finally, click the Execute button to load the dataset.

Rattle loads the data from the file, analyzes it, and guesses the structure of the dataset. Now we can start exploring the structure of our data. In the Rattle window, we can see that the loaded dataset has 891 observations with nine input variables and Survived as the target variable. We can change the role of each variable with the radio buttons. Note that Age, Cabin, and Embarked have missing values.

We'll focus on these missing values in the next section of this chapter.
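To make the idea of missing values concrete outside the Rattle window, here is a minimal sketch in Python that counts empty cells per column of a CSV file. It does not show how Rattle itself detects missing values. The four sample rows are hypothetical, with made-up values; only the column names come from train.csv. The function name count_missing is also an assumption introduced for illustration.

```python
import csv
import io

# Hypothetical four-row sample: column names as in train.csv, values made up.
SAMPLE = """PassengerId,Survived,Age,Cabin,Embarked
1,0,30,,S
2,1,41,A1,C
3,1,,,S
4,0,52,B2,
"""

def count_missing(csv_text):
    """Return {column name: number of empty cells} for CSV data given as text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = {name: 0 for name in reader.fieldnames}
    for row in reader:
        for name in counts:
            value = row.get(name)
            # A cell counts as missing when it is absent or blank.
            if value is None or value.strip() == "":
                counts[name] += 1
    return counts

missing = count_missing(SAMPLE)
for name, n in missing.items():
    print(name, n)
```

To run the same check on the real file, you could pass the text of train.csv to count_missing instead of SAMPLE.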

The objective with this dataset is to predict whether or not a passenger will survive the sinking of the Titanic. Our target variable is Survived, and it has two possible values:

  • 0 (not survived)
  • 1 (survived)

The variable Name is an identifier that identifies a unique passenger. For this reason, it has 891 observations and 891 different values.
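The property described above, as many distinct values as observations, can be checked with a few lines of Python. This is only a sketch of the idea, not the rule Rattle applies internally; the function name looks_like_identifier and the sample values are hypothetical.

```python
def looks_like_identifier(values):
    """True when every observation has its own distinct, non-empty value."""
    values = list(values)
    if not values or any(v == "" for v in values):
        return False
    return len(set(values)) == len(values)

# Hypothetical columns with three observations each.
names = ["Passenger A", "Passenger B", "Passenger C"]
sexes = ["male", "female", "male"]

print(looks_like_identifier(names))  # True: three observations, three distinct values
print(looks_like_identifier(sexes))  # False: "male" appears twice
```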

Make changes in the roles of the different variables and click on the Execute button to update the data. To save your work, click on the Save button and give it an appropriate file name.

The Save button will save our work, but it will not modify the data source (the CSV file).

In Rattle's Data tab, there are two useful buttons: View and Edit. With these buttons, you can view and edit your data. We also have a Partition check box, as you can see in the following screenshot:

Generally, we split a dataset into three subsets: a training dataset, a validation dataset, and a testing dataset. We're going to leave this option aside for now, and we'll come back to partitioning in Chapter 5, Clustering and Other Unsupervised Learning Methods, and Chapter 6, Decision Trees and Other Supervised Learning Methods.
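As a rough illustration of what a three-way partition means, the following Python sketch shuffles the row indices and cuts them into training, validation, and testing subsets. The 70/15/15 proportions and the seed value of 42 are assumptions made for this example and are not stated in the text above. The function name partition is hypothetical, and the shuffle will not reproduce the rows that Rattle itself selects.

```python
import random

def partition(n_rows, fractions=(0.70, 0.15, 0.15), seed=42):
    """Split row indices 0..n_rows-1 into train, validation, and test lists."""
    indices = list(range(n_rows))
    # A fixed seed makes the split repeatable from run to run.
    random.Random(seed).shuffle(indices)
    n_train = int(n_rows * fractions[0])
    n_valid = int(n_rows * fractions[1])
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]  # whatever remains
    return train, valid, test

train, valid, test = partition(891)
print(len(train), len(valid), len(test))  # 623 133 135
```

Every observation lands in exactly one subset, so the three sizes always add up to the number of rows.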

The last option in data loading is Weight Calculator. This option allows us to enter a formula to give more importance to some observations.

Tip

You can assign roles to variables automatically by modifying their names in the data source. When you load a variable with a name that starts with ID, Rattle automatically gives it the Ident role. In the same way, you can mark a variable as Target, Risk, or Ignore by starting its name with Target, Risk, or Ignore.
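The naming convention in this tip can be sketched as a simple prefix lookup. The Python below illustrates the idea only and is not Rattle's actual implementation. The function name guess_role, the case-insensitive matching, the fallback to Input, and the example column names are all assumptions.

```python
# Prefix-to-role table, checked in order. Matching is case-insensitive here,
# which is an assumption rather than documented behavior.
PREFIX_ROLES = (
    ("id", "Ident"),
    ("target", "Target"),
    ("risk", "Risk"),
    ("ignore", "Ignore"),
)

def guess_role(variable_name):
    """Return the role suggested by the name prefix, or Input if none matches."""
    lowered = variable_name.lower()
    for prefix, role in PREFIX_ROLES:
        if lowered.startswith(prefix):
            return role
    return "Input"

# Hypothetical column names.
for name in ["ID_Person", "Target_Survived", "Risk_Amount", "Ignore_Notes", "Age"]:
    print(name, "->", guess_role(name))
```

A prefix rule this loose would also tag a column such as Identity as an identifier, so it is worth reviewing the roles in the Data tab after loading.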
