
The diamond dataset

Let's make actual predictions about diamond prices by using different ensemble learning models. We will use the diamonds dataset (which can be found here: https://www.kaggle.com/shivam2503/diamonds). This dataset contains the prices, among other features, of almost 54,000 diamonds. The following is a summary of the dataset:

  • Feature information: A dataframe with 53,940 rows and 10 variables
  • Price: Price in US dollars

The following are the nine predictive features:

  • carat: This feature represents the weight of the diamond (0.2-5.01)
  • cut: This feature represents the quality of the cut (Fair, Good, Very Good, Premium, and Ideal)
  • color: This feature represents the diamond color, from J (worst) to D (best)
  • clarity: This feature represents a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
  • x: This feature represents the length of the diamond in mm (0-10.74)
  • y: This feature represents the width of the diamond in mm (0-58.9)
  • z: This feature represents the depth of the diamond in mm (0-31.8)
  • depth: This feature represents the total depth, expressed as a percentage, z/mean(x, y) = 2 * z/(x + y) (43-79)
  • table: This feature represents the width of the top of the diamond relative to the widest point (43-95)

The x, y, and z variables denote the size of the diamonds.
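
The depth formula from the list above can be checked with a short calculation. The following is a minimal sketch; the function name depth_percentage and the measurements are hypothetical, chosen only for illustration. Because the listed ranges for x and y start at 0, the sketch guards against a zero denominator.

```python
def depth_percentage(x: float, y: float, z: float) -> float:
    """Total depth as a percentage: 100 * z / mean(x, y) = 100 * 2 * z / (x + y)."""
    if x + y == 0:
        # The listed ranges for x and y start at 0, so this case can occur.
        raise ValueError("x + y must be non-zero")
    return 100 * 2 * z / (x + y)


# Hypothetical measurements in mm (not taken from the dataset).
example_depth = depth_percentage(x=4.0, y=4.0, z=2.4)
print(round(example_depth, 1))
```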

The libraries that we will use are numpy, matplotlib, and pandas. For importing these libraries, the following lines of code can be used:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline

The following screenshot shows the lines of code that we use to call the raw dataset:
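
As a rough sketch of this step, a CSV file with the columns described earlier could be loaded with pandas' read_csv function. The two rows below are hypothetical stand-ins so that the example runs on its own; with the real download, the in-memory buffer would be replaced by the path to the file (the name diamonds.csv is an assumption).

```python
import io

import pandas as pd

# Two hypothetical rows in the column layout described above.
# With the real file, pass its path instead, e.g. pd.read_csv("diamonds.csv").
sample_csv = io.StringIO(
    "carat,cut,color,clarity,depth,table,price,x,y,z\n"
    "0.30,Ideal,E,SI2,61.5,55.0,400,4.30,4.32,2.65\n"
    "0.90,Premium,H,VS1,62.0,58.0,3500,6.15,6.12,3.80\n"
)
diamonds = pd.read_csv(sample_csv)

print(diamonds.shape)   # (rows, columns)
print(diamonds.dtypes)  # numeric columns versus text (categorical) columns
```

If the downloaded file carries an extra index column, passing index_col=0 to read_csv is one way to keep it out of the features.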

The preceding dataset has some numerical features and some categorical features. Here, 53,940 is the exact number of samples that we have in this dataset. To encode the information in the categorical features, we use the one-hot encoding technique, which transforms each categorical feature into a set of dummy features. We do this because scikit-learn estimators expect numerical input.

The following screenshot shows the lines of code used for the transformation of the categorical features to numbers:

Here, we can see how we can do this with the get_dummies function from pandas. The final dataset looks similar to the one in the following screenshot:

Here, for each of the categories in the categorical variable, we have dummy features. The value here is 1 when the category is present and 0 when the category is not present in the particular diamond.
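
The encoding described above can be sketched on a few rows as follows. The rows are hypothetical, and the choice of arguments (columns and dtype=int) is an assumption; the original code may call get_dummies differently, for example with drop_first=True.

```python
import pandas as pd

# Three hypothetical diamonds with the categorical features described above.
diamonds = pd.DataFrame(
    {
        "carat": [0.30, 0.90, 1.20],
        "cut": ["Ideal", "Premium", "Good"],
        "color": ["E", "H", "D"],
        "clarity": ["SI2", "VS1", "IF"],
        "price": [400, 3500, 9000],
    }
)

# Replace each categorical column with one 0/1 dummy column per category.
encoded = pd.get_dummies(diamonds, columns=["cut", "color", "clarity"], dtype=int)

print(encoded.columns.tolist())
print(encoded[["cut_Good", "cut_Ideal", "cut_Premium"]])
```

Each original categorical column disappears and is replaced by columns named after the feature and the category, such as cut_Ideal.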

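Now, to rescale the data, we will use the RobustScaler class from scikit-learn to transform all the features to a similar scale. RobustScaler centers each feature on its median and scales it by its interquartile range, which makes it less sensitive to outliers than scaling by the mean and standard deviation.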

The following screenshot shows the lines of code used for importing the train_test_split function and the RobustScaler class:

Here, we extract the features into the X matrix, specify the target, and then use the train_test_split function from scikit-learn to partition the data into two sets.
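
The split and the rescaling could look roughly like the following sketch. The data is synthetic, and the test_size and random_state values are assumptions, not the values used in the original code. Fitting the scaler on the training set only is a common convention; the original may order these steps differently.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in for the encoded dataset (values are not real diamonds).
rng = np.random.default_rng(0)
data = pd.DataFrame(
    {
        "carat": rng.uniform(0.2, 5.01, size=100),
        "table": rng.uniform(43, 95, size=100),
        "price": rng.uniform(300, 19000, size=100),
    }
)

# Extract the feature matrix X and the target y.
target_name = "price"
X = data.drop(columns=target_name)
y = data[target_name]

# Partition the data into a training set and a testing set (assumed 90/10 split).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=123
)

# Fit the scaler on the training set only, then apply it to both sets.
scaler = RobustScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```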
