The diamond dataset

Let's make actual predictions about diamond prices using different ensemble learning models. We will use the diamonds dataset (available at https://www.kaggle.com/shivam2503/diamonds), which contains the prices, among other features, of almost 54,000 diamonds. The dataset is described as follows:

  • Feature information: A dataframe with 53,940 rows and 10 variables
  • Price: Price in US dollars

The following are the nine predictive features:

  • carat: This feature represents weight of the diamond (0.2-5.01)
  • cut: This feature represents quality of the cut (Fair, Good, Very Good, Premium, and Ideal)
  • color: This feature represents diamond color, from J (worst) to D (best)
  • clarity: This feature represents a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
  • x: This feature represents length of diamond in mm (0-10.74)
  • y: This feature represents width of diamond in mm (0-58.9)
  • z: This feature represents depth of diamond in mm (0-31.8)
  • depth: This feature represents the total depth percentage, z/mean(x, y) = 2 * z/(x + y) (43-79)
  • table: This feature represents width of the top of the diamond relative to the widest point (43-95)

The x, y, and z variables denote the size of the diamonds.
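The depth formula in the list above can be checked by hand. The following is a minimal sketch, assuming that depth is stored as a percentage (which the 43-79 range suggests); the function name depth_percentage and the sample measurements are illustrative and not part of the dataset's tooling:

```python
import math


def depth_percentage(x_mm, y_mm, z_mm):
    """Total depth percentage: 100 * z / mean(x, y) = 100 * 2 * z / (x + y)."""
    mean_xy = (x_mm + y_mm) / 2
    if mean_xy == 0:
        # The documented ranges for x and y start at 0, so guard the division.
        return math.nan
    return 100 * z_mm / mean_xy


# Illustrative measurements in mm (length, width, depth)
print(round(depth_percentage(3.95, 3.98, 2.43), 2))
```

The guard matters because the x and y ranges start at 0, so the ratio is undefined for some rows.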

The libraries that we will use are numpy, matplotlib, and pandas. For importing these libraries, the following lines of code can be used:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline

The following screenshot shows the lines of code that we use to call the raw dataset:
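As a minimal sketch of this loading step, the snippet below reads the data with pandas. It assumes that the Kaggle CSV keeps a row index in an unnamed first column; the three inline rows and the file name diamonds.csv are illustrative stand-ins so that the example runs without the download:

```python
import io

import pandas as pd

# Three illustrative rows in the assumed column layout of the Kaggle file;
# the real file has 53,940 rows. The unnamed first column is a row index.
SAMPLE_CSV = """,carat,cut,color,clarity,depth,table,price,x,y,z
1,0.23,Ideal,E,SI2,61.5,55,326,3.95,3.98,2.43
2,0.21,Premium,E,SI1,59.8,61,326,3.89,3.84,2.31
3,0.23,Good,E,VS1,56.9,65,327,4.05,4.07,2.31
"""


def load_diamonds(source):
    """Read the diamonds CSV, using the first (unnamed) column as the index."""
    return pd.read_csv(source, index_col=0)


diamonds = load_diamonds(io.StringIO(SAMPLE_CSV))
# With the downloaded file (hypothetical path): load_diamonds("diamonds.csv")
print(diamonds.shape)
print(diamonds.head())
```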

The preceding dataset has some numerical features and some categorical features, and it contains exactly 53,940 samples. To encode the information in the categorical features (cut, color, and clarity), we use one-hot encoding to transform them into dummy features. This is necessary because scikit-learn estimators expect numeric input.

The following screenshot shows the lines of code used for the transformation of the categorical features to numbers:

Here, we can see how we can do this with the get_dummies function from pandas. The final dataset looks similar to the one in the following screenshot:

Here, for each of the categories in the categorical variable, we have dummy features. The value here is 1 when the category is present and 0 when the category is not present in the particular diamond.
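A minimal sketch of this encoding on a three-row toy frame is shown below. The toy values are illustrative, and passing dtype=int is an assumption made here so that the dummies print as 1 and 0 (recent pandas versions return booleans by default):

```python
import pandas as pd

# Toy frame with the three categorical columns of the diamonds dataset
diamonds = pd.DataFrame({
    "carat": [0.23, 0.21, 0.23],
    "cut": ["Ideal", "Premium", "Good"],
    "color": ["E", "E", "E"],
    "clarity": ["SI2", "SI1", "VS1"],
    "price": [326, 326, 327],
})

# Replace each categorical column with one 0/1 column per category
encoded = pd.get_dummies(diamonds, columns=["cut", "color", "clarity"], dtype=int)
print(encoded.columns.tolist())
print(encoded[["cut_Good", "cut_Ideal", "cut_Premium"]])
```

Each original categorical column is dropped and replaced by columns named after the pattern column_category, such as cut_Ideal.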

Now, to rescale the data, we will use the RobustScaler class from scikit-learn to bring all the features to a similar scale.

The following screenshot shows the lines of code used for importing the train_test_split function and the RobustScaler class:

Here, we extract the features into the X matrix, define the target, and then use the train_test_split function from scikit-learn to partition the data into two sets: a training set and a testing set.
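The split and the rescaling can be sketched as follows. This is a sketch under stated assumptions rather than the exact code from the screenshot: the data is synthetic, and the test_size and random_state values are arbitrary choices made for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in for the encoded diamonds frame (not real data)
rng = np.random.default_rng(0)
n = 200
data = pd.DataFrame({
    "carat": rng.uniform(0.2, 5.01, n),
    "depth": rng.uniform(43, 79, n),
    "table": rng.uniform(43, 95, n),
})
# Made-up price relationship, only so that there is a target column
data["price"] = (4000 * data["carat"] + rng.normal(0, 500, n)).round()

target_name = "price"
X = data.drop(columns=target_name)  # feature matrix
y = data[target_name]               # target vector

# Hold out 10% of the samples for testing (assumed proportion)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=123
)

# RobustScaler centers on the median and scales by the interquartile range
scaler = RobustScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn the statistics on training data
X_test_scaled = scaler.transform(X_test)        # reuse them on the test data

print(X_train_scaled.shape, X_test_scaled.shape)
```

Fitting the scaler on the training set only is a design choice made in this sketch so that no information from the testing set leaks into the preprocessing step.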