- Mastering Predictive Analytics with scikit-learn and TensorFlow
- Alan Fontaine
- 471 words
- 2021-07-23 16:42:24
The diamond dataset
Let's make actual predictions about diamond prices by using different ensemble learning models. We will use the diamonds dataset (which can be found here: https://www.kaggle.com/shivam2503/diamonds). This dataset contains the prices, among other features, of almost 54,000 diamonds. The following are the features that we have in this dataset:
- Feature information: A DataFrame with 53,940 rows and 10 variables
- price: Price in US dollars
The following are the nine predictive features:
- carat: This feature represents the weight of the diamond (0.2-5.01)
- cut: This feature represents the quality of the cut (Fair, Good, Very Good, Premium, and Ideal)
- color: This feature represents the diamond color, from J (worst) to D (best)
- clarity: This feature represents a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
- x: This feature represents the length of the diamond in mm (0-10.74)
- y: This feature represents the width of the diamond in mm (0-58.9)
- z: This feature represents the depth of the diamond in mm (0-31.8)
- depth: This feature represents the total depth percentage, z/mean(x, y) = 2 * z/(x + y) (43-79)
- table: This feature represents the width of the top of the diamond relative to its widest point (43-95)
The x, y, and z variables denote the size of the diamonds.
The libraries that we will use are numpy, matplotlib, and pandas. They can be imported with the following lines of code:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline
The following lines of code load the raw dataset:
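A minimal sketch of that step, assuming the Kaggle CSV has been saved locally as diamonds.csv (the file name and path are assumptions, not from the original):

# Load the raw diamonds dataset from a local CSV file
diamonds = pd.read_csv('diamonds.csv')
diamonds.head()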
The preceding dataset has some numerical features and some categorical features. Here, 53,940 is the exact number of samples that we have in this dataset. To encode the information in the categorical features, we use the one-hot encoding technique to transform them into dummy features. The reason for this is that scikit-learn works only with numbers.
The following lines of code transform the categorical features to numbers:
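A sketch of that transformation, assuming the DataFrame is named diamonds (the variable name is an assumption):

# One-hot encode the three categorical features into dummy (0/1) columns
diamonds = pd.get_dummies(diamonds, columns=['cut', 'color', 'clarity'])
diamonds.head()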
Here, we can see how to do this with the get_dummies function from pandas. In the resulting dataset, each category of a categorical variable becomes its own dummy column.
Here, for each of the categories in a categorical variable, we have a dummy feature. Its value is 1 when the category is present and 0 when it is not present in a particular diamond.
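As a quick illustration of this 0/1 pattern, consider a toy example with hypothetical values (not taken from the dataset):

# Toy example: get_dummies produces one column per category
sample = pd.DataFrame({'cut': ['Ideal', 'Premium', 'Ideal']})
pd.get_dummies(sample, dtype=int)
#    cut_Ideal  cut_Premium
# 0          1            0
# 1          0            1
# 2          1            0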
Now, to rescale the data, we will use the RobustScaler transformer to bring all the features to a similar scale; it uses statistics that are robust to outliers (the median and the interquartile range).
The following lines of code import the train_test_split function and the RobustScaler class:
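These imports look as follows:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler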
Here, we extract the features into the X matrix, define the target, and then use the train_test_split function from scikit-learn to partition the data into training and testing sets, as sketched below.
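A minimal sketch of that step, assuming the one-hot-encoded DataFrame is named diamonds and the target column is price (the variable names, test size, and random seed are assumptions):

# Separate the predictive features (X) from the target (y)
X = diamonds.drop('price', axis=1)
y = diamonds['price']

# Partition into training and testing sets; a 20% test size is assumed here
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Fit the scaler on the training set only, then apply it to both sets
scaler = RobustScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)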