- Mastering Predictive Analytics with scikit-learn and TensorFlow
- Alan Fontaine
- 2021-07-23 16:42:24
The diamond dataset
Let's make actual predictions about diamond prices using different ensemble learning models. We will use the diamonds dataset (available at https://www.kaggle.com/shivam2503/diamonds), which records the price, among other features, of almost 54,000 diamonds. The dataset is structured as follows:
- Feature information: A dataframe with 53,940 rows and 10 variables
- price: Price in US dollars (the target that we want to predict)
The following are the nine predictive features:
- carat: This feature represents the weight of the diamond in carats (0.2-5.01)
- cut: This feature represents the quality of the cut (Fair, Good, Very Good, Premium, and Ideal)
- color: This feature represents the diamond color, from J (worst) to D (best)
- clarity: This feature represents a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
- x: This feature represents the length of the diamond in mm (0-10.74)
- y: This feature represents the width of the diamond in mm (0-58.9)
- z: This feature represents the depth of the diamond in mm (0-31.8)
- depth: This feature represents the total depth, z/mean(x, y) = 2 * z/(x + y), expressed as a percentage (43-79)
- table: This feature represents the width of the top of the diamond relative to its widest point (43-95)
The x, y, and z variables denote the size of the diamonds.
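To make the depth definition concrete, here is a minimal sketch that computes it from x, y, and z. The function name depth_percentage and the measurements are hypothetical, chosen only for illustration; they are not taken from the dataset.

```python
def depth_percentage(x, y, z):
    """Total depth as a percentage: z divided by the mean of x and y."""
    # mean(x, y) = (x + y) / 2, so z / mean(x, y) = 2 * z / (x + y)
    return 100 * 2 * z / (x + y)


# Hypothetical diamond: 4.0 mm long, 4.0 mm wide, 2.4 mm deep
example_depth = depth_percentage(4.0, 4.0, 2.4)
print(round(example_depth, 1))
```

Note that the listed ranges for x and y start at 0, so a row with both values equal to 0 would make this formula divide by zero.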
The libraries that we will use are numpy, matplotlib, and pandas. For importing these libraries, the following lines of code can be used:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline
The following screenshot shows the lines of code that we use to load the raw dataset:
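As a minimal sketch of this loading step, the code below reads the data with pandas. It assumes the Kaggle file is a CSV whose first, unnamed column holds the row number. The helper name load_diamonds is hypothetical, and the two rows are made-up stand-ins that only mimic the assumed column layout so that the example runs without the real file.

```python
import io

import pandas as pd


def load_diamonds(source):
    """Read the diamonds CSV from a path or file-like object."""
    # Assumption: the first column is a row number, so we use it as the index.
    return pd.read_csv(source, index_col=0)


# Made-up rows in the assumed column layout (not real records)
sample_csv = io.StringIO(
    ",carat,cut,color,clarity,depth,table,price,x,y,z\n"
    "1,0.30,Ideal,E,SI1,61.5,55.0,500,4.30,4.33,2.65\n"
    "2,0.90,Premium,H,VS2,62.0,58.0,3500,6.15,6.11,3.80\n"
)

diamonds = load_diamonds(sample_csv)
print(diamonds.shape)   # (rows, columns)
print(diamonds.dtypes)  # cut, color, and clarity are read as text (object)
```

With the downloaded file, the same helper would be called with its local path, for example load_diamonds("diamonds.csv"), where the filename is an assumption.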
The preceding dataset has some numerical features and some categorical features, and it contains exactly 53,940 samples. To encode the information in the categorical features, we use one-hot encoding, which transforms each categorical feature into a set of dummy features. We do this because scikit-learn estimators expect numerical input.
The following screenshot shows the lines of code used for the transformation of the categorical features to numbers:
This can be done with the get_dummies function from pandas. The final dataset looks similar to the one in the following screenshot:
Each category of a categorical variable now has its own dummy feature. Its value is 1 when that category applies to a particular diamond and 0 otherwise.
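The encoding step can be sketched as follows. This is one possible way to call get_dummies, not necessarily the exact call used in the book, and the three rows are made-up stand-ins for the real data.

```python
import pandas as pd

# Made-up rows with the three categorical columns of the diamonds dataset
diamonds = pd.DataFrame({
    "carat": [0.30, 0.90, 1.20],
    "cut": ["Ideal", "Premium", "Good"],
    "color": ["E", "H", "D"],
    "clarity": ["SI1", "VS2", "IF"],
    "price": [500, 3500, 9000],
})

# Replace each categorical column with one 0/1 column per category
encoded = pd.get_dummies(
    diamonds,
    columns=["cut", "color", "clarity"],
    dtype=int,
)

print(encoded.columns.tolist())
print(encoded[["cut_Good", "cut_Ideal", "cut_Premium"]])
```

Each row has exactly one 1 among the cut_ columns, one among the color_ columns, and one among the clarity_ columns. Passing drop_first=True to get_dummies would drop one dummy per variable, which some linear models prefer; whether the book does this is not stated here.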
Now, to rescale the data, we will use the RobustScaler class, which centers each feature on its median and scales it by its interquartile range, so that all the features end up on a similar scale.
The following screenshot shows the lines of code used for importing the train_test_split function and the RobustScaler class:
Here, we extract the features into the X matrix, define the target, and then use the train_test_split function from scikit-learn to partition the data into two sets, one for training and one for testing.
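A minimal sketch of the split and rescaling steps is shown below. The data is synthetic, generated only so that the example runs on its own, and the test_size and random_state values are assumptions rather than the book's settings.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in for the encoded diamonds data (not real records)
rng = np.random.default_rng(0)
n_samples = 100
data = pd.DataFrame({
    "carat": rng.uniform(0.2, 5.01, n_samples),
    "depth": rng.uniform(43, 79, n_samples),
    "table": rng.uniform(43, 95, n_samples),
})
data["price"] = 4000 * data["carat"] + rng.normal(0, 100, n_samples)

# Separate the features (X) from the target (y)
target_name = "price"
X = data.drop(columns=target_name)
y = data[target_name]

# Hold out 10% of the rows for testing (the fraction is an assumption)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=123
)

# Fit the scaler on the training set only, then apply it to both sets
scaler = RobustScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

Fitting the scaler on the training set alone keeps information from the test set out of the preprocessing step. After scaling, each training feature has a median of 0.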