
Understanding decision trees

I chose to start this book with decision trees because I've noticed that the majority of new machine learning practitioners have previous experience in one of two fields—software development, or statistics and mathematics. Decision trees can conceptually resemble some of the concepts software developers are used to, such as nested if-else conditions and binary search trees. As for the statisticians, bear with me—soon, you will feel at home when we reach the chapter about linear models.

What are decision trees?

I think the best way to explain what decision trees are is by showing the rules they generate after they are trained. Luckily, we can access those rules and print them. Here is an example of how decision tree rules look:

Shall I take an umbrella with me?
|--- Chance of Rainy <= 0.6
| |--- UV Index <= 7.0
| | |--- class: False
| |--- UV Index > 7.0
| | |--- class: True
|--- Chance of Rainy > 0.6
| |--- class: True

As you can see, it's basically a set of conditions. If the chance of rain is above 0.6 (60%), then I need to take an umbrella with me. If it is below 0.6, then it all depends on the UV index: if the UV index is above 7, then an umbrella is needed; otherwise, I will be fine without one. Now, you might be thinking that a few nested if-else conditions will do the trick. True, but the main difference here is that I didn't write any of these conditions myself. The algorithm just learned the preceding conditions automatically after it went through the training data.

Of course, for this simple case, anyone can manually go through the data and come up with the same conditions. Nevertheless, when dealing with a bigger dataset, the number of conditions we need to program will quickly grow with the number of columns and the values in each column. At such a scale, it is not possible to manually perform the same job, and an algorithm that can learn the conditions from the data is needed.

Conversely, it is also possible to map a constructed tree back to the nested if-else conditions. This means that you can use Python to build a tree from data, then export the underlying conditions to be implemented in a different language or even to put them in Microsoft Excel if you want.
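
For instance, the umbrella rules shown earlier could be written by hand as the following Python function. This is only an illustrative sketch; the function and variable names are mine, not something scikit-learn produces:

# A hand-written translation of the umbrella rules above into nested
# if-else conditions. The names here are illustrative only.
def should_take_umbrella(chance_of_rain, uv_index):
    if chance_of_rain <= 0.6:
        if uv_index <= 7.0:
            return False
        return True
    return True

should_take_umbrella(0.4, 8.0)  # returns True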

Iris classification

scikit-learn comes loaded with a number of datasets that we can use to test new algorithms. One of these datasets is the Iris set. Iris is a genus of 260–300 species of flowering plants with showy flowers. However, in our dataset, just three species are covered—Setosa, Versicolor, and Virginica. Each example in our dataset has the lengths and widths of the sepal and petal of a plant (the features), along with whether it is a Setosa, a Versicolor, or a Virginica (the target). Our task is to identify the species of a plant given its sepal and petal dimensions. This is a supervised learning problem, since the targets are provided with the data; more specifically, it is a classification problem, since the target takes one of a limited number of predefined values (the three species).

Loading the Iris dataset

Let's now start by loading the dataset:

  1. We import the dataset's module from scikit-learn, and then load the Iris data into a variable, which we are going to call iris as well:
from sklearn import datasets
import pandas as pd
iris = datasets.load_iris()
  2. Using dir, we can see what methods and attributes the dataset provides:
dir(iris)

We get a list that includes the DESCR, data, feature_names, filename, target, and target_names attributes.

It's nice of the data creators to provide descriptions with each one, which we can access using DESCR. This is rarely the case with real-life data, however. Usually, in real life, we need to talk to the people who produced the data in the first place to understand what each value means, or at least use some descriptive statistics to understand the data before using it.
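
For instance, once the data is loaded, a quick statistical summary of the features can be obtained with pandas. This is just a rough sketch that reuses the iris variable loaded earlier:

# Count, mean, standard deviation, and quartiles for each feature
pd.DataFrame(iris.data, columns=iris.feature_names).describe()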

  3. For now, let's print the Iris data's description:
print(iris.DESCR)

Have a look at the description now and try to think of some of the main takeaways from it. I will list my own takeaways afterward:

.. _iris_dataset:
Iris plants dataset
--------------------
Data Set Characteristics:
:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
- sepal length in cm
- sepal width in cm
- petal length in cm
- petal width in cm
- class:
- Iris-Setosa
- Iris-Versicolor
- Iris-Virginica
:Summary Statistics:
============== ==== ==== ======= ===== ====================
                Min  Max   Mean    SD   Class Correlation
============== ==== ==== ======= ===== ====================
sepal length:   4.3  7.9   5.84   0.83    0.7826
sepal width:    2.0  4.4   3.05   0.43   -0.4194
petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
============== ==== ==== ======= ===== ====================
:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.

:Creator: R.A. Fisher

This description holds some useful information for us, and I found the following points the most interesting:

  • The data is composed of 150 rows (or 150 samples). This is a reasonably small dataset. Later on, we will see how to deal with this fact when evaluating our model.
  • The class labels or targets take three values—Iris-Setosa, Iris-Versicolor, and Iris-Virginica. Some classification algorithms can only deal with two class labels; we call them binary classifiers. Luckily, the decision tree algorithm can deal with more than two classes, so we have no problems this time.
  • The data is balanced; there are 50 samples for each class. This is something we need to keep in mind when training and evaluating our model later on.
  • We have four features—sepal length, sepal width, petal length, and petal width—and all four features are numeric. In Chapter 3, Preparing Your Data, we will learn how to deal with non-numeric data.
  • There are no missing attribute values. In other words, none of our samples contains null values. Later on in this book, we will learn how to deal with missing values if we encounter them.
  • The petal dimensions correlate with the class values more than the sepal dimensions. I wish we had never seen this piece of information. Understanding your data is useful, but the problem here is that this correlation is calculated for the entire dataset. Ideally, we will only calculate it for our training data. Anyway, let's ignore this information for now and just use it for a sanity check later on.
  4. It's time to put all the dataset information into one DataFrame.

The feature_names attribute returns the names of our features, while the data attribute holds their values in the form of a NumPy array. Similarly, the target attribute has the values of the target in the form of zeros, ones, and twos, and target_names maps 0, 1, and 2 to Iris-Setosa, Iris-Versicolor, and Iris-Virginica, respectively.

NumPy arrays are efficient to deal with, but they do not allow columns to have names. I find column names to be useful for debugging purposes. I find pandas DataFrames to be more suitable here since we can use column names and combine the features and target into one DataFrame.

Here, we can see the first eight rows we get using iris.data[:8]:

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2]])

The following code uses the data, feature_names, and target attributes to combine all the dataset information into one DataFrame and assign its column names accordingly:

df = pd.DataFrame(
    iris.data,
    columns=iris.feature_names
)

df['target'] = pd.Series(
    iris.target
)

scikit-learn versions 0.23 and up support loading datasets as pandas DataFrames right away. You can do this by setting as_frame=True in datasets.load_iris and its similar data-loading methods. Nevertheless, this has not been tested in this book since version 0.22 is the most stable release at the time of writing.
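
If you do happen to be on scikit-learn 0.23 or later, the shortcut would look roughly like the following sketch; the iris_frame and df_frame names are mine, and this approach is not used in the rest of this book:

# Requires scikit-learn >= 0.23; 'frame' holds the features plus a 'target' column
iris_frame = datasets.load_iris(as_frame=True)
df_frame = iris_frame.frame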
  5. The target column now has the class IDs. However, for more clarity, we can also create a new column called target_names, where we can map our numerical target values to the class names:
df['target_names'] = df['target'].apply(lambda y: iris.target_names[y])
  6. Finally, let's print a sample of six rows to see how our new DataFrame looks. Running the following code in a Jupyter notebook or JupyterLab will just print the contents of the DataFrame; otherwise, you need to surround your code with a print statement. I will assume that a Jupyter notebook environment is used in all later code snippets:
# print(df.sample(n=6))
df.sample(n=6)

This gave me the following random sample:

The sample() method picked six random rows to display. This means that you will get a different set of rows each time you run the same code. Sometimes, we need to get the same random results every time we run the same code. In that case, we use a pseudo-random number generator with a preset seed. A pseudo-random number generator initialized with the same seed will produce the same results every time it runs.

So, set the random_state parameter in the sample() method to 42, as follows:

df.sample(n=6, random_state=42) 

You will get the exact same rows shown earlier.

Splitting the data

Let's split the DataFrame we have just created into two—70% of the records (that is, 105 records) should go into the training set, while 30% (45 records) should go into testing. The choice of 70/30 is arbitrary for now. We will use the train_test_split() function provided by scikit-learn and specify test_size to be 0.3:

from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(df, test_size=0.3)

We can use df_train.shape[0] and df_test.shape[0] to check how many rows there are in the newly created DataFrames. We can also list the columns of the new DataFrames using df_train.columns and df_test.columns (a minimal version of both checks is sketched after the list below). They both have the same six columns:

  • sepal length (cm)
  • sepal width (cm)
  • petal length (cm)
  • petal width (cm)
  • target
  • target_names
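
Here is a minimal sketch of those checks; the printed row counts should be 105 and 45, and both column lists should match the six columns listed earlier:

# Quick sanity checks on the split
print(df_train.shape[0], df_test.shape[0])   # 105 45
print(df_train.columns.tolist())
print(df_test.columns.tolist())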

The first four columns are our features, while the fifth column is our target (or label). The sixth column will not be needed for now. Visually, you could say that we have split our data horizontally (by rows) into training and test sets. Usually, it also makes sense to further split each of our DataFrames vertically (by columns) into two parts—one part for the features, which we usually call x, and another part for the targets, which is usually called y. We will continue to use this x and y naming convention throughout the rest of this book.

Some prefer to use a capital X to illustrate that it is a two-dimensional array (or a DataFrame) and a lowercase y when it is a single-dimensional array (or a series). I find it more practical to stick to a single case.

As you know, the feature_names attribute of iris contains a list of the column names corresponding to our features. We will use this information, along with the target label, to create our x and y sets, as follows:

x_train = df_train[iris.feature_names]
x_test = df_test[iris.feature_names]

y_train = df_train['target']
y_test = df_test['target']

Training the model and using it for prediction

To get a feel for how everything works, we will train our algorithm using its default configuration for now. Later on in this chapter, I will explain the details of the decision tree algorithms and how to configure them.

We need to import DecisionTreeClassifier first, and then create an instance of it, as follows:

from sklearn.tree import DecisionTreeClassifier

# It is common to call the classifier instance clf
clf = DecisionTreeClassifier()

One commonly used synonym for training is fitting. This is how an algorithm uses the training data (x and y) to learn its parameters. All scikit-learn models implement a fit() method that takes x_train and y_train, and DecisionTreeClassifier is no different:

clf.fit(x_train, y_train)

By calling the fit() method, the clf instance is trained and ready to be used for predictions. We then call the predict() method on x_test:

# If y_test is our truth, then let's call our predictions y_test_pred
y_test_pred = clf.predict(x_test)

When predicting, we usually don't know the actual targets (y) for our features (x). That's why we only provide the predict() method here with x_test. In this particular case, we happened to know y_test; nevertheless, we will pretend that we don't know it for now, and only use it later on for evaluation. As our actual targets are called y_test, we will call the predicted ones y_test_pred and compare the two later on.

Evaluating our predictions

As we have y_test_pred, all we need now is to compare it to y_test to check how good our predictions are. If you remember from the previous chapter, there are multiple metrics for evaluating a classifier, such as precision, recall, and accuracy. The Iris dataset is balanced; it has the same number of instances for each class, so the accuracy metric is appropriate here.

Calculating the accuracy, as follows, gives us a score of 0.91:

from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_test_pred)

Did you get a different score from mine? Don't worry. In the Getting a more reliable score section, I will explain why the accuracy score calculated here may vary.
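
If you also want to see per-class precision and recall, scikit-learn's classification_report provides a quick summary. This is an optional check rather than something we will rely on later:

from sklearn.metrics import classification_report

# Precision, recall, and F1 score per class, plus overall accuracy
print(classification_report(y_test, y_test_pred))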

Congratulations! You've just trained your first supervised learning algorithm. From now on, all the algorithms we are going to use in this book have a similar interface:

  • The fit() method takes the x and y parts of your training data.
  • The predict() method takes x only and returns a predicted y.

Which features were more important?

We may now ask ourselves, which features did the model find more useful in deciding the iris species? Luckily, DecisionTreeClassifier has an attribute called feature_importances_, which is populated after the classifier is fitted and scores how important each feature was to the model's decisions. In the following code snippet, we will create a DataFrame that puts the features' names and their importances together, and then sort the features by their importance:

pd.DataFrame(
    {
        'feature_names': iris.feature_names,
        'feature_importances': clf.feature_importances_
    }
).sort_values(
    'feature_importances', ascending=False
).set_index('feature_names')

This is the output we get:

As you will recall, when we printed the dataset's description, the petal length and width values correlated with the target more strongly than the sepal dimensions did. Their high feature importance scores here confirm what is stated in the description.
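
As a quick sanity check of our own, we can recompute that correlation on the training split only. The following is a rough sketch; corrwith() computes the Pearson correlation between each feature column and the target series:

# Correlation of each feature with the target, using the training data only
df_train[iris.feature_names].corrwith(df_train['target'])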

Displaying the internal tree decisions

We can also print the internal structure of the learned tree using the following code snippet:

from sklearn.tree import export_text
print(
    export_text(clf, feature_names=iris.feature_names, spacing=3, decimals=1)
)

This will print the following text:

|--- petal width (cm) <= 0.8
| |--- class: 0
|--- petal width (cm) > 0.8
| |--- petal width (cm) <= 1.8
| | |--- petal length (cm) <= 5.3
| | | |--- sepal length (cm) <= 5.0
| | | | |--- class: 2
| | | |--- sepal length (cm) > 5.0
| | | | |--- class: 1
| | |--- petal length (cm) > 5.3
| | | |--- class: 2
| |--- petal width (cm) > 1.8
| | |--- class: 2

If you print the complete dataset description, you will notice that toward the end, it says the following:

One class is linearly separable from the other two; the latter are NOT linearly separable from each other.

This means that one class is easier to separate from the other two, while the other two are harder to separate from each other. Now, look at the internal tree's structure. You may notice that in the first step, it decided that anything with a petal width below or equal to 0.8 belongs to class 0 (Setosa). Then, for petal widths above 0.8, the tree kept on branching, trying to differentiate between classes 1 and 2 (Versicolor and Virginica). Generally, the harder it is to separate classes, the deeper the branching goes.
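
One rough way to quantify how deep the branching went is to ask the fitted classifier itself. The following sketch assumes scikit-learn 0.21 or later, where get_depth() and get_n_leaves() are available:

# Depth and number of leaves of the fitted tree
print(clf.get_depth(), clf.get_n_leaves())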
