
Using logistic regression for classification

"You can tell whether a man is clever by his answers. You can tell whether a man is wise by his questions."
Naguib Mahfouz

One day, when applying for a job, an interviewer asks: So tell me, is logistic regression a classification or a regression algorithm? The short answer to this is that it is a classification algorithm, but a longer and more interesting answer requires a good understanding of the logistic function. Then, the question may end up having a different meaning altogether.

Understanding the logistic function

The logistic function is a member of the sigmoid (s-shaped) functions, and it is represented by the following formula:

y = 1 / (1 + e^(-theta))
Don't let this equation scare you. What actually matters is how this function looks visually. Luckily, we can use our computer to generate a bunch of values for theta—for example, between -10 and 10. Then, we can plug these values into the formula and plot the resulting y values versus the theta values, as we have done in the following code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, 1, figsize=(16, 8))

theta = np.arange(-10, 10, 0.05)
y = 1 / (1 + np.exp(-1 * theta))

pd.DataFrame(
    {
        'theta': theta,
        'y': y
    }
).plot(
    title='Logistic Function',
    kind='scatter', x='theta', y='y',
    ax=ax
)

fig.show()

Running this code gives us the following graph:

Two key characteristics to notice in the logistic function are as follows:

  • y only goes between 0 and 1. It approaches 1 as theta approaches infinity, and approaches 0 as theta approaches negative infinity.
  • y takes the value of 0.5 when theta is 0.
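
To verify these two points quickly, we can reuse the same NumPy expression on a few representative values of theta; this is just a throwaway sanity check:

import numpy as np

for theta in (-10, 0, 10):
    y = 1 / (1 + np.exp(-theta))
    # Expect y close to 0 for theta=-10, exactly 0.5 for theta=0, and close to 1 for theta=10
    print(f'theta = {theta:>3} -> y = {y:.4f}')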

Plugging the logistic function into a linear model

"Probability is not a mere computation of odds on the dice or more complicated variants; it is the acceptance of the lack of certainty in our knowledge and the development of methods for dealing with our ignorance."
– Nassim Nicholas Taleb

For a linear model with a couple of features, x1 and x2, we can have an intercept and two coefficients. Let's call them β0, β1, and β2. Then, the linear regression equation will be as follows:

y = β0 + β1 * x1 + β2 * x2

Separately, we can also plug the right-hand side of the preceding equation into the logistic function in place of theta. This will give the following equation for y:

y = 1 / (1 + e^-(β0 + β1 * x1 + β2 * x2))

In this case, the variation in the values of x will move y between 0 and 1. Higher values for the products of x and its coefficients will move y closer to 1, and lower values will move it toward 0. We also know that probabilities take values between 0 and 1. So, it makes sense to interpret y as the probability of the sample belonging to a certain class, given the values of x. If we don't want to deal with probabilities, we can just specify a threshold, say y >= 0.5; then, our sample belongs to class 1 if y meets the threshold, and it belongs to class 0 otherwise.
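
To make this concrete, here is a minimal sketch of the idea; the coefficient values (β0, β1, and β2 from the preceding equations) are made up for illustration, since in practice they are learned from the data:

import numpy as np

# Made-up coefficients for illustration; real values are learned during training
beta0, beta1, beta2 = -1.0, 0.8, 0.5

def predict_proba(x1, x2):
    # Plug the linear combination into the logistic function
    return 1 / (1 + np.exp(-(beta0 + beta1 * x1 + beta2 * x2)))

def predict(x1, x2, threshold=0.5):
    # Class 1 if the estimated probability reaches the threshold, class 0 otherwise
    return int(predict_proba(x1, x2) >= threshold)

print(predict_proba(2.0, 1.0))  # roughly 0.75
print(predict(2.0, 1.0))        # 1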

This was a brief look at how logistic regression works. It is a classifier, yet it is called regression since it's basically a regressor returning a value between 0 and 1, which we interpret as probabilities.

To train the logistic regression model, we need an objective function, as well as a solver that tries to find the optimal coefficients to minimize this function. In the following sections, we will go through all of these in more detail.

Objective function

During the training phase, the algorithm loops through the data trying to find the coefficients that minimize a predefined objective (loss) function. The loss function we try to minimize in the case of logistic regression is called log loss. It measures how far the predicted probabilities (p) are from the actual class labels (y) using the following formula:

-log(p) if y == 1 else -log(1 - p)

Mathematicians use a rather ugly way to express this formula due to their lack of if-else conditions. So, I chose to display the Python form here for its clarity. Jokes aside, the mathematical formula will turn out to be beautiful once you know its information theory roots, but that's not something we'll look at now.
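
If you want to see how this penalty behaves, the following minimal sketch evaluates it for a few predictions; for a whole dataset, scikit-learn's sklearn.metrics.log_loss returns the same quantity averaged over all samples:

import numpy as np

def sample_log_loss(y, p):
    # The penalty grows quickly as a confident prediction turns out to be wrong
    return -np.log(p) if y == 1 else -np.log(1 - p)

print(sample_log_loss(1, 0.9))  # ~0.11, confident and correct
print(sample_log_loss(1, 0.1))  # ~2.30, confident and wrong
print(sample_log_loss(0, 0.1))  # ~0.11, confident and correct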

Regularization

Furthermore, scikit-learn's implementation of logistic regression algorithms uses regularization by default. Out of the box, it uses L2 regularization (as in the ridge regressor), but it can also use L1 (as in lasso) or a mixture of L1 and L2 (as in elastic-net).
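
Here is a minimal sketch of what these three flavors look like when instantiating scikit-learn's LogisticRegression; note that L1 and elastic-net require specific solvers (more on solvers shortly), and the l1_ratio value below is arbitrary:

from sklearn.linear_model import LogisticRegression

# L2 regularization is the default, as in the ridge regressor
clf_l2 = LogisticRegression(penalty='l2')

# L1 regularization, as in lasso, needs a compatible solver such as saga
clf_l1 = LogisticRegression(penalty='l1', solver='saga')

# Elastic-net mixes L1 and L2; only saga supports it and l1_ratio must be set
clf_en = LogisticRegression(penalty='elasticnet', solver='saga', l1_ratio=0.5)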

Solvers

Finally, how do we find the optimal coefficients to minimize our loss function? A naive approach would be to try all the possible combinations of the coefficients until the minimal loss is found. Nevertheless, since an exhaustive search is not feasible given the infinite combinations, solvers are there to efficiently search for the best coefficients. scikit-learn implements about half a dozen solvers.

The choice of solver, along with the regularization method used, are the two main decisions to take when configuring the logistic regression algorithm. In the next section, we are going to see how and when to pick each one.

Configuring the logistic regression classifier

Before talking about solvers, let's go through some of the common hyperparameters used:

  • fit_intercept: Usually, in addition to the coefficient for each feature, there is a constant intercept in your equation. Nevertheless, there are cases where you might not need an intercept—for example, if you know for sure that the value of y is supposed to be 0.5 when all the values of x are 0. One other case is when your data already has an additional constant column with all values set to 1. This usually occurs if your data has been processed in an earlier stage, as in the case of the polynomial processor. The coefficient for the constant column will be interpreted as the intercept in this case. The same configuration exists for the linear regression algorithms explained earlier.
  • max_iter: For the solver to find the optimum coefficients, it loops over the training data more than once. These iterations are also called epochs. You usually set a limit on the number of iterations to prevent overfitting. The same hyperparameter is used by the lasso and ridge regressors explained earlier.
  • tol: This is another way to stop the solver from iterating too much. If you set this to a high value, it means that only high improvements between one iteration and the next are tolerated; otherwise, the solver will stop. Conversely, a lower value will keep the solver going for more iterations until it reaches max_iter.
  • penalty: This picks the regularization technique to be used. This can be either L1, L2, elastic-net, or none for no regularization. Regularization helps to prevent overfitting, so it is important to use it when you have a lot of features. It also mitigates the overfitting effect when max_iter and tol are set to high values.
  • C or alpha: These are parameters for setting how strong you want the regularization to be. Since we are going to use two different implementations of the logistic regression algorithm here, it is important to know that each of these two implementations uses a different parameter (C versus alpha). alpha is basically the inverse of C (alpha = 1/C). This means that smaller values for C specify stronger regularization, while for alpha, larger values are needed for stronger regularization (see the sketch after this list).
  • l1_ratio: When using a mixture of L1 and L2, as in elastic-net, this fraction specifies how much weight to give to L1 versus L2.
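
To tie these hyperparameters together, the following minimal sketch shows how they might be set on the two implementations we will rely on; the values are purely illustrative rather than recommended:

from sklearn.linear_model import LogisticRegression, SGDClassifier

# LogisticRegression controls regularization strength via C; smaller C means stronger regularization
clf = LogisticRegression(
    fit_intercept=True,  # set to False if your data already carries a constant column
    max_iter=1000,       # cap on the solver's iterations (epochs)
    tol=1e-4,            # stop once the improvement between iterations drops below this
    C=1.0,
)

# SGDClassifier uses alpha instead; larger alpha means stronger regularization
# loss='log' makes it a logistic regression ('log_loss' in newer scikit-learn releases)
clf_sgd = SGDClassifier(loss='log', alpha=0.0001, max_iter=1000, tol=1e-4)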

The following are some of the solvers we can use:

  • liblinear: This solver is implemented in LogisticRegression and is recommended for smaller datasets. It supports L1 and L2 regularization, but you cannot use it if you want to use elastic-net, nor if you do not want to use regularization at all.
  • sag or saga: These solvers are implemented in LogisticRegression and RidgeClassifier. They are faster for larger datasets. However, you need to scale your features for them to converge. We used MinMaxScaler earlier in this chapter to scale our features. Now, it is not only needed for more meaningful coefficients, but also for the solver to find a solution earlier. saga supports all four penalty options.
  • lbfgs: This solver is implemented in LogisticRegression. It supports the L2 penalty or no regularization at all.
  • Stochastic Gradient Descent (SGD): There are dedicated implementations for SGD, namely SGDClassifier and SGDRegressor. This is different from LogisticRegression, where the focus is on performing logistic regression by optimizing one loss function, log loss. The focus of SGDClassifier is on the SGD solver itself, which means that the same classifier allows different loss functions to be used. If loss is set to log, then it is a logistic regression model. However, setting loss to hinge or perceptron turns it into a Support Vector Machine (SVM) or a perceptron, respectively. These are two other linear classifiers, as shown in the sketch after the following note.
Gradient descent is an optimization algorithm that aims to find a local minimum in a function by iteratively moving in the direction of steepest descent. The direction of the steepest descent is found using calculus, hence the term gradient. If you imagine the objective (loss) function as a curve, the gradient descent algorithm blindly lands on a random point on this curve and uses the gradient at the point it is on as a guiding stick to move to a local minimum step by step. Usually, the loss function is chosen to be a convex one so that its local minimum is also its global one. In the stochastic version of gradient descent, rather than calculating the gradient for the entire training data, the estimator's weights are updated with each training sample. Gradient descent is covered in more detail in Chapter 7, Neural Networks – Here Comes the Deep Learning.
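
Building on the SGD item above, here is a minimal sketch of how swapping the loss changes the kind of linear classifier you get; note that newer scikit-learn releases call the logistic loss 'log_loss' instead of 'log':

from sklearn.linear_model import SGDClassifier

# The same SGD-based estimator becomes three different linear classifiers
clf_logistic = SGDClassifier(loss='log')           # logistic regression
clf_svm = SGDClassifier(loss='hinge')              # linear Support Vector Machine
clf_perceptron = SGDClassifier(loss='perceptron')  # perceptron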

Classifying the Iris dataset using logistic regression

We will load the Iris dataset into a data frame. The following is a similar block of code to the one used in Chapter 2, Making Decisions with Trees, to load the dataset:

from sklearn import datasets
iris = datasets.load_iris()

df = pd.DataFrame(
    iris.data,
    columns=iris.feature_names
)

df['target'] = pd.Series(
    iris.target
)

Then, we will use cross_validate to evaluate the accuracy of the LogisticRegression algorithm using six-fold cross-validation, as follows:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

num_folds = 6

clf = LogisticRegression(
    solver='lbfgs', multi_class='multinomial', max_iter=1000
)
accuracy_scores = cross_validate(
    clf, df[iris.feature_names], df['target'],
    cv=num_folds, scoring=['accuracy']
)

accuracy_mean = pd.Series(accuracy_scores['test_accuracy']).mean()
accuracy_std = pd.Series(accuracy_scores['test_accuracy']).std()
accuracy_sterror = accuracy_std / np.sqrt(num_folds)

print(
    'Logistic Regression: Accuracy ({}-fold): {:.2f} ~ {:.2f}'.format(
        num_folds,
        (accuracy_mean - 1.96 * accuracy_sterror),
        (accuracy_mean + 1.96 * accuracy_sterror),
    )
)

Running the preceding code will give us a set of accuracy scores with a 95% confidence interval that ranges between 0.95 and 1.00. Running the same code for the decision tree classifier gives us a confidence interval that ranges between 0.93 and 0.99.

Since we have three classes here, the coefficients calculated for each class boundary are separate from the others. After we train the logistic regression algorithm once more without the cross_validate wrapper, we can access the coefficients via coef_. We can also access the intercepts via intercept_.

In the next code snippet, I will be using a dictionary comprehension. In Python, one way to create the [0, 1, 2, 3] list is by using the [i for i in range(4)] list comprehension. This basically executes the loop to populate the list. Similarly, the ['x' for i in range(4)] list comprehension will create the ['x', 'x', 'x', 'x'] list. Dictionary comprehension works in the same fashion. For example, the {str(i): i for i in range(4)} line of code will create the {'0': 0, '1': 1, '2': 2, '3': 3} dictionary.
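
Here are those comprehensions as runnable one-liners, in case you want to try them out first:

print([i for i in range(4)])          # [0, 1, 2, 3]
print(['x' for i in range(4)])        # ['x', 'x', 'x', 'x']
print({str(i): i for i in range(4)})  # {'0': 0, '1': 1, '2': 2, '3': 3}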

The following code puts the coefficients into a data frame. It basically creates a dictionary whose keys are the class IDs and maps each ID to a list of its corresponding coefficients. Once the dictionary is created, we convert it into a data frame and add the intercepts to the data frame before displaying it:

# We need to fit the model again before getting its coefficients
clf.fit(df[iris.feature_names], df['target'])

# We use dictionary comprehension instead of a for-loop
df_coef = pd.DataFrame(
    {
        f'Coef [Class {class_id}]': clf.coef_[class_id]
        for class_id in range(clf.coef_.shape[0])
    },
    index=iris.feature_names
)
df_coef.loc['intercept', :] = clf.intercept_

Don't forget to scale your features before training. Then, you should get a coefficient data frame that looks like this:

The table in the preceding screenshot shows the following:

  • From the first row, we can tell that the increase in sepal length is correlated with classes 1 and 2 more than the remaining class, based on the positive sign of class 1's and class 2's coefficients.
  • Having a linear model here means that the class boundaries will not be limited to horizontal and vertical lines, as in the case of decision trees, but they will take linear forms.

To better understand this, in the next section, we will draw the logistic regression classifier's decision boundaries and compare them to those of decision trees.

Understanding the classifier's decision boundaries

By seeing the decision boundaries visually, we can understand why the model makes certain decisions. Here are the steps for plotting those boundaries:

  1. We start by creating a function that takes the classifier's object and data samples and then plots the decision boundaries for that particular classifier and data:
def plot_decision_boundary(clf, x, y, ax, title):

    cmap = 'Paired_r'
    feature_names = x.columns
    x, y = x.values, y.values

    x_min, x_max = x[:, 0].min(), x[:, 0].max()
    y_min, y_max = x[:, 1].min(), x[:, 1].max()

    step = 0.02

    xx, yy = np.meshgrid(
        np.arange(x_min, x_max, step),
        np.arange(y_min, y_max, step)
    )
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    ax.contourf(xx, yy, Z, cmap=cmap, alpha=0.25)
    ax.contour(xx, yy, Z, colors='k', linewidths=0.7)
    ax.scatter(x[:, 0], x[:, 1], c=y, edgecolors='k')
    ax.set_title(title)
    ax.set_xlabel(feature_names[0])
    ax.set_ylabel(feature_names[1])
  2. Then, we split our data into training and test sets:
from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(df, test_size=0.3, random_state=22)
  3. To be able to visualize things easily, we are going to use two features. In the following code, we will train a logistic regression model and a decision tree model, and then compare their decision boundaries when trained on the same data:
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

fig, axs = plt.subplots(1, 2, figsize=(12, 6))

two_features = ['petal width (cm)', 'petal length (cm)']

clf_lr = LogisticRegression()
clf_lr.fit(df_train[two_features], df_train['target'])
accuracy = accuracy_score(
    df_test['target'],
    clf_lr.predict(df_test[two_features])
)
plot_decision_boundary(
    clf_lr, df_test[two_features], df_test['target'], ax=axs[0],
    title=f'Logistic Regression Classifier\nAccuracy: {accuracy:.2%}'
)

clf_dt = DecisionTreeClassifier(max_depth=3)
clf_dt.fit(df_train[two_features], df_train['target'])
accuracy = accuracy_score(
    df_test['target'],
    clf_dt.predict(df_test[two_features])
)
plot_decision_boundary(
    clf_dt, df_test[two_features], df_test['target'], ax=axs[1],
    title=f'Decision Tree Classifier\nAccuracy: {accuracy:.2%}'
)

fig.show()

Running this code will give us the following graphs:

In the preceding graph, the following is observed:

  • The logistic regression model did not perform well this time when only two features were used. Nevertheless, what we care about here is the shape of the boundaries.
  • It's clear that the boundaries on the left are not horizontal and vertical lines, as they are on the right. While the ones on the right can be composed of multiple line segments, the ones on the left can only be made of continuous straight lines.