- Supervised Machine Learning with Python
- Taylor Smith
Logistic regression
In this section, we will learn about logistic regression, another classification model that we're going to build from scratch. We will fit it with the following code:
from packtml.regression import SimpleLogisticRegression

# simple logistic regression classifier
plot_learning_curve(
    SimpleLogisticRegression, metric=accuracy_score,
    X=X_train, y=y_train, n_folds=3, seed=21, trace=True,
    train_sizes=(np.linspace(.25, .8, 4) * X_train.shape[0]).astype(int),
    n_steps=250, learning_rate=0.0025, loglik_interval=100)\
    .show()
This trains much faster than the decision tree. In the following output, you can see that the score converges around the 92.5% range. This looks more consistent than our decision tree, but it still doesn't perform quite well enough on the validation set:

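As a reminder of what a from-scratch logistic regression optimizes under the hood, here is a minimal sketch trained by gradient ascent on the Bernoulli log-likelihood. The function names and the training loop here are illustrative assumptions, not the packtml API:

```python
import numpy as np

def sigmoid(z):
    # the logistic function maps real values into (0, 1)
    return 1. / (1. + np.exp(-z))

def fit_logistic(X, y, n_steps=250, learning_rate=0.0025):
    # gradient ascent on the log-likelihood of a Bernoulli model;
    # X is assumed to already contain a bias column
    theta = np.zeros(X.shape[1])
    for _ in range(n_steps):
        preds = sigmoid(X.dot(theta))
        # the gradient of the log-likelihood is X^T (y - p)
        theta += learning_rate * X.T.dot(y - preds)
    return theta
```

The `n_steps` and `learning_rate` defaults mirror the arguments passed to `plot_learning_curve` above.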
In the following screenshot, there are encoded records of spam emails. We will see how this encoding performs on an email that we can read and validate. If you visit the UCI link included at the top of the Jupyter Notebook, it provides a description of all the features inside the dataset. Many of the features count the ratio of a particular word (words such as "free" or "credit") to the total number of words in the email. A few other features count character frequencies, the number of exclamation points, and the lengths of runs of consecutive capital letters, so a heavily capitalized set of words is captured by these features:

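To make the word-ratio features concrete: a feature such as word_freq_free is just the percentage of tokens in the email that are the word "free". A quick illustration with a made-up snippet of text:

```python
# a made-up email fragment for illustration
email = "FREE credit report free for you"

# lowercase the tokens so "FREE" and "free" count as the same word
tokens = [t.lower() for t in email.split()]

# percentage of tokens equal to "free" (2 of 6 here)
word_freq_free = 100. * tokens.count("free") / len(tokens)
```

The same recipe is repeated for every word the dataset tracks.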
In the following code, we will create two emails. The first email is very obviously spam; even someone who received it would be unlikely to respond to it:
spam_email = """
Dear small business owner,
This email is to inform you that for $0 down, you can receive a
FREE CREDIT REPORT!!! Your money is important; PROTECT YOUR CREDIT and
reply direct to us for assistance!
"""
print(spam_email)
The output of the preceding code snippet is as follows:
Dear small business owner,
This email is to inform you that for $0 down, you can receive a
FREE CREDIT REPORT!!! Your money is important; PROTECT YOUR CREDIT and
reply direct to us for assistance!
The second email looks less like spam:

The models that we have just fit will look at both of the emails, encode their features, and classify which is, and which is not, spam.
The following function encodes those emails into the features we discussed. We'll use a Counter object from the collections module and tokenize our emails: we split each email into a list of words, and the words can in turn be split into a list of characters. Later, we'll count the characters and words so that we can generate our features:
from collections import Counter
import numpy as np

def encode_email(email):
    # tokenize the email
    tokens = email.split()

    # easiest way to count characters will be to join everything
    # up and split them into chars, then use a counter to count them
    # all ONE time.
    chars = list("".join(tokens))
    char_counts = Counter(chars)
    n_chars = len(chars)

    # we can do the same thing with "tokens" to get counts of words
    # (but we want them to be lowercase!)
    word_counts = Counter([t.lower() for t in tokens])

    # Of the names above, the ones that start with "word" are
    # percentages of frequencies of words. Let's get the words
    # in question ("names" holds the dataset's feature names)
    freq_words = [
        name.split("_")[-1]
        for name in names
        if name.startswith("word")
    ]

    # compile the first 48 values using the words in question
    word_freq_encodings = [100. * (word_counts.get(t, 0) / len(tokens))
                           for t in freq_words]
So, the feature names we loaded at the beginning tell us which words we're interested in counting. We can see that the original dataset counts words such as address, email, business, and credit; for characters, we're looking for open and closed parentheses and dollar signs (which are quite relevant to our spam emails). So, we're going to count all of those, as follows:

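The character-frequency features can be built the same way as the word frequencies. Here is a hedged sketch of that step, assuming the six special characters tracked by the UCI spambase dataset; this is an illustrative reconstruction, not necessarily the book's exact code:

```python
from collections import Counter

# a made-up email fragment for illustration
email = "PROTECT YOUR CREDIT for $0 down!!!"
chars = list("".join(email.split()))
char_counts = Counter(chars)
n_chars = len(chars)

# percentage frequency of each special character the dataset tracks
freq_chars = [";", "(", "[", "!", "$", "#"]
char_freq_encodings = [100. * char_counts.get(c, 0) / n_chars
                       for c in freq_chars]
```

Each value is the percentage of all characters in the email that are that particular symbol.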
Apply the ratio and keep track of the lengths of the capital runs, computing the mean, maximum, and sum:
    # make a np array to compute the next few stats quickly
    capital_runs = np.asarray(capital_runs)
    capital_stats = [capital_runs.mean(),
                     capital_runs.max(),
                     capital_runs.sum()]
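The capital_runs list used above is collected earlier in the function. One way to gather the lengths of maximal runs of consecutive capital letters is with a regular expression; this is a hedged reconstruction, not necessarily the book's implementation:

```python
import re
import numpy as np

def capital_run_lengths(email):
    # length of each maximal run of consecutive capital letters
    return [len(run) for run in re.findall(r"[A-Z]+", email)]

# "FREE"=4, "CREDIT"=6, "REPORT"=6, and the "P" in "Protect"=1
runs = capital_run_lengths("FREE CREDIT REPORT!!! Protect it")
capital_runs = np.asarray(runs)
```

The mean, max, and sum of this array give the last three spambase-style features.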
When we run the preceding code, we get the following output, which encodes our emails. Each encoding is simply a vector with one element per feature in the dataset:
# get the email vectors
fake_email = encode_email(spam_email)
real_email = encode_email(not_spam)
# this is what they look like:
print("Spam email:")
print(fake_email)
print("\nReal email:")
print(real_email)
The output of the preceding code is as follows:

When we feed the preceding vectors into our models, we will see whether our models are any good. Ideally, the actual fake email will be predicted as fake, and the actual real email will be predicted as real. Indeed, both the decision tree and the logistic regression predict the spam email as spam. Our true email is predicted as not spam, which is perhaps even more important, because we don't want to filter real email into the spam folder. So, we have fitted some pretty good models that agree with what we would visually inspect as spam or not:
predict = (lambda rec, mod: "SPAM!" if mod.predict([rec])[0] == 1 else "Not spam")
print("Decision tree predictions:")
print("Spam email prediction: %r" % predict(fake_email, decision_tree))
print("Real email prediction: %r" % predict(real_email, decision_tree))
print("\nLogistic regression predictions:")
print("Spam email prediction: %r" % predict(fake_email, logistic_regression))
print("Real email prediction: %r" % predict(real_email, logistic_regression))
The output of the preceding code is as follows:

This is a demo of the actual algorithms that we're going to build from scratch in this book, applied to a real-world problem.