
Phishing detection with logistic regression

In this section, we are going to build a phishing detector from scratch with a logistic regression algorithm. Logistic regression is a well-known statistical technique for binary classification, that is, for predicting which of two classes a sample belongs to.
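To give an intuition for how a two-class prediction comes out of logistic regression, here is a minimal sketch in plain Python. It assumes the usual formulation: a weighted sum of the features is passed through the sigmoid function, and the resulting probability is compared with a threshold of 0.5. The names sigmoid, predict_class, toy_weights, and toy_bias are made up for this illustration, and the weights are not learned from the phishing dataset.

```python
import math

def sigmoid(z):
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(weights, bias, features, threshold=0.5):
    """Return 1 if the estimated probability reaches the threshold, else 0."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if sigmoid(score) >= threshold else 0

# Hypothetical weights for a two-feature toy example (not learned from data).
toy_weights = [2.0, -1.0]
toy_bias = 0.0

print(sigmoid(0.0))                                  # 0.5
print(predict_class(toy_weights, toy_bias, [1, 0]))  # 1
print(predict_class(toy_weights, toy_bias, [0, 1]))  # 0
```

Training a logistic regression model amounts to finding the weights and bias that make these probabilities match the labeled examples as closely as possible, which is the part scikit-learn will handle for us.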

As in every machine learning project, we will need data to feed our machine learning model. For our model, we are going to use the Phishing Websites Data Set from the UCI Machine Learning Repository. You can check it out at https://archive.ics.uci.edu/ml/datasets/Phishing+Websites.

The dataset is provided as an arff file:

The following is a snapshot from the dataset:

For better manipulation, we have organized the dataset into a csv file:
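One way such a conversion could be done is sketched below using only the standard library. It assumes a simple ARFF layout in which header lines start with @, comment lines start with %, and every line after @data is a comma-separated row. The function name arff_to_csv and the three-attribute sample are made up for this illustration; the real file has 31 attributes.

```python
import csv
import io

def arff_to_csv(arff_lines, csv_file):
    """Copy the data rows of a simple ARFF source into csv_file.

    Returns the number of rows written.
    """
    writer = csv.writer(csv_file)
    in_data = False
    rows = 0
    for raw in arff_lines:
        line = raw.strip()
        if not line or line.startswith('%'):   # skip blanks and ARFF comments
            continue
        if not in_data:
            # Header section: ignore everything up to and including @data.
            in_data = line.lower().startswith('@data')
            continue
        writer.writerow([value.strip() for value in line.split(',')])
        rows += 1
    return rows

# Tiny made-up sample with three attributes instead of the real 31.
sample = """@relation phishing
@attribute having_IP_Address {-1,1}
@attribute URL_Length {1,0,-1}
@attribute Result {-1,1}
@data
-1,1,-1
1,0,1
"""

out = io.StringIO()
count = arff_to_csv(sample.splitlines(), out)
print(count)           # 2
print(out.getvalue())
```

With real files, the same function could be called with an open ARFF file as the first argument and a file opened with open('dataset.csv', 'w', newline='') as the second.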

As you probably noticed from the attributes, each line of the dataset is represented in the following format: {30 Attributes (having_IP_Address, URL_Length, abnormal_URL, and so on)} + {1 Attribute (Result)}.

For our model, we are going to import two machine learning libraries, NumPy and scikit-learn, which we already installed in Chapter 1, Introduction to Machine Learning in Pentesting.

Let's open the Python environment and load the required libraries:

>>> import numpy as np
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.metrics import accuracy_score

Next, load the data:

training_data = np.genfromtxt('dataset.csv', delimiter=',', dtype=np.int32)

Identify the inputs (all of the attributes, except for the last one) and the outputs (the last attribute):

>>> inputs = training_data[:,:-1]
>>> outputs = training_data[:, -1]
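To see what these two slices do, here is a small sketch on a made-up array with three feature columns and one label column (the real dataset has 30 feature columns). The names toy, toy_inputs, and toy_outputs exist only for this illustration.

```python
import numpy as np

# Made-up 3x4 array: three feature columns followed by one label column.
toy = np.array([[ 1, -1,  0,  1],
                [-1,  1,  1, -1],
                [ 1,  1, -1,  1]], dtype=np.int32)

toy_inputs = toy[:, :-1]   # every row, every column except the last
toy_outputs = toy[:, -1]   # every row, last column only

print(toy_inputs.shape)      # (3, 3)
print(toy_outputs.tolist())  # [1, -1, 1]
```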

In the previous chapter, we discussed how we need to divide the dataset into training data and testing data. Here, the first 2,000 rows are used for training and the remaining rows for testing:

training_inputs = inputs[:2000]
training_outputs = outputs[:2000]
testing_inputs = inputs[2000:]
testing_outputs = outputs[2000:]
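A fixed slice like this keeps whatever row order the file happens to have. If that order were not random, a shuffled split might give a more representative sample. The following is only a sketch of that alternative, not the split used in this section; the helper name shuffled_split, its seed parameter, and the demo arrays are made up for this illustration.

```python
import numpy as np

def shuffled_split(inputs, outputs, n_train, seed=0):
    """Shuffle row indices, then split inputs and outputs with the same order."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(inputs))
    train_idx, test_idx = order[:n_train], order[n_train:]
    return (inputs[train_idx], outputs[train_idx],
            inputs[test_idx], outputs[test_idx])

# Made-up data: 10 rows with 2 features each; row i has label i.
demo_inputs = np.arange(20).reshape(10, 2)
demo_outputs = np.arange(10)

tr_in, tr_out, te_in, te_out = shuffled_split(demo_inputs, demo_outputs, n_train=7)
print(tr_in.shape, te_in.shape)   # (7, 2) (3, 2)
```

Because the same index order is applied to both arrays, each input row stays paired with its own label after shuffling.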

Create the scikit-learn logistic regression classifier:

classifier = LogisticRegression()

Train the classifier:

classifier.fit(training_inputs, training_outputs)

Make predictions:

predictions = classifier.predict(testing_inputs)

Let's print out the accuracy of our phishing detector model:

accuracy = 100.0 * accuracy_score(testing_outputs, predictions)
print("The accuracy of your Logistic Regression on testing data is: " + str(accuracy))
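Accuracy is the share of all test samples, phishing and legitimate alike, that were classified correctly, so it is not the same thing as the share of phishing URLs that were caught. The sketch below separates the two on made-up labels. It assumes that -1 marks a phishing URL and 1 a legitimate one in the Result column; the helper names accuracy and recall and the toy lists are made up for this illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of all samples whose predicted label matches the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def recall(y_true, y_pred, positive):
    """Fraction of samples of class `positive` that were predicted as such."""
    preds_for_positive = [p for t, p in zip(y_true, y_pred) if t == positive]
    hits = sum(1 for p in preds_for_positive if p == positive)
    return hits / len(preds_for_positive)

# Assumed label convention: -1 = phishing, 1 = legitimate.
PHISHING, LEGITIMATE = -1, 1

# Made-up example: 4 phishing and 6 legitimate URLs.
toy_true = [-1, -1, -1, -1, 1, 1, 1, 1, 1, 1]
toy_pred = [-1, -1,  1,  1, 1, 1, 1, 1, 1, 1]

print(accuracy(toy_true, toy_pred))          # 0.8
print(recall(toy_true, toy_pred, PHISHING))  # 0.5
```

In this toy case, 80% of the predictions are correct even though only half of the phishing URLs are detected, which is why it can be useful to look at per-class figures alongside the overall accuracy.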

The accuracy of our model is approximately 85%. This is a good accuracy: the model classified about 85 out of every 100 test URLs correctly. But let's try to make an even better model with decision trees, using the same data.
