官术网_书友最值得收藏!

How to do it...

In the next steps, we demonstrate how to apply the Isolation Forest algorithm to detecting anomalies:

  1. Import the required libraries and set a random seed:
import numpy as np
import pandas as pd

random_seed = np.random.RandomState(12)
  1. Generate a set of normal observations, to be used as training data:
X_train = 0.5 * random_seed.randn(500, 2)
X_train = np.r_[X_train + 3, X_train]
X_train = pd.DataFrame(X_train, columns=["x", "y"])
  1. Generate a testing set, also consisting of normal observations:
X_test = 0.5 * random_seed.randn(500, 2)
X_test = np.r_[X_test + 3, X_test]
X_test = pd.DataFrame(X_test, columns=["x", "y"])

  1. Generate a set of outlier observations. These are generated from a different distribution than the normal observations:
X_outliers = random_seed.uniform(low=-5, high=5, size=(50, 2))
X_outliers = pd.DataFrame(X_outliers, columns=["x", "y"])
  1. Let's take a look at the data we have generated:
%matplotlib inline
import matplotlib.pyplot as plt

p1 = plt.scatter(X_train.x, X_train.y, c="white", s=50, edgecolor="black")
p2 = plt.scatter(X_test.x, X_test.y, c="green", s=50, edgecolor="black")
p3 = plt.scatter(X_outliers.x, X_outliers.y, c="blue", s=50, edgecolor="black")
plt.xlim((-6, 6))
plt.ylim((-6, 6))
plt.legend(
[p1, p2, p3],
["training set", "normal testing set", "anomalous testing set"],
loc="lower right",
)

plt.show()

The following screenshot shows the output:

 

  1. Now train an Isolation Forest model on our training data:
from sklearn.ensemble import IsolationForest

clf = IsolationForest()
clf.fit(X_train)
y_pred_train = clf.predict(X_train)
y_pred_test = clf.predict(X_test)
y_pred_outliers = clf.predict(X_outliers)
  1. Let's see how the algorithm performs. Append the labels to X_outliers:
X_outliers = X_outliers.assign(pred=y_pred_outliers)
X_outliers.head()

The following is the output:

  1. Let's plot the Isolation Forest predictions on the outliers to see how many it caught:
p1 = plt.scatter(X_train.x, X_train.y, c="white", s=50, edgecolor="black")
p2 = plt.scatter(
X_outliers.loc[X_outliers.pred == -1, ["x"]],
X_outliers.loc[X_outliers.pred == -1, ["y"]],
c="blue",
s=50,
edgecolor="black",
)
p3 = plt.scatter(
X_outliers.loc[X_outliers.pred == 1, ["x"]],
X_outliers.loc[X_outliers.pred == 1, ["y"]],
c="red",
s=50,
edgecolor="black",
)

plt.xlim((-6, 6))
plt.ylim((-6, 6))
plt.legend(
[p1, p2, p3],
["training observations", "detected outliers", "incorrectly labeled outliers"],
loc="lower right",
)

plt.show()

The following screenshot shows the output:

  1. Now let's see how it performed on the normal testing data. Append the predicted label to X_test:
X_test = X_test.assign(pred=y_pred_test)
X_test.head()

The following is the output:

  1. Now let's plot the results to see whether our classifier labeled the normal testing data correctly:
p1 = plt.scatter(X_train.x, X_train.y, c="white", s=50, edgecolor="black")
p2 = plt.scatter(
X_test.loc[X_test.pred == 1, ["x"]],
X_test.loc[X_test.pred == 1, ["y"]],
c="blue",
s=50,
edgecolor="black",
)
p3 = plt.scatter(
X_test.loc[X_test.pred == -1, ["x"]],
X_test.loc[X_test.pred == -1, ["y"]],
c="red",
s=50,
edgecolor="black",
)

plt.xlim((-6, 6))
plt.ylim((-6, 6))
plt.legend(
[p1, p2, p3],
[
"training observations",
"correctly labeled test observations",
"incorrectly labeled test observations",
],
loc="lower right",
)

plt.show()

The following screenshot shows the output:

Evidently, our Isolation Forest model performed quite well at capturing the anomalous points. There were quite a few false negatives (instances where normal points were classified as outliers), but by tuning our model's parameters, we may be able to reduce these.

主站蜘蛛池模板: 漳浦县| 巧家县| 潼南县| 台湾省| 子洲县| 永安市| 天等县| 大田县| 巴中市| 含山县| 商城县| 怀化市| 阿合奇县| 汤阴县| 军事| 太仆寺旗| 靖宇县| 汤原县| 鸡泽县| 丹凤县| 梨树县| 平潭县| 藁城市| 仁怀市| 海丰县| 潜山县| 偏关县| 阿巴嘎旗| 云南省| 沅江市| 安庆市| 余庆县| 方正县| 蒙城县| 堆龙德庆县| 衡阳县| 乌拉特前旗| 聂荣县| 饶河县| 宁乡县| 外汇|