
How to do it...

The code for this recipe is available at https://github.com/PacktPublishing/Machine-Learning-for-Cybersecurity-Cookbook/blob/master/Chapter02/Classifying%20Files%20by%20Type/File%20Type%20Classifier.ipynb. Using this data, we build a classifier that labels each file as JavaScript, Python, or PowerShell:

  1. Begin by importing the necessary libraries and specifying the paths of the samples we will use for training and testing:
import os
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.pipeline import Pipeline

javascript_path = "/path/to/JavascriptSamples/"
python_path = "/path/to/PythonSamples/"
powershell_path = "/path/to/PowerShellSamples/"
  1. Next, we read in every file of each type. We also build a list of labels, with -1, 0, and 1 representing JavaScript, Python, and PowerShell scripts, respectively:
corpus = []
labels = []
file_types_and_labels = [(javascript_path, -1), (python_path, 0), (powershell_path, 1)]
for files_path, label in file_types_and_labels:
    files = os.listdir(files_path)
    for file in files:
        file_path = os.path.join(files_path, file)
        try:
            with open(file_path, "r") as myfile:
                data = myfile.read().replace("\n", "")
        except (OSError, UnicodeDecodeError):
            # Skip unreadable files instead of re-appending the previous file's contents
            continue
        corpus.append(data)
        labels.append(label)

  1. We then create a train-test split and a pipeline that extracts basic NLP features from the files (hashed counts of 1- to 3-grams, reweighted by TF-IDF) and feeds them to a random forest classifier:
X_train, X_test, y_train, y_test = train_test_split(
    corpus, labels, test_size=0.33, random_state=11
)
text_clf = Pipeline(
    [
        ("vect", HashingVectorizer(input="content", ngram_range=(1, 3))),
        ("tfidf", TfidfTransformer(use_idf=True)),
        ("rf", RandomForestClassifier(class_weight="balanced")),
    ]
)
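
To see what the first two pipeline stages hand to the random forest, the following minimal sketch runs them on their own. The three one-line snippets in `toy_corpus` are made up for illustration and are not part of the recipe's dataset; the vectorizer settings mirror the pipeline above.

```python
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer

# Made-up one-line snippets standing in for real script files (assumption)
toy_corpus = [
    "function add(a, b) { return a + b; }",
    "def add(a, b): return a + b",
    "function Add-Numbers { param($a, $b) return $a + $b }",
]

# HashingVectorizer is stateless, so transform() can be called without fit()
vect = HashingVectorizer(input="content", ngram_range=(1, 3))
hashed = vect.transform(toy_corpus)

# TfidfTransformer learns document frequencies from the hashed matrix
weighted = TfidfTransformer(use_idf=True).fit_transform(hashed)

print(hashed.shape)   # (3, 1048576): one row per file, 2**20 hashed columns by default
print(weighted.nnz)   # number of non-zero n-gram features across the toy files
```

Because the n-grams are hashed into a fixed number of columns, no vocabulary has to be kept in memory, which is convenient when the corpus contains many distinct tokens, as source code typically does.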
  1. We fit the pipeline to the training data, and then use it to predict on the testing data. Finally, we print out the accuracy and the confusion matrix:
text_clf.fit(X_train, y_train)
y_test_pred = text_clf.predict(X_test)
print(accuracy_score(y_test, y_test_pred))
print(confusion_matrix(y_test, y_test_pred))
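
When reading the confusion matrix, it helps to fix the label order explicitly. The sketch below uses small made-up label lists (not the recipe's actual predictions) to show how `accuracy_score` and `confusion_matrix` relate; rows are true classes and columns are predicted classes, in the order given by `labels`.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels: -1 JavaScript, 0 Python, 1 PowerShell
toy_true = [-1, -1, 0, 0, 1, 1]
toy_pred = [-1, 0, 0, 0, 1, -1]

acc = accuracy_score(toy_true, toy_pred)
cm = confusion_matrix(toy_true, toy_pred, labels=[-1, 0, 1])

print(acc)  # 4 of 6 correct, about 0.667
print(cm)
# [[1 1 0]    one JavaScript file correct, one mistaken for Python
#  [0 2 0]    both Python files correct
#  [1 0 1]]   one PowerShell file mistaken for JavaScript, one correct
```

The diagonal holds the correctly classified files, so the accuracy equals the sum of the diagonal divided by the total number of test files.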

This results in the following output:
