Begin by importing the necessary libraries and specifying the paths of the samples we will be using to train and test:
import os
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.pipeline import Pipeline

# Placeholder paths; point these at your own sample directories:
javascript_path = "./samples/javascript"
python_path = "./samples/python"
powershell_path = "./samples/powershell"
Next, we read all of the files into a corpus. We also create an array of labels, with -1, 0, and 1 representing the JavaScript, Python, and PowerShell scripts, respectively:
corpus = []
labels = []
file_types_and_labels = [(javascript_path, -1), (python_path, 0), (powershell_path, 1)]
for files_path, label in file_types_and_labels:
    files = os.listdir(files_path)
    for file in files:
        file_path = os.path.join(files_path, file)
        try:
            with open(file_path, "r") as myfile:
                data = myfile.read().replace("\n", "")
        except OSError:
            # Skip files that cannot be read rather than appending stale data
            continue
        corpus.append(str(data))
        labels.append(label)
We go on to create a train-test split and a pipeline that will perform basic NLP on the files, followed by a random forest classifier:
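A minimal sketch of that split and pipeline, using a small placeholder corpus in place of the samples read in above (the step names, `ngram_range`, and `n_estimators` values here are illustrative choices, not fixed by the recipe):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Placeholder data standing in for the corpus and labels built earlier.
corpus = ["var x = 1;", "def f():\n    pass", "Get-Process"] * 10
labels = [-1, 0, 1] * 10

# Hold out a third of the samples for testing.
X_train, X_test, y_train, y_test = train_test_split(
    corpus, labels, test_size=0.33, random_state=42, stratify=labels
)

# HashingVectorizer maps raw text to token-count features without storing
# a vocabulary; TfidfTransformer reweights those counts; the random forest
# then classifies the resulting feature vectors.
pipeline = Pipeline([
    ("vect", HashingVectorizer(ngram_range=(1, 2), alternate_sign=False)),
    ("tfidf", TfidfTransformer()),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
])
pipeline.fit(X_train, y_train)
accuracy = pipeline.score(X_test, y_test)
print(accuracy)
```

Wrapping the vectorizer, transformer, and classifier in a single Pipeline means the same featurization is applied consistently at both fit and predict time.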