- Machine Learning for Cybersecurity Cookbook
- Emmanuel Tsukerman
- 2021-06-24 12:29:06
How to do it...
In the following steps, we curate a dataset and then use it to create a classifier to determine the file type. For demonstration purposes, we show how to obtain a collection of PowerShell scripts, Python scripts, and JavaScript files by scraping GitHub. A collection of samples obtained in this way can be found in the accompanying repository as PowerShellSamples.7z, PythonSamples.7z, and JavascriptSamples.7z. First, we will write the code for the JavaScript scraper:
- Begin by importing the PyGithub library so that we can call the GitHub API. We also import the base64 module to decode the base64-encoded file contents:
import os
from github import Github
import base64
- We must supply our credentials, and then specify a query—in this case, for JavaScript—to select our repositories:
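username = "your_github_username"
# Note: GitHub's API no longer accepts account passwords; use a personal access token in its place.
password = "your_password"
target_dir = "/path/to/JavascriptSamples/"
g = Github(username, password)
repositories = g.search_repositories(query='language:javascript')
n = 5  # number of repositories to download
i = 0  # number of repositories downloaded so far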
- We loop over the repositories matching our criteria:
for repo in repositories:
    repo_name = repo.name
    target_dir_of_repo = os.path.join(target_dir, repo_name)
    print(repo_name)
    try:
- We create a directory for each repository matching our search criteria, and then read in its contents:
        os.mkdir(target_dir_of_repo)
        i += 1
        contents = repo.get_contents("")
- We add all directories of the repository to a queue in order to list all of the files contained within the directories:
        # Loop until the queue is empty so that the last entry is not skipped.
        while contents:
            file_content = contents.pop(0)
            if file_content.type == "dir":
                contents.extend(repo.get_contents(file_content.path))
            else:
- If we find a non-directory file, we check whether its extension is .js:
                filename = file_content.path
                extension = os.path.splitext(filename)[1]
                if extension == ".js":
- If the extension is .js, we write out a copy of the file:
                    file_contents = repo.get_contents(file_content.path)
                    file_data = base64.b64decode(file_contents.content)
                    filename = filename.split("/")[-1]
                    # The with block closes the output file even if the write fails.
                    with open(os.path.join(target_dir_of_repo, filename), "wb") as file_out:
                        file_out.write(file_data)
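    except Exception as e:
        # Skip repositories that fail (for example, the directory already exists or an API call errors out).
        print("Skipping", repo_name, "-", e)
    if i == n:
        break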
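The queue-based traversal used in the steps above can be tried offline before spending API calls. The following is a minimal sketch, not the book's code: `FAKE_REPO`, `list_entries`, and `collect_files` are hypothetical names, and a nested dictionary stands in for the repository contents that the GitHub API would return.

```python
import base64
import os
from collections import deque

# Hypothetical stand-in for a repository: directories are dicts and
# files are base64-encoded bytes, mimicking how file bodies arrive
# base64-encoded in the scraper above.
FAKE_REPO = {
    "README.md": base64.b64encode(b"# demo"),
    "index.js": base64.b64encode(b"console.log('root');"),
    "src": {
        "app.js": base64.b64encode(b"console.log('app');"),
        "style.css": base64.b64encode(b"body {}"),
        "lib": {"util.js": base64.b64encode(b"export {};")},
    },
}


def list_entries(tree, prefix=""):
    """Return (path, node) pairs for one directory level, like one listing call."""
    return [(prefix + name, node) for name, node in tree.items()]


def collect_files(tree, extension):
    """Walk the tree breadth-first and return {path: decoded bytes}."""
    queue = deque(list_entries(tree))
    found = {}
    while queue:  # run until the queue is empty, so no entry is skipped
        path, node = queue.popleft()
        if isinstance(node, dict):
            # A directory: enqueue its children for a later iteration.
            queue.extend(list_entries(node, prefix=path + "/"))
        elif os.path.splitext(path)[1] == extension:
            # A file with the wanted extension: decode and keep it.
            found[path] = base64.b64decode(node)
    return found


js_files = collect_files(FAKE_REPO, ".js")
print(sorted(js_files))
```

Only the three `.js` paths are kept here; the Markdown and CSS entries are dequeued and dropped, which is the same filtering the scraper applies to each repository.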
- Once finished, it is convenient to move all the JavaScript files into one folder.
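One possible way to gather the files into a single folder is sketched below. It assumes the layout produced by the scraper (one subfolder per repository under the target directory); the function name `flatten_samples` and the `__` naming scheme are illustrative choices, not part of the original recipe.

```python
import shutil
import tempfile
from pathlib import Path


def flatten_samples(source_root, dest_dir, pattern="*.js"):
    """Copy every file matching `pattern` under source_root into dest_dir.

    Each copy is prefixed with its repository folder name so that two
    repositories that both contain, say, index.js do not overwrite
    each other. dest_dir should live outside source_root.
    """
    source_root, dest_dir = Path(source_root), Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(source_root.rglob(pattern)):
        if not path.is_file():
            continue
        # First path component below source_root is the repository folder.
        repo_name = path.relative_to(source_root).parts[0]
        target = dest_dir / f"{repo_name}__{path.name}"
        shutil.copy2(path, target)
        copied.append(target.name)
    return copied


# Small demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "JavascriptSamples"
    for repo in ("repo_a", "repo_b"):
        (root / repo).mkdir(parents=True)
        (root / repo / "index.js").write_text("// from " + repo)
    names = flatten_samples(root, Path(tmp) / "AllJavascript")
    print(names)
```

Copying rather than moving keeps the per-repository folders intact, which may help if a sample later needs to be traced back to its source.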
To obtain PowerShell samples, run the same code, checking for the .ps1 extension instead of .js and changing the following:
target_dir = "/path/to/JavascriptSamples/"
repositories = g.search_repositories(query='language:javascript')
To the following:
target_dir = "/path/to/PowerShellSamples/"
repositories = g.search_repositories(query='language:powershell')
Similarly, for Python files, we check for the .py extension and use the following:
target_dir = "/path/to/PythonSamples/"
repositories = g.search_repositories(query='language:python')
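Rather than editing the script by hand for each language, the per-language values could be kept in one table. The sketch below is an assumption about how one might organize this; `LANGUAGES` and `scraper_settings` are hypothetical names, and the `/path/to` base directory is the same placeholder used above.

```python
# Hypothetical table: GitHub search language qualifier -> (folder name, file extension).
LANGUAGES = {
    "javascript": ("JavascriptSamples", ".js"),
    "powershell": ("PowerShellSamples", ".ps1"),
    "python": ("PythonSamples", ".py"),
}


def scraper_settings(language, base_dir="/path/to"):
    """Return the values that differ between runs of the scraper."""
    folder, extension = LANGUAGES[language]
    return {
        "target_dir": f"{base_dir}/{folder}/",
        "query": f"language:{language}",
        "extension": extension,
    }


for language in LANGUAGES:
    print(scraper_settings(language))
```

The `query` value would be passed to `g.search_repositories`, and `extension` would replace the hard-coded check in the download loop.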