In the following steps, we curate a dataset and then use it to create a classifier to determine the file type. For demonstration purposes, we show how to obtain a collection of PowerShell scripts, Python scripts, and JavaScript files by scraping GitHub. A collection of samples obtained in this way can be found in the accompanying repository as PowerShellSamples.7z, PythonSamples.7z, and JavascriptSamples.7z. First, we will write the code for the JavaScript scraper:
Begin by importing the PyGithub library so that we can call the GitHub API. We also import the base64 module to decode base64-encoded files:
import os
from github import Github
import base64
We must supply our credentials, and then specify a query—in this case, for JavaScript—to select our repositories:
username = "your_github_username"
password = "your_password"
target_dir = "/path/to/JavascriptSamples/"
g = Github(username, password)
repositories = g.search_repositories(query='language:javascript')
n = 5
i = 0
We loop over the repositories matching our criteria:
for repo in repositories:
    repo_name = repo.name
    target_dir_of_repo = os.path.join(target_dir, repo_name)
    print(repo_name)
    try:
We create a directory for each repository matching our search criteria, and then read in its contents:
        os.mkdir(target_dir_of_repo)
        i += 1
        contents = repo.get_contents("")
We add the repository's directories to a queue so that we can list every file they contain:
        while len(contents) > 0:
            file_content = contents.pop(0)
            if file_content.type == "dir":
                contents.extend(repo.get_contents(file_content.path))
            else:
If we find a non-directory file, we check whether its extension is .js:
                filename = file_content.path
                extension = filename.split(".")[-1]
                if extension == "js":
If the extension is .js, we write out a copy of the file:
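The GitHub API returns file bodies base64-encoded, so the write-out step decodes the content and saves it under the repository's directory. Below is a minimal, self-contained sketch of that decode-and-write logic; `save_decoded_file` is a hypothetical helper, and the sample string stands in for `file_content.content`, which in the real script comes from PyGithub:

```python
import base64
import os
import tempfile

def save_decoded_file(target_dir_of_repo, filename, b64_content):
    # Decode the base64-encoded body (as returned by the GitHub API)
    # and write it into the repository's sample directory.
    os.makedirs(target_dir_of_repo, exist_ok=True)
    out_path = os.path.join(target_dir_of_repo, os.path.basename(filename))
    with open(out_path, "wb") as f:
        f.write(base64.b64decode(b64_content))
    return out_path

# Stand-in for file_content.content (base64-encoded file body):
sample = base64.b64encode(b"console.log('hello');").decode()
path = save_decoded_file(tempfile.mkdtemp(), "src/app.js", sample)
print(open(path).read())  # console.log('hello');
```

Writing in binary mode (`"wb"`) avoids any encoding guesswork: the decoded bytes are stored exactly as they appear in the repository.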