In the following steps, we curate a dataset and then use it to create a classifier to determine the file type. For demonstration purposes, we show how to obtain a collection of PowerShell scripts, Python scripts, and JavaScript files by scraping GitHub. A collection of samples obtained in this way can be found in the accompanying repository as PowerShellSamples.7z, PythonSamples.7z, and JavascriptSamples.7z. First, we will write the code for the JavaScript scraper:
Begin by importing the PyGithub library so that we can call the GitHub API. We also import the base64 module to decode base64-encoded files:
import os
from github import Github
import base64
We must supply our credentials, and then specify a query—in this case, for JavaScript—to select our repositories:
username = "your_github_username"
password = "your_password"
target_dir = "/path/to/JavascriptSamples/"
g = Github(username, password)
repositories = g.search_repositories(query='language:javascript')
n = 5
i = 0
We loop over the repositories matching our criteria:
for repo in repositories:
    repo_name = repo.name
    target_dir_of_repo = os.path.join(target_dir, repo_name)
    print(repo_name)
    try:
We create a directory for each repository matching our search criteria, and then read in its contents:
        os.mkdir(target_dir_of_repo)
        i += 1
        contents = repo.get_contents("")
We add the repository's directories to a queue so that we can list every file they contain:
        while len(contents) > 0:
            file_content = contents.pop(0)
            if file_content.type == "dir":
                contents.extend(repo.get_contents(file_content.path))
            else:
If we find a non-directory file, we check whether its extension is .js:
                filename = file_content.path
                extension = filename.split(".")[-1]
                if extension == "js":
If the extension is .js, we write out a copy of the file:
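The GitHub API returns file bodies base64-encoded, so the write-out step decodes the content and saves it under the repository's directory. Below is a minimal, self-contained sketch of that decode-and-write logic; `save_decoded_file` is a hypothetical helper, and the sample string stands in for `file_content.content`, which in the real script comes from PyGithub:

```python
import base64
import os
import tempfile

def save_decoded_file(target_dir_of_repo, filename, b64_content):
    # Decode the base64-encoded body (as returned by the GitHub API)
    # and write it into the repository's sample directory.
    os.makedirs(target_dir_of_repo, exist_ok=True)
    out_path = os.path.join(target_dir_of_repo, os.path.basename(filename))
    with open(out_path, "wb") as f:
        f.write(base64.b64decode(b64_content))
    return out_path

# Stand-in for file_content.content (base64-encoded file body):
sample = base64.b64encode(b"console.log('hello');").decode()
path = save_decoded_file(tempfile.mkdtemp(), "src/app.js", sample)
print(open(path).read())  # console.log('hello');
```

Writing in binary mode (`"wb"`) avoids any encoding guesswork: the decoded bytes are stored exactly as they appear in the repository.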