Downloading the JavaScript

There's one more step before we can point this at a site: we need to download the actual JavaScript! Before analyzing the source code using our scanjs wrapper, we need to pull it from the target page. Pulling the code once in a single, discrete process (and from a single URL) means that, even as we develop more tooling around attack-surface reconnaissance, we can hook this script up to other services: it could pull the JavaScript from a URL supplied by a crawler, it could feed JavaScript or other assets into other analysis tools, or it could analyze other page metrics.

So the simplest version of this script should do the following: take a URL, look at the source code for that page to find all the JavaScript libraries it references, and then download those files to the specified location.

The first thing we need to do is grab the HTML from the URL of the page we're inspecting. Let's add some code that accepts the url and directory CLI arguments, and defines our target and where to store the downloaded JavaScript. Then, let's use the requests library to pull the data and Beautiful Soup to make the HTML string a searchable object:

#!/usr/bin/env python2.7

import os, sys
import requests
from bs4 import BeautifulSoup

# the target page and the directory where the downloaded JavaScript will be stored
url = sys.argv[1]
directory = sys.argv[2]

# pull the page's HTML and parse it into a searchable object
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
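
As a small safety net (this check is our own addition, not part of the original script), we can bail out early if the page didn't load, before handing the response to Beautiful Soup:

# our addition: stop early if the request failed
if r.status_code != 200:
    sys.exit("Could not fetch %s (HTTP %d)" % (url, r.status_code))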

Then we need to iterate over each script tag and use the src attribute data to download the file to a directory within our current root:

for script in soup.find_all('script'):
    if script.get('src'):
        download_script(script.get('src'))

That download_script() function might not ring a bell because we haven't written it yet. But that's what we want: a function that takes the src attribute path, builds the link to the resource, and downloads it into the directory we've specified:

def download_script(uri):
    # a leading / means the path is relative to the target host
    address = url + uri if uri[0] == '/' else uri
    # carve the filename out of the address, from the last / through the js extension
    filename = address[address.rfind("/")+1:address.rfind("js")+2]
    # pull the script from the assembled address and write it into our directory
    req = requests.get(address)
    with open(directory + '/' + filename, 'wb') as f:
        f.write(req.content)
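
One caveat: as written, the open() call will fail if the target directory doesn't already exist. A minimal guard, again our own addition, puts the os import to work before the download loop runs:

# our addition: make sure the output directory exists before writing to it
if not os.path.isdir(directory):
    os.makedirs(directory)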

Each line is pretty direct. After the function definition, the HTTP address of the script is created using a Python ternary. If the src attribute starts with /, it's a relative path and can just be appended onto the hostname; if it doesn't, it must be a full/absolute link. Ternaries can be funky but also powerfully expressive once you get the hang of them.
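
To make the ternary concrete, here's how it resolves for both kinds of src values (the URLs below are made up for illustration):

# with url = "https://example.com"
address = url + uri if uri[0] == '/' else uri
# uri = "/static/app.js"                 -> "https://example.com/static/app.js"
# uri = "https://cdn.example.com/lib.js" -> "https://cdn.example.com/lib.js" (used as-is)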

The next line of the function creates the filename from the JavaScript library link by finding the character index just past the last forward slash (address.rfind("/")+1) and the index of the js file extension, plus 2 so the js part isn't sliced off (address.rfind("js")+2), and then uses the [begin:end] string-slicing syntax to create a new string from just the specified indices.
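
A quick worked example, using a made-up address, shows exactly what that slice produces:

address = "https://example.com/static/app.js"
# address.rfind("/") + 1  -> 27, the index of the "a" in "app.js"
# address.rfind("js") + 2 -> 33, one past the final "s"
filename = address[address.rfind("/")+1:address.rfind("js")+2]
# filename is now "app.js"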

Finally, the script pulls data from the assembled address using requests, creates a new file using a context manager, and writes the script's source code to directory/filename.js. Now you have a directory, the path passed in as an argument, with all of the JavaScript from a particular page saved inside it.
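
Putting it all together, note that in the assembled script the download_script() definition needs to sit above the loop that calls it, since Python reads the file top to bottom. If you save the script as, say, grab_js.py (a placeholder name), a run looks like this:

python grab_js.py https://example.com javascript

When it finishes, the javascript directory will hold one .js file for each external script tag found on the page.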
