官术网_书友最值得收藏!

Identifying the technology used by a website

The type of technology used to build a websitewill affect how we crawl it. A useful tool to check the kind of technologies a website is built with is the module detectem, which requires Python 3.5+ and Docker. If you don't already have Docker installed, follow the instructions for your operating system at https://www.docker.com/products/overview. Once Docker is installed, you can run the following commands.

docker pull scrapinghub/splash
pip install detectem

This will pull the latest Docker image from ScrapingHub and install the package via pip. It is recommended to use a Python virtual environment (https://docs.python.org/3/library/venv.html) or a Conda environment (https://conda.io/docs/using/envs.html) and to check the project's ReadMe page (https://github.com/spectresearch/detectem) for any updates or changes.

Why use environments?
Imagine if your project was developed with an earlier version of a library such as  detectem, and then, in a later version, detectem introduced some backwards-incompatible changes that break your project. However, different projects you are working on would like to use the newer version. If your project uses the system-installed detectem, it is eventually going to break when libraries are updated to support other projects.
Ian Bicking's virtualenv provides a clever hack to this problem by copying the system Python executable and its dependencies into a local directory to create an isolated Python environment. This allows a project to install specific versions of Python libraries locally and independently of the wider system. You can even utilize different versions of Python in different virtual environments. Further details are available in the documentation at https://virtualenv.pypa.io. Conda environments offer similar functionality using the Anaconda Python path.

The detectem module uses a series of requests and responses to detect technologies used by the website, based on a series of extensible modules. It uses Splash (https://github.com/scrapinghub/splash), a scriptable browser developed by ScrapingHub (https://scrapinghub.com/). To run the module, simply use the det command:

       $ det http://example.webscraping.com
[('jquery', '1.11.0')]

We can see the example website uses a common JavaScript library, so its content is likely embedded in the HTML and should be relatively straightforward to scrape.

Detectem is still fairly young and aims to eventually have Python parity to Wappalyzer (https://github.com/AliasIO/Wappalyzer), a Node.js-based project supporting parsing of many different backends as well as ad networks, JavaScript libraries, and server setups. You can also run Wappalyzer via Docker. To first download the Docker image, run:

$ docker pull wappalyzer/cli

Then, you can run the script from the Docker instance:

$ docker run wappalyzer/cli http://example.webscraping.com

The output is a bit hard to read, but if we copy and paste it into a JSON linter, we can see the many different libraries and technologies detected:

{'applications': 
[{'categories': ['Javascript Frameworks'],
'confidence': '100',
'icon': 'Modernizr.png',
'name': 'Modernizr',
'version': ''},
{'categories': ['Web Servers'],
'confidence': '100',
'icon': 'Nginx.svg',
'name': 'Nginx',
'version': ''},
{'categories': ['Web Frameworks'],
'confidence': '100',
'icon': 'Twitter Bootstrap.png',
'name': 'Twitter Bootstrap',
'version': ''},
{'categories': ['Web Frameworks'],
'confidence': '100',
'icon': 'Web2py.png',
'name': 'Web2py',
'version': ''},
{'categories': ['Javascript Frameworks'],
'confidence': '100',
'icon': 'jQuery.svg',
'name': 'jQuery',
'version': ''},
{'categories': ['Javascript Frameworks'],
'confidence': '100',
'icon': 'jQuery UI.svg',
'name': 'jQuery UI',
'version': '1.10.3'},
{'categories': ['Programming Languages'],
'confidence': '100',
'icon': 'Python.png',
'name': 'Python',
'version': ''}],
'originalUrl': 'http://example.webscraping.com',
'url': 'http://example.webscraping.com'}

Here, we can see that Python and the web2py frameworks were detected with very high confidence. We can also see that the frontend CSS framework Twitter Bootstrap is used. Wappalyzer also detected Modernizer.js and the use of Nginx as the backend server. Because the site is only using JQuery and Modernizer, it is unlikely the entire page is loaded by JavaScript. If the website was instead built with AngularJS or React, then its content would likely be loaded dynamically. Or, if the website used ASP.NET, it would be necessary to use sessions and form submissions to crawl web pages. Working with these more difficult cases will be covered later in Chapter 5, Dynamic Content and Chapter 6, Interacting with Forms.

主站蜘蛛池模板: 岳普湖县| 宜城市| 稻城县| 邹城市| 巨鹿县| 霍山县| 乌恰县| 普格县| 东阳市| 黎城县| 高要市| 桦甸市| 绿春县| 上高县| 板桥市| 钟山县| 神木县| 东至县| 梁山县| 象州县| 基隆市| 台东县| 云安县| 博湖县| 东丰县| 巴彦县| 华蓥市| 两当县| 游戏| 扎鲁特旗| 烟台市| 长子县| 五河县| 安阳县| 宜州市| 宿松县| 岢岚县| 万荣县| 肥乡县| 游戏| 双江|