- Python Web Scraping(Second Edition)
- Katharine Jarmul Richard Lawson
- 2021-07-09 19:42:44
Finding the owner of a website
For some websites, it may matter to us who the owner is. For example, if the owner is known to block web crawlers, it would be wise to be more conservative in our download rate. To find out who owns a website, we can use the WHOIS protocol to look up the registered owner of the domain name. A Python wrapper for this protocol, documented at https://pypi.python.org/pypi/python-whois, can be installed via pip:
pip install python-whois
Here is the most informative part of the WHOIS response when querying the appspot.com domain with this module:
>>> import whois
>>> print(whois.whois('appspot.com'))
{
  ...
  "name_servers": [
    "NS1.GOOGLE.COM",
    "NS2.GOOGLE.COM",
    "NS3.GOOGLE.COM",
    "NS4.GOOGLE.COM",
    "ns4.google.com",
    "ns2.google.com",
    "ns1.google.com",
    "ns3.google.com"
  ],
  "org": "Google Inc.",
  "emails": [
    "abusecomplaints@markmonitor.com",
    "dns-admin@google.com"
  ]
}
We can see here that this domain is owned by Google, which is correct; this domain is for the Google App Engine service. Google often blocks web crawlers, despite being fundamentally a web crawling business itself. We would need to be careful when crawling this domain, because Google often blocks IP addresses that scrape its services too quickly, and you, or someone you live or work with, might need to use Google services. I have been asked to enter captchas to use Google services for short periods, even after running only simple search crawlers on Google domains.
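As a rough sketch of how the owner information could feed into the download rate, the helper below reads the "org" field from a WHOIS record and returns a longer delay for owners we want to treat cautiously. The function name choose_delay, the CAUTIOUS_OWNERS tuple, and the delay values are hypothetical and chosen only for illustration. The code assumes the record behaves like a dictionary, which is how the response printed above appears.

```python
# Hypothetical helper: the owner list and the delay values are
# illustrative assumptions, not recommendations from the text.
CAUTIOUS_OWNERS = ("google",)


def choose_delay(record, default_delay=1.0, cautious_delay=5.0):
    """Return a download delay in seconds based on a WHOIS record.

    `record` is any mapping with an optional "org" key, such as the
    dictionary-like response shown above.
    """
    org = record.get("org") or ""
    # Assumption: a field may hold several values instead of one string.
    if isinstance(org, (list, tuple)):
        org = " ".join(str(item) for item in org)
    org = str(org).lower()
    # Slow down if any cautious owner name appears in the organisation.
    if any(owner in org for owner in CAUTIOUS_OWNERS):
        return cautious_delay
    return default_delay


# Using only the "org" field from the response shown above:
sample_record = {"org": "Google Inc."}
print(choose_delay(sample_record))  # 5.0
```

With the module installed, the same function could be given the result of whois.whois('appspot.com') in place of sample_record. A missing or empty "org" field falls back to the default delay.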