
Finding the owner of a website

For some websites, it may matter to us who the owner is. For example, if the owner is known to block web crawlers, then it would be wise to be more conservative in our download rate. To find out who owns a website, we can use the WHOIS protocol to see who the registered owner of the domain name is. A Python wrapper for this protocol, documented at https://pypi.python.org/pypi/python-whois, can be installed via pip:

   pip install python-whois

Here is the most informative part of the WHOIS response when querying the appspot.com domain with this module:

   >>> import whois
   >>> print(whois.whois('appspot.com'))
   {
     ...
     "name_servers": [
       "NS1.GOOGLE.COM",
       "NS2.GOOGLE.COM",
       "NS3.GOOGLE.COM",
       "NS4.GOOGLE.COM",
       "ns4.google.com",
       "ns2.google.com",
       "ns1.google.com",
       "ns3.google.com"
     ],
     "org": "Google Inc.",
     "emails": [
       "abusecomplaints@markmonitor.com",
       "dns-admin@google.com"
     ]
   }

We can see here that this domain is owned by Google, which is correct: appspot.com is the domain for the Google App Engine service. Google often blocks web crawlers, despite being fundamentally a web crawling business itself. We would need to be careful when crawling this domain, because Google often blocks IP addresses that scrape its services too quickly, and you, or someone you live or work with, might need to use Google services. I have been asked to enter CAPTCHAs to use Google services for short periods, even after running only simple search crawlers on Google domains.
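To connect the WHOIS lookup back to the download rate, here is a minimal sketch of how the owner could be used to choose a crawl delay. The helper names (owner_of, choose_delay), the CAUTIOUS_OWNERS tuple, and the delay values are all hypothetical and chosen only for illustration; the sketch also assumes the WHOIS result can be read like a dictionary with an "org" key, as the output above suggests.

```python
# Hypothetical list of owners we want to crawl conservatively (an assumption,
# not something the WHOIS protocol or python-whois provides).
CAUTIOUS_OWNERS = ("google",)


def owner_of(record):
    """Return the registrant organisation from a WHOIS record, or ''.

    `record` is assumed to behave like a dict with an optional "org" key.
    Some records may hold a list of values, so take the first one if so.
    """
    org = record.get("org") or ""
    if isinstance(org, (list, tuple)):
        org = org[0] if org else ""
    return str(org)


def choose_delay(record, default_delay=1.0, cautious_delay=5.0):
    """Pick a delay in seconds between downloads, based on the domain owner.

    The default values are illustrative only.
    """
    owner = owner_of(record).lower()
    if any(name in owner for name in CAUTIOUS_OWNERS):
        return cautious_delay
    return default_delay


# Example with a hand-written record, so no network access is needed:
print(choose_delay({"org": "Google Inc."}))

# With python-whois installed, a live lookup might look like this:
#   import whois
#   print(choose_delay(whois.whois('appspot.com')))
```

Matching on a substring of the organisation name is deliberately crude. WHOIS records vary between registrars and are sometimes hidden behind privacy services, so a missing or unexpected "org" value simply falls back to the default delay.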
