
How to do it...

We will use urlopen and requests to read an HTML page encoded in UTF-8. The two libraries handle the encoding differently, so we will look at each in turn. Let's start by importing urlopen, loading the page, and examining some of the content.

In [8]: from urllib.request import urlopen
...: page = urlopen("http://localhost:8080/unicode.html")
...: content = page.read()
...: content[840:1280]
...:
Out[8]: b'><strong>Cyrillic</strong> &nbsp; U+0400 \xe2\x80\x93 U+04FF &nbsp; (1024\xe2\x80\x931279)</p>\n <table class="unicode">\n <tbody>\n <tr valign="top">\n <td width="50">&nbsp;</td>\n <td class="b" width="50">\xd0\x89</td>\n <td class="b" width="50">\xd0\xa9</td>\n <td class="b" width="50">\xd1\x89</td>\n <td class="b" width="50">\xd3\x83</td>\n </tr>\n </tbody>\n </table>\n\n '
Note how the Cyrillic characters were read in as raw bytes: read() returns a bytes object, so each non-ASCII character appears as a sequence of \x escapes, such as \xd0\x89.
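To see why each Cyrillic letter shows up as two \x escapes, it can help to decode a few of the byte sequences by hand. The sketch below uses only byte values that appear in the output above; the variable names are illustrative and not part of the recipe.

```python
# Byte pairs copied from the table cells in the raw output above.
raw_cells = [b"\xd0\x89", b"\xd0\xa9", b"\xd1\x89", b"\xd3\x83"]

# Each pair is the UTF-8 encoding of a single Cyrillic code point.
decoded = [cell.decode("utf-8") for cell in raw_cells]
code_points = [f"U+{ord(ch):04X}" for ch in decoded]

print(decoded)
print(code_points)  # all four fall inside the U+0400 to U+04FF block

# The three-byte sequence \xe2\x80\x93 is the en dash between the range bounds.
en_dash = b"\xe2\x80\x93".decode("utf-8")
print(f"U+{ord(en_dash):04X}")
```

Two bytes per letter is what UTF-8 uses for code points from U+0080 to U+07FF, which covers the whole Cyrillic block.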

To rectify this, we can decode the bytes as UTF-8 into a Python string by passing the encoding to the str constructor:

In [9]: str(content, "utf-8")[837:1270]
Out[9]: '<strong>Cyrillic</strong> &nbsp; U+0400 – U+04FF &nbsp; (1024–1279)</p>\n <table class="unicode">\n <tbody>\n <tr valign="top">\n <td width="50">&nbsp;</td>\n <td class="b" width="50">Љ</td>\n <td class="b" width="50">Щ</td>\n <td class="b" width="50">щ</td>\n <td class="b" width="50">Ӄ</td>\n </tr>\n </tbody>\n </table>\n\n '
Note that the output now has the characters decoded properly.
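Hard-coding "utf-8" works here because we know how the page is encoded. As a hedged sketch, the charset can instead be read from the Content-Type header, falling back to UTF-8 when the server does not declare one. The helper name decode_body and the hand-built header object are assumptions made for this illustration; get_content_charset() is the standard method on the headers object that urlopen returns.

```python
from email.message import Message


def decode_body(body: bytes, headers: Message, default: str = "utf-8") -> str:
    """Decode body using the charset declared in Content-Type, else default."""
    charset = headers.get_content_charset() or default
    # errors="replace" avoids an exception if a byte does not fit the charset.
    return body.decode(charset, errors="replace")


# Offline demonstration with a hand-built header object (hypothetical values).
headers = Message()
headers["Content-Type"] = "text/html; charset=utf-8"
sample = b'<td class="b" width="50">\xd0\xa9</td>'
print(decode_body(sample, headers))
```

In the session above, the equivalent call would be decode_body(content, page.headers), since page.headers is a subclass of email.message.Message.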

We can skip this extra step by using requests, whose text attribute returns the response body already decoded to a string.

In [10]: import requests
...: response = requests.get("http://localhost:8080/unicode.html")
...: response.text[837:1270]
...:
Out[10]: '<strong>Cyrillic</strong> &nbsp; U+0400 – U+04FF &nbsp; (1024–1279)</p>\n <table class="unicode">\n <tbody>\n <tr valign="top">\n <td width="50">&nbsp;</td>\n <td class="b" width="50">Љ</td>\n <td class="b" width="50">Щ</td>\n <td class="b" width="50">щ</td>\n <td class="b" width="50">Ӄ</td>\n </tr>\n </tbody>\n </table>\n\n '