
How to do it...

We will start with a fresh IPython session and load the planets page:

In [1]: import requests
...: from bs4 import BeautifulSoup
...: html = requests.get("http://localhost:8080/planets.html").text
...: soup = BeautifulSoup(html, "lxml")
...:

In the previous recipe, to access all of the <tr> elements in the table, we used chained property syntax to get the table, and then had to get its children and iterate over them. The problem with that approach is that the children can include nodes other than <tr> elements. A better way to get just the <tr> elements is to use findAll (the older spelling of find_all in Beautiful Soup 4; both names work).
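To make the difference concrete, here is a minimal sketch. It assumes a small inline HTML string standing in for planets.html (the markup below is hypothetical, not the real page) and uses the built-in "html.parser" so that it runs without lxml:

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical two-row table standing in for planets.html
html = """<table id="planetsTable">
<tr id="planetHeader"><th>Name</th></tr>
<tr class="planet" id="planet1"><td>Mercury</td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

# .children yields every direct child node, including the
# whitespace text nodes that sit between the rows
children = list(table.children)
non_tags = [c for c in children if not isinstance(c, Tag)]

# find_all returns only Tag objects that match the query
rows = table.find_all("tr")

print(len(children), len(non_tags), len(rows))
```

With this input, some of the direct children are plain text nodes, whereas every item returned by find_all is a <tr> tag.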

Let's start by finding the <table>:

In [4]: table = soup.find("table")
...: str(table)[:100]
...:
Out[4]: '<table border="1" id="planetsTable">\n<tr id="planetHeader">\n<th>\n</th>\n<th>\r\n Nam'

This tells the soup object to find the first <table> element in the document.  From this element we can find all of the <tr> elements that are descendants of the table with findAll:

In [8]: [str(tr)[:50] for tr in table.findAll("tr")]
Out[8]:
['<tr id="planetHeader">\n<th>\n</th>\n<th>\r\n ',
'<tr class="planet" id="planet1" name="Mercury">\n<t',
'<tr class="planet" id="planet2" name="Venus">\n<td>',
'<tr class="planet" id="planet3" name="Earth">\n<td>',
'<tr class="planet" id="planet4" name="Mars">\n<td>\n',
'<tr class="planet" id="planet5" name="Jupiter">\n<t',
'<tr class="planet" id="planet6" name="Saturn">\n<td',
'<tr class="planet" id="planet7" name="Uranus">\n<td',
'<tr class="planet" id="planet8" name="Neptune">\n<t',
'<tr class="planet" id="planet9" name="Pluto">\n<td>']

Note that findAll searches all descendants, not just immediate children. Change the query to "td" to see the difference. There are no direct children of the table that are <td> elements, but each planet row has multiple <td> elements. In all, 54 <td> elements would be found.
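The descendant-versus-child behavior can be checked with the recursive argument of find_all. The sketch below again assumes a hypothetical inline table rather than the real planets page:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the planets table
html = """<table id="planetsTable">
<tr class="planet"><td>Mercury</td><td>0.330</td></tr>
<tr class="planet"><td>Venus</td><td>4.87</td></tr>
</table>"""

table = BeautifulSoup(html, "html.parser").find("table")

# Default behavior: search every descendant of the table
all_tds = table.find_all("td")

# recursive=False restricts the search to direct children only
direct_tds = table.find_all("td", recursive=False)

print(len(all_tds))     # 4
print(len(direct_tds))  # 0
```

The <td> elements are children of the rows, not of the table, so the non-recursive search finds none.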

There is a small issue here if we want only the rows that contain planet data: the table header row is also included. We can fix this by using the attributes of the target rows. For example, the following finds the row whose id is "planet3".

In [14]: table.find("tr", {"id": "planet3"})
...:
Out[14]:
<tr class="planet" id="planet3" name="Earth">
<td>
<img src="img/earth-150x150.png"/>
</td>
<td>
Earth
</td>
<td>
5.97
</td>
<td>
12756
</td>
<td>
The name Earth comes from the Indo-European base 'er,'which produced the Germanic noun 'ertho,' and ultimately German 'erde,'
Dutch 'aarde,' Scandinavian 'jord,' and English 'earth.' Related forms include Greek 'eraze,' meaning
'on the ground,' and Welsh 'erw,' meaning 'a piece of land.'
</td>
<td>
<a >Wikipedia</a>
</td>
</tr>

Awesome! This works because the page gives each row that holds actual data its own id attribute (along with the class "planet").
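Attribute filters can be written in a few equivalent ways. The following is a sketch under the assumption of a hypothetical inline table that mimics the attributes shown above; it is not the book's own code:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the planets table
html = """<table id="planetsTable">
<tr id="planetHeader"><th>Name</th></tr>
<tr class="planet" id="planet3" name="Earth"><td>Earth</td></tr>
<tr class="planet" id="planet4" name="Mars"><td>Mars</td></tr>
</table>"""

table = BeautifulSoup(html, "html.parser").find("table")

# A dictionary of attributes, as used in this recipe
by_dict = table.find("tr", {"id": "planet3"})

# The same filter written as a keyword argument
by_keyword = table.find("tr", id="planet3")

# "class" is a reserved word in Python, so the keyword form is class_
planet_rows = table.find_all("tr", class_="planet")

# find returns None when nothing matches, so check before using the result
missing = table.find("tr", id="planet99")

print(by_dict["name"])                 # Earth
print([r["id"] for r in planet_rows])  # ['planet3', 'planet4']
print(missing)                         # None
```

Filtering on the class rather than the id returns every data row and leaves out the header row.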

Now let's go one step further and collect the mass of each planet, storing each name and mass in a dictionary:

In [18]: items = dict()
...: planet_rows = table.findAll("tr", {"class": "planet"})
...: for row in planet_rows:
...:     tds = row.findAll("td")
...:     items[tds[1].text.strip()] = tds[2].text.strip()
...:

In [19]: items
Out[19]:
{'Earth': '5.97',
'Jupiter': '1898',
'Mars': '0.642',
'Mercury': '0.330',
'Neptune': '102',
'Pluto': '0.0146',
'Saturn': '568',
'Uranus': '86.8',
'Venus': '4.87'}

And just like that we have made a nice data structure from the content embedded within the page.
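The values scraped this way are still strings. As a possible next step, the sketch below converts them to floats so they can be compared numerically. It starts from the dictionary shown in the output above; the names masses, heaviest, and lightest are illustrative only:

```python
# The dictionary produced by the recipe (values are strings)
items = {
    'Earth': '5.97',
    'Jupiter': '1898',
    'Mars': '0.642',
    'Mercury': '0.330',
    'Neptune': '102',
    'Pluto': '0.0146',
    'Saturn': '568',
    'Uranus': '86.8',
    'Venus': '4.87',
}

# Convert each mass string to a float
masses = {name: float(mass) for name, mass in items.items()}

# Numeric values can now be compared directly
heaviest = max(masses, key=masses.get)
lightest = min(masses, key=masses.get)

print(heaviest)  # Jupiter
print(lightest)  # Pluto
```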
