Scraping web data
In most cases, the data you need will not already live in your database; instead, it is published in various forms on the Internet. To dig up more valuable information from these sources, we need to know how to access and scrape data from the Web. Here, we will illustrate how to use the rvest package to harvest finance data from http://www.bloomberg.com/.
Getting ready
In this recipe, you need to prepare your environment with R installed and a computer that can access the Internet.
How to do it…
Perform the following steps to scrape data from http://www.bloomberg.com/:
- First, access the following link to browse the S&P 500 index on the Bloomberg Business website at http://www.bloomberg.com/quote/SPX:IND:
Figure 9: S&P 500 index
- Once the page appears, as shown in the preceding screenshot, we can begin installing and loading the rvest package:
> install.packages("rvest")
> library(rvest)
- Next, you can use the html function from the rvest package to scrape and parse the HTML page of the S&P 500 index at http://www.bloomberg.com/quote/SPX:IND:
> spx_quote <- html("http://www.bloomberg.com/quote/SPX:IND")
- Use the browser's built-in web inspector to inspect the location of the detail quote (marked with a red rectangle) below the index chart:
Figure 10: Inspecting the DOM location of S&P 500 index
- You can then move the mouse over the detail quote and click on the target element that you wish to scrape down. As the following screenshot shows, the <div class="cell"> section holds all the information we need:
Figure 11: Inspecting the DOM location of the detail quote
- Extract elements with the class of cell with the html_nodes function:
> cell <- spx_quote %>% html_nodes(".cell")
- Furthermore, we can parse the label of the detail quote from elements with the class of cell__label, extract text from the scraped HTML, and eventually clean spaces and newline characters from the extracted text:
> label <- cell %>%
+     html_nodes(".cell__label") %>%
+     html_text() %>%
+     lapply(function(e) gsub("\n|\\s+", "", e))
- Also, we can extract the value of the detail quote from elements with the class of cell__value, extract text from the scraped HTML, and clean spaces and newline characters:
> value <- cell %>%
+     html_nodes(".cell__value") %>%
+     html_text() %>%
+     lapply(function(e) gsub("\n|\\s+", "", e))
- Finally, we can set the extracted label as the names of value:
> names(value) <- label
- Next, we can access the energy and oil market index page at the following link (http://www.bloomberg.com/energy):
Figure 12: Inspecting the DOM location of crude oil and natural gas
- We can then use the web inspector to inspect the location of the table element:
Figure 13: Inspecting the DOM location of table element
- Finally, we can use html_table to extract the table element with the class of data-table (a consolidated sketch of the quote-scraping workflow follows these steps):
> energy <- html("http://www.bloomberg.com/energy")
> energy.table <- energy %>% html_node(".data-table") %>% html_table()
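For reference, here is a minimal end-to-end sketch of the quote-scraping steps above. It assumes a recent version of rvest, where read_html() supersedes the older html() function, and it assumes the Bloomberg page still exposes the .cell, .cell__label, and .cell__value classes; the page layout may have changed since this recipe was written.

library(rvest)

# Read and parse the quote page; read_html() replaces html() in newer rvest releases
spx_quote <- read_html("http://www.bloomberg.com/quote/SPX:IND")

# Pull every detail-quote cell, then the label and value text inside each
cell  <- spx_quote %>% html_nodes(".cell")
label <- cell %>% html_nodes(".cell__label") %>% html_text() %>% gsub("\n|\\s+", "", .)
value <- cell %>% html_nodes(".cell__value") %>% html_text() %>% gsub("\n|\\s+", "", .)

# Attach the labels to the values as a named character vector
quote <- setNames(value, label)
quote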
How it works…
The most difficult aspect of scraping data from a website is that web data is published and structured in many different formats. You have to fully understand how the data is structured within the HTML tags before you continue.
As HTML (Hypertext Markup Language) has a syntax similar to XML, we can use the XML package to read and parse HTML pages. However, the XML package only provides the XPath method, which has two main shortcomings, as follows:
- Inconsistent behavior in different browsers
- Expressions that are hard to read and maintain
For these reasons, we recommend using CSS selectors over XPath when parsing HTML.
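As an illustration, the two calls below select the same elements, once with an XPath expression and once with a CSS selector. This is only a sketch, and it assumes the spx_quote document parsed earlier in this recipe.

# XPath equivalent of the CSS class selector .cell: verbose and easy to get wrong
cell_xpath <- spx_quote %>%
  html_nodes(xpath = "//*[contains(concat(' ', normalize-space(@class), ' '), ' cell ')]")

# CSS selector: the same elements, far more readable
cell_css <- spx_quote %>% html_nodes(".cell")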
Python users may be familiar with how quickly they can scrape data by using the requests and BeautifulSoup packages. The rvest package is the counterpart in R, and it provides the same ability to simply and efficiently harvest data from HTML pages.
In this recipe, our target is to scrape the finance data of the S&P 500 detail quote from the Internet. We begin by installing and loading the rvest package. After installation and loading are complete, we can use the html function to read the source code of the page into spx_quote.
Once we have confirmed that we can read the HTML page, we can start parsing the detail quote from the scraped HTML. However, we first need to inspect the CSS path of the detail quote. There are many ways to inspect the CSS path of a specific element. The most popular method is to use the development tool built into each browser (press F12 or FN + F12) to inspect the CSS path. Using Google Chrome as an example, you can open the development tool by pressing F12. A DevTools window may show up somewhere in the visual area (refer to the following document page: https://developer.chrome.com/devtools/docs/dom-and-styles#inspecting-elements).
Then, you can move the mouse cursor to the upper left of the DevTools window and select the Inspect Element icon (a magnifier icon). Next, click on the target element, and the DevTools window will highlight the source code of the selected area. You can then move the mouse cursor to the highlighted area and right-click on it. From the pop-up menu, click on Copy CSS Path to extract the CSS path. Alternatively, you can examine the source code and find that the selected element is structured in HTML code with the class of cell.
One highlight of rvest is that it is designed to work with magrittr, so we can use the pipeline operator %>% to chain the output parsed at each stage. Thus, we can first obtain the output source by calling spx_quote and then pipe the output to html_nodes. As the html_nodes function uses CSS selectors to parse elements, the function takes basic selectors by type (for example, div), ID (for example, #header), and class (for example, .cell). As the elements to be extracted have the class of cell, you should place a period (.) in front of cell.
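The three selector forms can be illustrated on a tiny, made-up HTML fragment (the markup below is invented purely for demonstration):

library(rvest)

# A minimal document to demonstrate type, ID, and class selectors
doc <- read_html('<div id="header">Quote</div>
                  <div class="cell"><span class="cell__label">OPEN</span></div>')

doc %>% html_nodes("div")      # by type: every <div> element
doc %>% html_nodes("#header")  # by ID: the element with id="header"
doc %>% html_nodes(".cell")    # by class: elements whose class contains "cell"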
Lastly, we should extract both the label and the value from the previously parsed nodes. Here, we first extract the elements of class cell__label, and then use html_text to extract their text. We can then use the gsub function to clean spaces and newline characters from the parsed text. Likewise, we apply the same pipeline to extract the elements of class cell__value. As we have extracted both label and value from the detail quote, we can apply the labels as the names of the extracted values. We have now turned data from the Web into structured data.
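Because lapply returns lists, it can be convenient to flatten both results before assigning names. This is a small optional tidying step, not part of the original recipe:

# Flatten the parsed lists and attach the labels as names
label <- unlist(label)
value <- unlist(value)
names(value) <- label
value   # a named character vector, e.g. value["Open"] if the page exposes an Open field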
Alternatively, we can also use rvest to harvest tabular data. Similar to the process used to harvest the S&P 500 index, we can first access the energy and oil market index page. We can then use the web element inspector to find the location of the table element. As we have found the element located in the class of data-table, we can use the html_table function to read the table content into an R data frame.
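One point worth checking after html_table is the column types: scraped tables usually arrive as character columns, so numeric fields often need explicit conversion. Here is a hedged sketch, assuming the energy.table data frame from the step above and assuming it has a column named Price (the real column names depend on the page):

# Inspect the structure of the scraped table
str(energy.table)

# Convert a character price column to numeric (the column name Price is an assumption)
energy.table$Price <- as.numeric(gsub(",", "", energy.table$Price))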
There's more…
Instead of using the web inspector built into each browser, you can consider using SelectorGadget, a very powerful and simple-to-use extension for Google Chrome, which enables the user to extract the CSS path of the target element with only a few clicks:
- To begin using SelectorGadget, access the following link: https://chrome.google.com/webstore/detail/selectorgadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb. Then, click on the green button (shown in the red rectangle in the following screenshot) to install the plugin in Chrome:
Figure 14: Adding SelectorGadget to Chrome
- Next, click the upper-right icon to open SelectorGadget, and then select the area that needs to be scraped down. The selected area will be colored green, and the gadget will display the CSS path of the area and the number of elements matched to the path:
Figure 15: Using SelectorGadget to inspect the DOM location of the table element
- Finally, you can paste the extracted CSS path into html_nodes as an input argument to parse the data.
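For example, if SelectorGadget reports a path such as .data-table .cell (a hypothetical path used only for illustration), it can be passed straight to html_nodes:

# The CSS path below is a placeholder; use the path SelectorGadget reports for your page
nodes <- energy %>% html_nodes(".data-table .cell")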
Besides rvest, you can connect R with Selenium via the RSelenium package to scrape web pages. Selenium was originally designed as a web application testing framework that enables the user to command a web browser and automate processes through simple scripts. However, you can also use Selenium to scrape data from the Internet. The following instructions present a sample demo of how to scrape http://www.bloomberg.com/ using RSelenium:
- First, access the following link to download the Selenium standalone server (http://www.seleniumhq.org/download/):
Figure 16: Downloading the Selenium standalone server driver
- Next, start the Selenium standalone server using the following command:
$ java -jar selenium-server-standalone-2.46.0.jar
- If you can successfully launch the standalone server, you should see the following message, which means you can connect to the server that binds to port 4444:
Figure 17: Initiating the Selenium standalone server
- At this point, you can begin installing and loading RSelenium with the following commands:
> install.packages("RSelenium")
> library(RSelenium)
- After RSelenium is installed, register the driver and connect to the Selenium server:
> remDr <- remoteDriver(remoteServerAddr = "localhost"
+                       , port = 4444
+                       , browserName = "firefox"
+ )
- Examine the status of the registered driver:
> remDr$getStatus()
- Next, we open the browser and navigate to the S&P 500 quote page on http://www.bloomberg.com/:
> remDr$open()
> remDr$navigate("http://www.bloomberg.com/quote/SPX:IND")
- Finally, we can scrape the data by using the CSS selector:
> webElem <- remDr$findElements('css selector', ".cell")
> webData <- sapply(webElem, function(x){
+   label <- x$findChildElement('css selector', '.cell__label')
+   value <- x$findChildElement('css selector', '.cell__value')
+   cbind(c("label" = label$getElementText(), "value" = value$getElementText()))
+ }
+ )
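When the scrape is finished, it is good practice to inspect the result and release the browser session. A minimal sketch, assuming the webData object produced above:

# Peek at what came back from the browser
str(webData)

# Close the browser window controlled by the Selenium server
remDr$close()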