Scraping web data
In most cases, the data you need will not already live in your database; instead, it is published in various forms on the Internet. To dig up more valuable information from these sources, we need to know how to access and scrape data from the Web. Here, we will illustrate how to use the rvest package to harvest finance data from http://www.bloomberg.com/.
Getting ready
In this recipe, you need to prepare your environment with R installed and a computer that can access the Internet.
How to do it…
Perform the following steps to scrape data from http://www.bloomberg.com/:
- First, access the following link to browse the S&P 500 index on the Bloomberg Business website at http://www.bloomberg.com/quote/SPX:IND:
Figure 9: S&P 500 index
- Once the page appears, as shown in the preceding screenshot, we can begin installing and loading the rvest package:
> install.packages("rvest")
> library(rvest)
- Next, you can use the html function from the rvest package to scrape and parse the HTML page of the S&P 500 index at http://www.bloomberg.com/quote/SPX:IND:
> spx_quote <- html("http://www.bloomberg.com/quote/SPX:IND")
- Use the browser's built-in web inspector to inspect the location of the detail quote (marked with a red rectangle) below the index chart:
Figure 10: Inspecting the DOM location of S&P 500 index
- You can then move the mouse over the detail quote and click on the target element that you wish to scrape down. As the following screenshot shows, the <div class="cell"> section holds all the information we need:
Figure 11: Inspecting the DOM location of the detail quote
- Extract elements with the class of cell with the html_nodes function:
> cell <- spx_quote %>% html_nodes(".cell")
- Furthermore, we can parse the label of the detail quote from elements with the class of cell__label, extract text from the scraped HTML, and eventually clean spaces and newline characters from the extracted text:
> label <- cell %>%
+     html_nodes(".cell__label") %>%
+     html_text() %>%
+     lapply(function(e) gsub("\n|\\s+", "", e))
- Also, we can extract the value of the detail quote from elements with the class of cell__value, extract text from the scraped HTML, and clean spaces and newline characters:
> value <- cell %>%
+     html_nodes(".cell__value") %>%
+     html_text() %>%
+     lapply(function(e) gsub("\n|\\s+", "", e))
- Finally, we can set the extracted label as the names of value:
> names(value) <- label
- Next, we can access the energy and oil market index page at the following link (http://www.bloomberg.com/energy):
Figure 12: Inspecting the DOM location of crude oil and natural gas
- We can then use the web inspector to inspect the location of the table element:
Figure 13: Inspecting the DOM location of table element
- Finally, we can use html_table to extract the table element with the class of data-table (a consolidated sketch of the quote-scraping workflow follows these steps):
> energy <- html("http://www.bloomberg.com/energy")
> energy.table <- energy %>% html_node(".data-table") %>% html_table()
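For reference, here is a minimal end-to-end sketch of the quote-scraping steps above. It assumes a recent version of rvest, where read_html() supersedes the older html() function, and it assumes the Bloomberg page still exposes the .cell, .cell__label, and .cell__value classes; the page layout may have changed since this recipe was written.

library(rvest)

# Read and parse the quote page; read_html() replaces html() in newer rvest releases
spx_quote <- read_html("http://www.bloomberg.com/quote/SPX:IND")

# Pull every detail-quote cell, then the label and value text inside each
cell  <- spx_quote %>% html_nodes(".cell")
label <- cell %>% html_nodes(".cell__label") %>% html_text() %>% gsub("\n|\\s+", "", .)
value <- cell %>% html_nodes(".cell__value") %>% html_text() %>% gsub("\n|\\s+", "", .)

# Attach the labels to the values as a named character vector
quote <- setNames(value, label)
quote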
How it works…
The most difficult aspect of scraping data from a website is that web data is published and structured in many different formats. You have to fully understand how the data is structured within the HTML tags before you continue.
As HTML (Hypertext Markup Language) has a syntax similar to XML, we can use the XML package to read and parse HTML pages. However, the XML package only provides the XPath method, which has two main shortcomings, as follows:
- Inconsistent behavior in different browsers
- Expressions that are hard to read and maintain
For these reasons, we recommend using CSS selectors over XPath when parsing HTML.
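As an illustration, the two calls below select the same elements, once with an XPath expression and once with a CSS selector. This is only a sketch, and it assumes the spx_quote document parsed earlier in this recipe.

# XPath equivalent of the CSS class selector .cell: verbose and easy to get wrong
cell_xpath <- spx_quote %>%
  html_nodes(xpath = "//*[contains(concat(' ', normalize-space(@class), ' '), ' cell ')]")

# CSS selector: the same elements, far more readable
cell_css <- spx_quote %>% html_nodes(".cell")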
Python users may be familiar with how quickly they can scrape data by using the requests and BeautifulSoup packages. The rvest package is the counterpart in R, and it provides the same ability to simply and efficiently harvest data from HTML pages.
In this recipe, our target is to scrape the finance data of the S&P 500 detail quote from the Internet. We begin by installing and loading the rvest package. After installation and loading are complete, we can use the html function to read the source code of the page into spx_quote.
Once we have confirmed that we can read the HTML page, we can start parsing the detail quote from the scraped HTML. However, we first need to inspect the CSS path of the detail quote. There are many ways to inspect the CSS path of a specific element. The most popular method is to use the development tool built into each browser (press F12 or FN + F12) to inspect the CSS path. Using Google Chrome as an example, you can open the development tool by pressing F12. A DevTools window may show up somewhere in the visual area (refer to the following document page: https://developer.chrome.com/devtools/docs/dom-and-styles#inspecting-elements).
Then, you can move the mouse cursor to the upper left of the DevTools window and select the Inspect Element icon (a magnifier icon). Next, click on the target element, and the DevTools window will highlight the source code of the selected area. You can then move the mouse cursor to the highlighted area and right-click on it. From the pop-up menu, click on Copy CSS Path to extract the CSS path. Alternatively, you can examine the source code and find that the selected element is structured in HTML code with the class of cell.
One highlight of rvest is that it is designed to work with magrittr, so we can use the pipeline operator %>% to chain the output parsed at each stage. Thus, we can first obtain the output source by calling spx_quote and then pipe the output to html_nodes. As the html_nodes function uses CSS selectors to parse elements, the function takes basic selectors by type (for example, div), ID (for example, #header), and class (for example, .cell). As the elements to be extracted have the class of cell, you should place a period (.) in front of cell.
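The three selector forms can be illustrated on a tiny, made-up HTML fragment (the markup below is invented purely for demonstration):

library(rvest)

# A minimal document to demonstrate type, ID, and class selectors
doc <- read_html('<div id="header">Quote</div>
                  <div class="cell"><span class="cell__label">OPEN</span></div>')

doc %>% html_nodes("div")      # by type: every <div> element
doc %>% html_nodes("#header")  # by ID: the element with id="header"
doc %>% html_nodes(".cell")    # by class: elements whose class contains "cell"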
Lastly, we should extract both the label and the value from the previously parsed nodes. Here, we first extract the elements of class cell__label, and then use html_text to extract their text. We can then use the gsub function to clean spaces and newline characters from the parsed text. Likewise, we apply the same pipeline to extract the elements of class cell__value. As we have extracted both label and value from the detail quote, we can apply the labels as the names of the extracted values. We have now turned data from the Web into structured data.
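Because lapply returns lists, it can be convenient to flatten both results before assigning names. This is a small optional tidying step, not part of the original recipe:

# Flatten the parsed lists and attach the labels as names
label <- unlist(label)
value <- unlist(value)
names(value) <- label
value   # a named character vector, e.g. value["Open"] if the page exposes an Open field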
Alternatively, we can also use rvest to harvest tabular data. Similar to the process used to harvest the S&P 500 index, we can first access the energy and oil market index page. We can then use the web element inspector to find the location of the table element. As we have found the element located in the class of data-table, we can use the html_table function to read the table content into an R data frame.
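One point worth checking after html_table is the column types: scraped tables usually arrive as character columns, so numeric fields often need explicit conversion. Here is a hedged sketch, assuming the energy.table data frame from the step above and assuming it has a column named Price (the real column names depend on the page):

# Inspect the structure of the scraped table
str(energy.table)

# Convert a character price column to numeric (the column name Price is an assumption)
energy.table$Price <- as.numeric(gsub(",", "", energy.table$Price))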
There's more…
Instead of using the web inspector built into each browser, you can consider using SelectorGadget, a very powerful and simple-to-use extension for Google Chrome, which enables the user to extract the CSS path of the target element with only a few clicks:
- To begin using SelectorGadget, access the following link: https://chrome.google.com/webstore/detail/selectorgadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb. Then, click on the green button (shown in the red rectangle in the following screenshot) to install the plugin in Chrome:
Figure 14: Adding SelectorGadget to Chrome
- Next, click the upper-right icon to open SelectorGadget, and then select the area that needs to be scraped down. The selected area will be colored green, and the gadget will display the CSS path of the area and the number of elements matched to the path:
Figure 15: Using SelectorGadget to inspect the DOM location of the table element
- Finally, you can paste the extracted CSS path into html_nodes as an input argument to parse the data.
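For example, if SelectorGadget reports a path such as .data-table .cell (a hypothetical path used only for illustration), it can be passed straight to html_nodes:

# The CSS path below is a placeholder; use the path SelectorGadget reports for your page
nodes <- energy %>% html_nodes(".data-table .cell")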
Besides rvest, you can connect R with Selenium via the RSelenium package to scrape web pages. Selenium was originally designed as a web application testing framework that enables the user to command a web browser and automate processes through simple scripts. However, you can also use Selenium to scrape data from the Internet. The following instructions present a sample demo of how to scrape http://www.bloomberg.com/ using RSelenium:
- First, access the following link to download the Selenium standalone server (http://www.seleniumhq.org/download/):
Figure 16: Downloading the Selenium standalone server driver
- Next, start the Selenium standalone server using the following command:
$ java -jar selenium-server-standalone-2.46.0.jar
- If you can successfully launch the standalone server, you should see the following message, which means you can connect to the server that binds to port 4444:
Figure 17: Initiating the Selenium standalone server
- At this point, you can begin installing and loading RSelenium with the following commands:
> install.packages("RSelenium")
> library(RSelenium)
- After RSelenium is installed, register the driver and connect to the Selenium server:
> remDr <- remoteDriver(remoteServerAddr = "localhost"
+                       , port = 4444
+                       , browserName = "firefox"
+ )
- Examine the status of the registered driver:
> remDr$getStatus()
- Next, we open the browser and navigate to the S&P 500 quote page on http://www.bloomberg.com/:
> remDr$open()
> remDr$navigate("http://www.bloomberg.com/quote/SPX:IND")
- Finally, we can scrape the data by using the CSS selector:
> webElem <- remDr$findElements('css selector', ".cell")
> webData <- sapply(webElem, function(x){
+   label <- x$findChildElement('css selector', '.cell__label')
+   value <- x$findChildElement('css selector', '.cell__value')
+   cbind(c("label" = label$getElementText(), "value" = value$getElementText()))
+ }
+ )
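When the scrape is finished, it is good practice to inspect the result and release the browser session. A minimal sketch, assuming the webData object produced above:

# Peek at what came back from the browser
str(webData)

# Close the browser window controlled by the Selenium server
remDr$close()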