
Getting data into R by scraping the web using the rvest package

In this section, we will focus on web scraping and how to implement it using the rvest package.

Web scraping is the process of extracting unstructured data from web pages and converting it into a structured format that can be easily accessed and analyzed. We will use R to scrape data on the most popular feature films from the IMDb website.

The following steps are implemented to get data into R using the rvest package:

  1. Install the rvest package. It is not bundled with base R, so it must be installed before first use:
> install.packages('rvest')
package 'rvest' successfully unpacked and MD5 sums checked
The downloaded binary packages are in C:\Users\Radhika\AppData\Local\Temp\RtmpMvNUA5\downloaded_packages
  2. Load the installed package into the current R session:
> library(rvest)
  3. Read the IMDb search page that lists the most popular feature films released in a given year (2017 here):
> url <- 'https://www.imdb.com/search/title?count=100&release_date=2017,2017&title_type=feature'
> # Read the HTML code from the URL
> webpage <- read_html(url)
> webpage
{xml_document}
<html xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n<script type="text/ ...
[2] <body id="styleguide-v2" class="fixed">\n\n <img height="1" width="1" style="display: ...
  4. Elements of the parsed page can be selected with CSS selectors. Each film's rank sits in a span with the text-primary class, so the .text-primary selector returns the ranking nodes:
> # Use a CSS selector to scrape the rankings section
> rank_data_html <- html_nodes(webpage, '.text-primary')
> rank_data_html
{xml_nodeset (100)}
 [1] <span class="lister-item-index unbold text-primary">1.</span>
 [2] <span class="lister-item-index unbold text-primary">2.</span>
 [3] <span class="lister-item-index unbold text-primary">3.</span>
 [4] <span class="lister-item-index unbold text-primary">4.</span>
 [5] <span class="lister-item-index unbold text-primary">5.</span>
 [6] <span class="lister-item-index unbold text-primary">6.</span>
 [7] <span class="lister-item-index unbold text-primary">7.</span>
 [8] <span class="lister-item-index unbold text-primary">8.</span>
 [9] <span class="lister-item-index unbold text-primary">9.</span>
[10] <span class="lister-item-index unbold text-primary">10.</span>
[11] <span class="lister-item-index unbold text-primary">11.</span>
[12] <span class="lister-item-index unbold text-primary">12.</span>
[13] <span class="lister-item-index unbold text-primary">13.</span>
[14] <span class="lister-item-index unbold text-primary">14.</span>
[15] <span class="lister-item-index unbold text-primary">15.</span>
[16] <span class="lister-item-index unbold text-primary">16.</span>
[17] <span class="lister-item-index unbold text-primary">17.</span>
[18] <span class="lister-item-index unbold text-primary">18.</span>
[19] <span class="lister-item-index unbold text-primary">19.</span>
[20] <span class="lister-item-index unbold text-primary">20.</span>
...
  5. Extract the text of each node to get the rank of each film:
> rank_data <- html_text(rank_data_html)
> head(rank_data)
[1] "1." "2." "3." "4." "5." "6."
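
The ranks returned by html_text() are character strings such as "1.", so they usually need a small cleaning step before they can be used as structured data. The following is a minimal sketch of that step, not part of the original recipe. It parses a small inline HTML fragment instead of the live IMDb page so that it runs offline. The fragment, the film titles, and the .lister-item-header a selector are hypothetical and exist only to illustrate the idea; the markup of the real page may differ.

```r
library(rvest)

# Hypothetical HTML fragment that imitates the listing markup shown above.
# It stands in for the live IMDb page so the sketch runs without a network.
html <- '
<div class="lister-item">
  <span class="lister-item-index unbold text-primary">1.</span>
  <h3 class="lister-item-header"><a href="#">Film A</a></h3>
</div>
<div class="lister-item">
  <span class="lister-item-index unbold text-primary">2.</span>
  <h3 class="lister-item-header"><a href="#">Film B</a></h3>
</div>
<div class="lister-item">
  <span class="lister-item-index unbold text-primary">3.</span>
  <h3 class="lister-item-header"><a href="#">Film C</a></h3>
</div>'

# read_html() also accepts a literal HTML string
page <- read_html(html)

# Same selector as in the steps above: one node per rank
rank_text <- html_text(html_nodes(page, ".text-primary"))

# Drop the trailing period and convert the ranks to integers
ranks <- as.integer(sub("\\.$", "", rank_text))

# Assumed selector for the title links (matches the fragment above only)
titles <- html_text(html_nodes(page, ".lister-item-header a"))

# Combine the scraped columns into a structured data frame
films <- data.frame(rank = ranks, title = titles, stringsAsFactors = FALSE)
print(films)
```

The same cleaning line can be applied to rank_data from the last step, for example as.integer(sub("\\.$", "", rank_data)), provided every scraped string has the "number followed by a period" form seen in the output above.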

In the next section, we will focus on importing data into R from databases using the required package.
