13 Web Scraping in R

https://learn.datacamp.com/courses/web-scraping-in-r

13.1 Introduction to HTML and Web Scraping

We can use the read_html() function from the rvest library to read HTML code into R.

library(rvest)
## Loading required package: xml2
html_excerpt_raw <- '
<html> 
  <body> 
    <h1>Web scraping is cool</h1>
    <p>It involves writing code – be it R or Python.</p>
    <p><a href="https://datacamp.com">DataCamp</a> 
        has courses on it.</p>
  </body> 
</html>'
# Turn the raw excerpt into an HTML document R understands
html_excerpt <- read_html(html_excerpt_raw)
html_excerpt
## {html_document}
## <html>
## [1] <body> \n    <h1>Web scraping is cool</h1>\n    <p>It involves writing co ...

We can use the xml_structure() function to get a better overview of the tag hierarchy of the HTML excerpt.

xml_structure(html_excerpt)
## <html>
##   <body>
##     {text}
##     <h1>
##       {text}
##     {text}
##     <p>
##       {text}
##     {text}
##     <p>
##       <a [href]>
##         {text}
##       {text}
##     {text}

We can extract a specific node from an HTML document with html_node('selector'), where the selector is typically the name of the element, and then list that node's children with html_children():

# Read in the corresponding HTML string
list_html <- read_html(list_html_raw)
# Extract the ol node
ol_node <- list_html %>% html_node('ol')
# Extract and print all the children from ol_node
ol_node %>% html_children()
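
The snippet above relies on list_html_raw, which is not defined in these notes. As a minimal, self-contained sketch, the same steps can be run against a made-up ordered list; the markup and the item texts below are assumptions for illustration, not the course data:

```r
library(rvest)

# Hypothetical input: an ordered list with three items (not the course data)
list_html_raw <- '
<html>
  <body>
    <ol>
      <li>Learn HTML</li>
      <li>Learn CSS</li>
      <li>Learn R</li>
    </ol>
  </body>
</html>'

# Parse the string into an HTML document
list_html <- read_html(list_html_raw)

# html_node() returns the first element that matches the CSS selector
ol_node <- list_html %>% html_node('ol')

# html_children() returns the direct child elements (here, the li nodes)
li_nodes <- ol_node %>% html_children()

# html_text() extracts the text content of each node
item_text <- li_nodes %>% html_text()
item_text
```

In rvest 1.0.0 and later, html_node() and html_nodes() are also available under the names html_element() and html_elements(); the older names used in these notes still work.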

We can parse the HTML links into an R data frame. You’ll use tibble(), a function from the Tidyverse, for that. tibble() is basically a trimmed-down version of data.frame(), which you probably already know. Just like with data.frame(), you specify the data as pairs of column names and values, like so:

my_tibble <- tibble( column_name_1 = value_1, column_name_2 = value_2, … )
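
To make the template concrete, here is a small sketch with invented column names and values (all of them are assumptions, chosen only to show the syntax):

```r
library(tibble)

# Hypothetical values, chosen only to illustrate the name = value pairs
my_tibble <- tibble(
  name = c("Wikipedia", "Dictionary"),  # character column
  visits = c(3L, 5L),                   # integer column of the same length
  source = "bookmark"                   # a length-1 value is recycled to every row
)

my_tibble
```

Each column must have the same length, except for length-1 values, which tibble() repeats for every row.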

Example HTML:

Helpful links

- Wikipedia
- Dictionary
- Search Engine

Compiled with help from Google.

# Extract all the a nodes from the bulleted list
links <- hyperlink_html %>% 
  read_html() %>%
  html_nodes('li a') # 'ul a' is also correct!

# Extract the needed values for the data frame
domain_value <- links %>% html_attr('href')
name_value <- links %>% html_text()

# Construct a data frame (tibble() comes from the tibble package, part of the Tidyverse)
link_df <- tibble(
  domain = domain_value,
  name = name_value
)

link_df
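
Because hyperlink_html is not defined in these notes, the following sketch rebuilds a comparable page to show why the selector matters. The markup is a guess at the structure described above, and the URLs are placeholders rather than the course's actual links:

```r
library(rvest)
library(tibble)

# Hypothetical markup: three links inside a bulleted list, one link outside it.
# The example.org URLs are placeholders, not the course data.
hyperlink_html <- '
<h3>Helpful links</h3>
<ul>
  <li><a href="https://example.org/wiki">Wikipedia</a></li>
  <li><a href="https://example.org/dict">Dictionary</a></li>
  <li><a href="https://example.org/search">Search Engine</a></li>
</ul>
<small>Compiled with help from <a href="https://example.org/google">Google</a>.</small>'

page <- read_html(hyperlink_html)

# 'a' alone matches every link on the page, including the one in the footer
all_links <- page %>% html_nodes('a')

# 'li a' only matches links that sit inside a list item
list_links <- page %>% html_nodes('li a')

# Build the data frame from the list links only
link_df <- tibble(
  domain = list_links %>% html_attr('href'),
  name = list_links %>% html_text()
)

link_df
```

The descendant selector 'li a' keeps the footer link out of the result, which a bare 'a' selector would include.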

Turn a table into a data frame with html_table()

If a table has a header row (with th elements) and no gaps, scraping it is straightforward, as with the following table (having ID “clean”):

Mountain      | Height | First ascent | Country
Mount Everest | 8848   | 1953         | Nepal, China
…

Here’s the same table (having ID “dirty”) without a designated header row and with a missing cell in the first row of data:

Mountain      | Height | First ascent | Country
Mount Everest | 8848   | 1953         |
…

For such cases, html_table() has a couple of extra arguments you can use to parse the table correctly.

Both tables are contained within the mountains_html document.

# Extract the "clean" table into a data frame 
mountains <- mountains_html %>% 
  html_node("table#clean") %>% 
  html_table()

mountains
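
For the "dirty" table, one possible approach is sketched below. The mountains_html document here is a hypothetical stand-in containing only a two-row version of that table, and the choice of arguments is an assumption about what the notes have in mind: header = TRUE tells html_table() to use the first row as column names, and fill = TRUE pads short rows with NA.

```r
library(rvest)

# Hypothetical stand-in for mountains_html: the header sits in ordinary
# td cells and the Country cell of the data row is missing
mountains_html <- read_html('
<table id="dirty">
  <tr><td>Mountain</td><td>Height</td><td>First ascent</td><td>Country</td></tr>
  <tr><td>Mount Everest</td><td>8848</td><td>1953</td></tr>
</table>')

# header = TRUE: treat the first row as column names even without th elements
# fill = TRUE: needed by rvest versions before 1.0.0 to pad short rows;
#              newer versions always fill, so the argument is redundant there
dirty_mountains <- mountains_html %>%
  html_node("table#dirty") %>%
  html_table(header = TRUE, fill = TRUE)

dirty_mountains
```

The missing Country cell should come back as NA rather than causing an error.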