12 Web Scraping in R
https://learn.datacamp.com/courses/web-scraping-in-r
12.1 Introduction to HTML and Web Scraping
Read in HTML
A necessary package to read HTML is rvest:
library(rvest)
library(tidyverse)
library(httr)
Take the html_excerpt_raw variable and turn it into an HTML document that R understands using a function from the rvest package:
html_excerpt_raw <- '
<html>
<body>
<h1>Web scraping is cool</h1>
<p>It involves writing code – be it R or Python.</p>
<p><a href="https://datacamp.com">DataCamp</a>
has courses on it.</p>
</body>
</html>'
# Turn the raw excerpt into an HTML document R understands
html_excerpt <- read_html(html_excerpt_raw)
html_excerpt
## {html_document}
## <html>
## [1] <body> \n <h1>Web scraping is cool</h1>\n <p>It involves writing co ...
Use the xml_structure() function to get a better overview of the tag hierarchy of the HTML excerpt:
xml_structure(html_excerpt)
## <html>
##   <body>
##     {text}
##     <h1>
##       {text}
##     {text}
##     <p>
##       {text}
##     {text}
##     <p>
##       <a [href]>
##         {text}
##       {text}
##     {text}
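The same hierarchy can also be walked programmatically instead of being read off the printout. The following is a minimal sketch, assuming an excerpt equivalent to the one above; the names doc and body_children are introduced here only for illustration. It combines rvest's html_children() and html_name() to list the element children of body.

```r
library(rvest)

# Illustrative copy of the excerpt; `doc` is a name chosen for this sketch
doc <- read_html('
<html>
  <body>
    <h1>Web scraping is cool</h1>
    <p>It involves writing code.</p>
    <p><a href="https://datacamp.com">DataCamp</a> has courses on it.</p>
  </body>
</html>')

# Tag names of the direct element children of <body>
# (whitespace-only text nodes are not elements and are skipped)
body_children <- doc %>%
  html_node("body") %>%
  html_children() %>%
  html_name()
body_children
```

Unlike xml_structure(), which prints the tree, this returns a character vector that can be used in further code.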
read_html(url): scrapes HTML content from a given URL
html_nodes(): identifies HTML wrappers
html_nodes(".class"): selects nodes by CSS class
html_nodes("#id"): selects nodes by id
html_nodes(xpath = "xpath"): selects nodes by XPATH (we’ll cover this later)
html_table(): turns HTML tables into data frames
html_text(): strips the HTML tags and extracts only the text
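The functions listed above can be tried out together on a small inline page. This is a sketch under stated assumptions: the page content and the variable names (page, title_text, note_text, note_xpath, stations) are made up for illustration and do not come from the course.

```r
library(rvest)

# Hypothetical page used only for this sketch
page <- read_html('
<html><body>
  <h1 id="title">Stations</h1>
  <p class="note">Updated daily</p>
  <table>
    <tr><th>City</th><th>Temp</th></tr>
    <tr><td>Berlin</td><td>20</td></tr>
    <tr><td>Bern</td><td>18</td></tr>
  </table>
</body></html>')

# Select by id, by class, and by XPATH, then keep only the text
title_text <- page %>% html_nodes("#title") %>% html_text()
note_text  <- page %>% html_nodes(".note") %>% html_text()
note_xpath <- page %>% html_nodes(xpath = '//p[@class = "note"]') %>% html_text()

# Turn the single table into a data frame
stations <- page %>% html_node("table") %>% html_table()
stations
```

Here html_node() (singular) is used for the table so that html_table() returns one data frame rather than a list of them.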
12.3 Advanced Selection with XPATH
Select by class and ID with XPATH
weather_html <- "
<html>
<body>
<div id = 'first'>
<h1 class = 'big'>Berlin Weather Station</h1>
<p class = 'first'>Temperature: 20°C</p>
<p class = 'second'>Humidity: 45%</p>
</div>
<div id = 'second'>...</div>
<div id = 'third'>
<p class = 'first'>Sunshine: 5hrs</p>
<p class = 'second'>Precipitation: 0mm</p>
</div>
</body>
</html>"
weather_html <- read_html(weather_html)
Start by selecting all p tags in the above HTML using XPATH.
# Select all p elements
weather_html %>%
  html_nodes(xpath = '//p')
## {xml_nodeset (4)}
## [1] <p class="first">Temperature: 20°C</p>
## [2] <p class="second">Humidity: 45%</p>
## [3] <p class="first">Sunshine: 5hrs</p>
## [4] <p class="second">Precipitation: 0mm</p>
Now select only the p elements with class second.
The corresponding CSS selector would be .second, so here you need to use a [@class = ...] predicate applied to all p tags.
# Select p elements with the second class
weather_html %>%
  html_nodes(xpath = '//p[@class = "second"]')
## {xml_nodeset (2)}
## [1] <p class="second">Humidity: 45%</p>
## [2] <p class="second">Precipitation: 0mm</p>
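One caveat is worth a short sketch: the predicate [@class = "second"] compares the whole attribute string, whereas the CSS selector .second matches any element that has second among its classes. The markup below is hypothetical (the extra highlight class is not in the course material), and the contains(concat(...)) expression is a common XPATH 1.0 idiom for token matching rather than anything specific to rvest.

```r
library(rvest)

# Hypothetical markup: the second paragraph carries two classes
doc <- read_html('
<html><body>
  <p class="second">Humidity: 45%</p>
  <p class="second highlight">Precipitation: 0mm</p>
</body></html>')

# Exact string comparison: only class="second" matches
exact <- doc %>%
  html_nodes(xpath = '//p[@class = "second"]') %>%
  html_text()

# Token match: pad the class list with spaces and look for " second "
token <- doc %>%
  html_nodes(xpath = '//p[contains(concat(" ", normalize-space(@class), " "), " second ")]') %>%
  html_text()

# The CSS selector behaves like the token match
css <- doc %>% html_nodes("p.second") %>% html_text()

length(exact)
length(token)
```

When the pages you scrape assign several classes to one element, the token form or the CSS selector is usually the safer choice.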
Now select all p elements that are children of the element with ID third.
The corresponding CSS selector would be #third > p. Don’t forget the universal selector (*) before the @id = ... predicate, and remember that children are selected with a /, not a //.
# Select p elements that are children of "#third"
weather_html %>%
  html_nodes(xpath = '//*[@id = "third"]/p')
## {xml_nodeset (2)}
## [1] <p class="first">Sunshine: 5hrs</p>
## [2] <p class="second">Precipitation: 0mm</p>
Now select only the p element with class second that is a direct child of #third, again using XPATH.
Here, you need to append the @class predicate from the second step to the XPATH from the previous step.
# Select p elements with class "second" that are children of "#third"
weather_html %>%
  html_nodes(xpath = '//*[@id = "third"]/p[@class = "second"]')
## {xml_nodeset (1)}
## [1] <p class="second">Precipitation: 0mm</p>
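The difference between / (children) and // (descendants at any depth) does not show up in the weather example, because every p there is a direct child of its div. A minimal sketch with hypothetical markup, in which one paragraph is nested one level deeper, makes it visible; the variable names are assumptions made for this illustration.

```r
library(rvest)

# Hypothetical markup: one p is a direct child of #third, one is nested deeper
doc <- read_html("
<html><body>
  <div id = 'third'>
    <p>Sunshine: 5hrs</p>
    <div><p>Precipitation: 0mm</p></div>
  </div>
</body></html>")

# Single slash: only direct children of #third
children <- doc %>%
  html_nodes(xpath = '//*[@id = \"third\"]/p') %>%
  html_text()

# Double slash: all descendants of #third, at any depth
descendants <- doc %>%
  html_nodes(xpath = '//*[@id = \"third\"]//p') %>%
  html_text()

# CSS equivalents: '#third > p' (child) and '#third p' (descendant)
children_css <- doc %>% html_nodes("#third > p") %>% html_text()

children
descendants
```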
Use predicates to select nodes based on their children
Here’s almost the same HTML as before. In addition, the third div has a p child with a third class.
weather_html_2 <- "<html>
<body>
<div id = 'first'>
<h1 class = 'big'>Berlin Weather Station</h1>
<p class = 'first'>Temperature: 20°C</p>
<p class = 'second'>Humidity: 45%</p>
</div>
<div id = 'second'>...</div>
<div id = 'third'>
<p class = 'first'>Sunshine: 5hrs</p>
<p class = 'second'>Precipitation: 0mm</p>
<p class = 'third'>Snowfall: 0mm</p>
</div>
</body>
</html>"
weather_html_2 <- read_html(weather_html_2)
XPATH can do something that the CSS selectors used so far cannot: select elements based on the properties of their descendants. Predicates make this possible.
Using XPATH, select all the div elements.
# Select all divs
weather_html_2 %>%
  html_nodes(xpath = '//div')
## {xml_nodeset (3)}
## [1] <div id="first">\n <h1 class="big">Berlin Weather Station</h1>\n ...
## [2] <div id="second">...</div>
## [3] <div id="third">\n <p class="first">Sunshine: 5hrs</p>\n <p cla ...
Now select all divs with p descendants using the predicate notation.
# Select all divs with p descendants
weather_html_2 %>%
  html_nodes(xpath = '//div[p]')
## {xml_nodeset (2)}
## [1] <div id="first">\n <h1 class="big">Berlin Weather Station</h1>\n ...
## [2] <div id="third">\n <p class="first">Sunshine: 5hrs</p>\n <p cla ...
Now select divs with p descendants which have the third class.
# Select all divs with p descendants having the "third" class
weather_html_2 %>%
  html_nodes(xpath = '//div[p[@class = "third"]]')
## {xml_nodeset (1)}
## [1] <div id="third">\n <p class="first">Sunshine: 5hrs</p>\n <p cla ...
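Strictly speaking, a predicate such as [p] tests for p children, not for p descendants at any depth; in the weather example the two coincide. The sketch below, on hypothetical markup with assumed variable names, contrasts [p] with [.//p], which does look at all descendants.

```r
library(rvest)

# Hypothetical markup: div 'b' contains a p only inside a blockquote
doc <- read_html("
<html><body>
  <div id = 'a'><p>Direct child</p></div>
  <div id = 'b'><blockquote><p>Nested deeper</p></blockquote></div>
  <div id = 'c'>No paragraph</div>
</body></html>")

# [p]: divs that have a p as a direct child
with_p_child <- doc %>%
  html_nodes(xpath = '//div[p]') %>%
  html_attr("id")

# [.//p]: divs that have a p anywhere below them
with_p_descendant <- doc %>%
  html_nodes(xpath = '//div[.//p]') %>%
  html_attr("id")

with_p_child
with_p_descendant
```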
Get to know the position() function
The position() function is very powerful when used within a predicate. Combined with comparison operators, it lets you select practically any node among those that match a certain path.
You’ll try this out with the following HTML excerpt, which is available to you via rules_html. Let’s assume this is a continuously updated website that displays certain Coronavirus rules for a given day and the day after.
rules_html <- "<html>
<div>
<h2>Today's rules</h2>
<p>Wear a mask</p>
<p>Wash your hands</p>
</div>
<div>
<h2>Tomorrow's rules</h2>
<p>Wear a mask</p>
<p>Wash your hands</p>
<small>Bring hand sanitizer with you</small>
</div>
</html>"
rules_html <- read_html(rules_html)
Extract the text of the second p in every div using XPATH.
# Select the text of the second p in every div
rules_html %>%
  html_nodes(xpath = '//div/p[position() = 2]') %>%
  html_text()
## [1] "Wash your hands" "Wash your hands"
Now extract the text of every p (except the second) in every div.
# Select every p except the second from every div
rules_html %>%
  html_nodes(xpath = '//div/p[position() != 2]') %>%
  html_text()
## [1] "Wear a mask" "Wear a mask"
Extract the text of the last three children of the second div.
Only use the >= operator for selecting these nodes.
# Select the text of the last three nodes of the second div
rules_html %>%
  html_nodes(xpath = '//div[position() = 2]/*[position() >= 2]') %>%
  html_text()
## [1] "Wear a mask" "Wash your hands"
## [3] "Bring hand sanitizer with you"
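Two details of position() are easy to miss: it counts only the nodes matched by the step it is attached to (p[position() = 2] is the second p, while *[position() = 2] is the second element child of any tag), and it can be compared against last(), the XPATH 1.0 function that returns the size of the current node set. A minimal sketch, assuming a copy of the rules excerpt stored under the illustrative name rules:

```r
library(rvest)

# Illustrative copy of the rules excerpt; `rules` is a name chosen for this sketch
rules <- read_html("
<html><body>
  <div>
    <h2>Today's rules</h2>
    <p>Wear a mask</p>
    <p>Wash your hands</p>
  </div>
  <div>
    <h2>Tomorrow's rules</h2>
    <p>Wear a mask</p>
    <p>Wash your hands</p>
    <small>Bring hand sanitizer with you</small>
  </div>
</body></html>")

# Second element child of each div, whatever its tag (here: the first p)
second_child <- rules %>%
  html_nodes(xpath = '//div/*[position() = 2]') %>%
  html_text()

# Last element child of each div
last_child <- rules %>%
  html_nodes(xpath = '//div/*[position() = last()]') %>%
  html_text()

second_child
last_child
```

Using last() avoids hard-coding the number of children, which helps when the page is updated and a div gains or loses an element.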
Extract nodes based on the number of their children
The XPATH count() function can be used within a predicate to narrow down a selection to those nodes that have a certain number of children. This is especially helpful if your scraper depends on some nodes having a minimum number of children.
You’re only interested in divs that have exactly one h2 header and at least two paragraphs.
Select the desired divs with the appropriate XPATH selector, making use of the count() function.
# Select only divs with one header and at least two paragraphs
rules_html %>%
  html_nodes(xpath = '//div[count(h2) = 1 and count(p) > 1]')
## {xml_nodeset (2)}
## [1] <div>\n <h2>Today's rules</h2>\n <p>Wear a mask</p>\n <p>Wash your han ...
## [2] <div>\n <h2>Tomorrow's rules</h2>\n <p>Wear a mask</p>\n <p>Wash your ...
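In the rules excerpt both divs satisfy the condition, so the filter has no visible effect. The sketch below uses hypothetical markup in which only one of three divs qualifies; the content and the variable names are assumptions made for illustration.

```r
library(rvest)

# Hypothetical markup: only the first div has one h2 and at least two p
doc <- read_html("
<html><body>
  <div><h2>Today</h2><p>Wear a mask</p><p>Wash your hands</p></div>
  <div><h2>Tomorrow</h2><p>Wear a mask</p></div>
  <div><p>No header</p><p>Two paragraphs</p><p>Three paragraphs</p></div>
</body></html>")

# Same predicate as above, written with >= 2 instead of > 1
matched <- doc %>%
  html_nodes(xpath = '//div[count(h2) = 1 and count(p) >= 2]')

# Header text of the divs that passed the filter
matched_headers <- matched %>% html_node("h2") %>% html_text()
matched_headers
```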
Select directly from a parent element with XPATH’s text()
The goal is to extract the function information in parentheses into its own column, so you are required to extract a data frame with not two, but three columns: actors, roles, and functions.
roles_html <- "<html>
<table>
<tr>
<th>Actor</th>
<th>Role</th>
</tr>
<tr>
<td class = 'actor'>Jayden Carpenter</td>
<td class = 'role'><em>Mickey Mouse</em> (Voice)</td>
</tr>
</table>
</html>"
roles_html <- read_html(roles_html)
Extract the actors and roles from the table using XPATH.
# Extract the actors in the cells having class "actor"
actors <- roles_html %>%
  html_nodes(xpath = '//table//td[@class = "actor"]') %>%
  html_text()
actors
## [1] "Jayden Carpenter"
# Extract the roles in the cells having class "role"
roles <- roles_html %>%
  html_nodes(xpath = '//table//td[@class = "role"]/em') %>%
  html_text()
roles
## [1] "Mickey Mouse"
Then, extract the function using the XPATH text() function.
Extract only the text with the parentheses, which is contained within the same cell as the corresponding role, and trim leading spaces.
# Extract the functions using the appropriate XPATH function
functions <- roles_html %>%
  html_nodes(xpath = '//table//td[@class = "role"]/text()') %>%
  html_text(trim = TRUE)
functions
## [1] "(Voice)"
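If the parentheses themselves are not wanted in the final column, they can be stripped after extraction. This is a sketch under stated assumptions: the second table row and the names roles_doc, functions_raw and functions_clean are hypothetical, and base R's gsub() is one of several ways to remove the brackets.

```r
library(rvest)

# Hypothetical two-row version of the roles table
roles_doc <- read_html("
<html><body><table>
  <tr><td class = 'role'><em>Mickey Mouse</em> (Voice)</td></tr>
  <tr><td class = 'role'><em>Goofy</em> (Voice, Singing)</td></tr>
</table></body></html>")

# text() selects only the text nodes that are direct children of the cell,
# so the content of the <em> child is left out
functions_raw <- roles_doc %>%
  html_nodes(xpath = '//td[@class = \"role\"]/text()') %>%
  html_text(trim = TRUE)

# Drop one leading '(' and one trailing ')'
functions_clean <- gsub("^\\(|\\)$", "", functions_raw)
functions_clean
```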
Combine extracted data into a data frame
Combine the three vectors actors, roles, and functions into a data frame called cast (with columns Actor, Role and Function, respectively).
# Create a new data frame from the extracted vectors
cast <- tibble(
  Actor = actors,
  Role = roles,
  Function = functions)
cast
## # A tibble: 1 x 3
## Actor Role Function
## <chr> <chr> <chr>
## 1 Jayden Carpenter Mickey Mouse (Voice)
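Combining independently extracted vectors works only as long as every row contributes exactly one value to each of them. A more defensive pattern, sketched here with a hypothetical second row that lacks an em tag, is to select the rows first and then call html_node() (singular) relative to each row: it returns one result per row and yields NA where nothing matches, so the columns stay aligned. The names roles_doc, rows and cast_rows are assumptions made for this illustration.

```r
library(rvest)
library(tibble)

# Hypothetical table: the second row has no <em> inside the role cell
roles_doc <- read_html("
<html><body><table>
  <tr><th>Actor</th><th>Role</th></tr>
  <tr>
    <td class = 'actor'>Jayden Carpenter</td>
    <td class = 'role'><em>Mickey Mouse</em> (Voice)</td>
  </tr>
  <tr>
    <td class = 'actor'>Sam Example</td>
    <td class = 'role'>Narrator</td>
  </tr>
</table></body></html>")

# One node per data row (rows that contain td cells)
rows <- roles_doc %>% html_nodes(xpath = '//tr[td]')

# html_node() returns exactly one result per row, NA where there is no match
cast_rows <- tibble(
  Actor = rows %>% html_node(xpath = './td[@class = \"actor\"]') %>% html_text(),
  Role  = rows %>% html_node(xpath = './td[@class = \"role\"]/em') %>% html_text()
)
cast_rows
```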
12.4 Scraping Best Practices
httr
read_html() actually issues an HTTP GET request if provided with a URL.
The goal of this exercise is to replicate the same query without read_html(), but with httr methods instead.
Use only httr functions to replicate the behavior of read_html(), including getting the response from Wikipedia and parsing the response object into an HTML document.
Check the resulting HTTP status code with the appropriate httr function.
# Get the HTML document from Wikipedia using httr
wikipedia_response <- GET('https://en.wikipedia.org/wiki/Varigotti')
# Parse the response into an HTML doc
wikipedia_page <- content(wikipedia_response)
# Check the status code of the response
status_code(wikipedia_response)
## [1] 200
Status codes are a fundamental part of HTTP: they tell you whether everything is okay (200) or whether there is a problem with your request (for example 404, not found).
It is good practice to always check the status code of a response before you start working with the downloaded page. For this, you can use the status_code() function from the httr package.
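Checking the status code can be folded into a small helper so that a failed request stops the script instead of silently producing an error page. The sketch below is not part of the course: fetch_html is a hypothetical helper name, and it relies on httr's http_error() and stop_for_status(). Because http_error() also accepts a bare integer status code, the check itself can be tried without a network connection.

```r
library(httr)

# http_error() accepts an integer status code, so this runs offline
ok_is_error        <- http_error(200L)
not_found_is_error <- http_error(404L)

# Hypothetical helper (not an httr function): fetch a page, stop on HTTP errors
fetch_html <- function(url) {
  response <- GET(url)
  # Turns 4xx and 5xx responses into an R error
  stop_for_status(response)
  # For HTML responses, content() parses the body into a document
  content(response)
}

# Usage (requires network access):
# wikipedia_page <- fetch_html("https://en.wikipedia.org/wiki/Varigotti")
```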
Add a custom user agent
There are two ways of customizing your user agent when using httr for fetching web resources:
Locally, i.e. as an argument to the current request method.
Globally via set_config().
Send a GET request to https://httpbin.org/user-agent with a custom user agent that says "A request from a DataCamp course on scraping" and print the response.
In this step, set the user agent locally.
# Pass a custom user agent to a GET query to the mentioned URL
response <- GET('https://httpbin.org/user-agent', user_agent("A request from a DataCamp course on scraping"))
# Print the response content
content(response)
## $`user-agent`
## [1] "A request from a DataCamp course on scraping"
Now, make a custom user agent ("A request from a Alec at LU") globally available across all future requests with set_config().
# Globally set the user agent to "A request from a Alec at LU"
set_config(add_headers(`User-Agent` = "A request from a Alec at LU"))
# Send a GET request without a local user agent; the global setting applies
response <- GET('https://httpbin.org/user-agent')
# Print the response content
content(response)
## $`user-agent`
## [1] "A request from a Alec at LU"
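A global setting stays active for the rest of the session, which is easy to forget. The sketch below shows two httr functions that help with that, with_config() for a setting scoped to one expression and reset_config() for undoing earlier set_config() calls. The helper name get_with_ua is hypothetical, and the network calls are left commented out so the block runs offline.

```r
library(httr)

# A reusable user agent object
ua <- user_agent("A request from a DataCamp course on scraping")

# Hypothetical helper: the user agent applies to this one request only
get_with_ua <- function(url) {
  GET(url, ua)
}

# Scoped alternative: the setting applies to every request inside the expression
# with_config(ua, GET("https://httpbin.org/user-agent"))

# Remove any configuration set earlier with set_config()
reset_config()

# Usage (requires network access):
# content(get_with_ua("https://httpbin.org/user-agent"))
```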
Apply throttling to a multi-page crawler
You’ll find the name of the peak within an element with the ID "firstHeading", while the coordinates are inside an element with class "geo-dms", which is a descendant of an element with ID "coordinates".
Construct a read_html() function that executes with a delay of half a second when executed in a loop.
mountain_wiki_pages <- c(
  "https://en.wikipedia.org/w/index.php?title=Mount_Everest&oldid=958643874",
  "https://en.wikipedia.org/w/index.php?title=K2&oldid=956671989",
  "https://en.wikipedia.org/w/index.php?title=Kangchenjunga&oldid=957008408")
# Define a throttled read_html() function with a delay of 0.5s
read_html_delayed <- slowly(read_html,
                            rate = rate_delay(0.5))
Now write a for loop that goes over every page URL in the prepared variable mountain_wiki_pages and stores the HTML available at the corresponding Wikipedia URL into the html variable.
# Construct a loop that goes over all page urls
for(page_url in mountain_wiki_pages){
  # Read in the html of each URL with a delay of 0.5s
  html <- read_html_delayed(page_url)
}
Finally, inside the loop, extract the name of the peak as well as its coordinates using the CSS selectors given above and store them in peak and coords.
  # Extract the name of the peak and its coordinates (these lines belong inside the loop body)
  peak <- html %>%
    html_node("#firstHeading") %>% html_text()
  coords <- html %>%
    html_node("#coordinates .geo-dms") %>% html_text()
  print(paste(peak, coords, sep = ": "))
Merge all the code chunks above to make it functional:
# Define a throttled read_html() function with a delay of 0.5s
read_html_delayed <- slowly(read_html,
                            rate = rate_delay(0.5))
# Construct a loop that goes over all page urls
for(page_url in mountain_wiki_pages){
  # Read in the html of each URL with a delay of 0.5s
  html <- read_html_delayed(page_url)
  # Extract the name of the peak and its coordinates
  peak <- html %>%
    html_node("#firstHeading") %>% html_text()
  coords <- html %>%
    html_node("#coordinates .geo-dms") %>% html_text()
  print(paste(peak, coords, sep = ": "))
}
## [1] "Mount Everest: 27°59′17″N 86°55′31″E"
## [1] "K2: 35°52′57″N 76°30′48″E"
## [1] "Kangchenjunga: 27°42′09″N 88°08′48″E"
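The throttling itself can be observed without touching the network by wrapping a stand-in function instead of read_html(). The following is a sketch under stated assumptions: fake_fetch is a hypothetical placeholder that only records the call time, the 0.2 second pause is chosen to keep the example short, and map() is used in place of the for loop so that every result is kept rather than overwritten on each iteration.

```r
library(purrr)

# Hypothetical stand-in for read_html(): records when it was called
fake_fetch <- function(url) {
  list(url = url, fetched_at = Sys.time())
}

# Throttled version: waits between consecutive calls
fake_fetch_delayed <- slowly(fake_fetch, rate = rate_delay(0.2))

urls <- c("page-1", "page-2", "page-3")

started <- Sys.time()
# map() keeps one result per URL in a list
results <- map(urls, fake_fetch_delayed)
elapsed <- as.numeric(difftime(Sys.time(), started, units = "secs"))

length(results)
elapsed
```

With three calls, the elapsed time should come out at roughly two pauses or more, since the delay is applied between consecutive calls.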