Introduction to HTML and Web Scraping
We can use the read_html() function from the rvest library to read HTML code into R.
library(rvest)
## Loading required package: xml2
html_excerpt_raw <- '
<html>
<body>
<h1>Web scraping is cool</h1>
<p>It involves writing code – be it R or Python.</p>
<p><a href="https://datacamp.com">DataCamp</a>
has courses on it.</p>
</body>
</html>'
# Turn the raw excerpt into an HTML document R understands
html_excerpt <- read_html(html_excerpt_raw)
html_excerpt
## {html_document}
## <html>
## [1] <body> \n <h1>Web scraping is cool</h1>\n <p>It involves writing co ...
We can use the xml_structure() function to get a better overview of the tag hierarchy of the HTML excerpt.
xml_structure(html_excerpt)
## <html>
##   <body>
##     {text}
##     <h1>
##       {text}
##     {text}
##     <p>
##       {text}
##     {text}
##     <p>
##       <a [href]>
##         {text}
##       {text}
##     {text}
We can extract a specific node from an HTML document with html_node('element name') and then list that node's children with html_children():
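The list_html_raw string itself isn't reproduced in these notes; a minimal assumed example containing an ordered list (purely hypothetical content) could be:
# Hypothetical HTML string with an ordered list (assumed, not from the original exercise)
list_html_raw <- '
<ol>
  <li>Learn HTML</li>
  <li>Learn CSS selectors</li>
  <li>Scrape responsibly</li>
</ol>'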
# Read in the corresponding HTML string
list_html <- read_html(list_html_raw)
# Extract the ol node
ol_node <- list_html %>% html_node('ol')
# Extract and print all the children from ol_node
ol_node %>% html_children()
We can parse the HTML links into an R data frame. You'll use tibble(), a function from the Tidyverse, for that. tibble() is basically a trimmed-down version of data.frame(), which you probably already know. Just like with data.frame(), we specify columns as pairs of column names and values, like so:
my_tibble <- tibble(
column_name_1 = value_1,
column_name_2 = value_2,
…
)
Example HTML (rendered): a 'Helpful links' title, followed by a bulleted list whose items link to Wikipedia, Dictionary, and Search Engine, and the note 'Compiled with help from Google.'
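The raw hyperlink_html string isn't reproduced here; an assumed version consistent with the rendered content (the href values below are placeholders, not the original URLs) could be:
# Hypothetical markup; href values are placeholders
hyperlink_html <- '
<h3>Helpful links</h3>
<ul>
  <li><a href="https://www.wikipedia.org">Wikipedia</a></li>
  <li><a href="https://www.dictionary.com">Dictionary</a></li>
  <li><a href="https://www.example.com">Search Engine</a></li>
</ul>
<small>Compiled with help from Google.</small>'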
# Extract all the a nodes from the bulleted list
links <- hyperlink_html %>%
read_html() %>%
html_nodes('li a') # 'ul a' is also correct!
# Extract the needed values for the data frame
domain_value = links %>% html_attr('href')
name_value = links %>% html_text()
# Construct a data frame
link_df <- tibble(
domain = domain_value,
name = name_value
)
link_df
Turn a table into a data frame with html_table()
If a table has a header row (with th elements) and no gaps, scraping it is straightforward, as with the following table (having ID “clean”):
Mountain        Height (m)   First ascent   Country
Mount Everest   8848         1953           Nepal, China
…
Here’s the same table (having ID “dirty”) without a designated header row and a missing cell in the first row:
Mountain        Height (m)   First ascent   Country
Mount Everest   8848         1953
…
For such cases, html_table() has a couple of extra arguments (such as header and fill) that you can use to correctly parse the table, as shown in the video.
Both tables are contained within the mountains_html document.
# Extract the "clean" table into a data frame
mountains <- mountains_html %>%
html_node("table#clean") %>%
html_table()
mountains
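For the "dirty" table, a sketch along the same lines; note that fill = TRUE is only needed on older rvest versions, as newer ones pad missing cells automatically:
# Extract the "dirty" table, promoting the first row to a header
# and padding the row with the missing cell
mountains_dirty <- mountains_html %>%
  html_node("table#dirty") %>%
  html_table(header = TRUE, fill = TRUE)
mountains_dirty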
Navigation and Selection with CSS
CSS selectors are what we use to style a website: we define styles for certain selectors, and the browser applies them to all matching HTML elements.
So, CSS can be used to style a web page. In the most basic form, this happens via type selectors, where styles are defined for and applied to all HTML elements of a certain type. In turn, you can also use type selectors to scrape pages for specific HTML elements.
As demonstrated in the video, you can also combine multiple type selectors via a comma, i.e. with html_nodes("type1, type2"). This selects all elements that have type1 or type2.
Example HTML (rendered):
Python is perfect for programming.
Still, R might be better suited for data analysis.
(And has prettier charts, too.)
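The raw languages_raw_html string isn't shown in these notes; an assumed version consistent with the selector below (which tag holds which sentence is a guess) could be:
# Hypothetical markup; the div/p assignment is an assumption
languages_raw_html <- '
<div>Python is perfect for programming.</div>
<p>Still, R might be better suited for data analysis.</p>
<p>(And has prettier charts, too.)</p>'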
# Read in the HTML
languages_html <- read_html(languages_raw_html)
# Select the div and p tags and print their text
languages_html %>%
html_nodes('div, p') %>%
html_text()
Classes and IDs
We can select elements by their class or ID instead of matching every element of a given type. For example, to find the shortest possible selector that selects the first div in structured_html, given the following HTML code:
<div id = 'first'>
<h1 class = 'big'>Joe Biden</h1>
<p class = 'first blue'>Democrat</p>
<p class = 'second blue'>Male</p>
</div>
<div id = 'second'>...</div>
<div id = 'third'>
<h1 class = 'big'>Donald Trump</h1>
<p class = 'first red'>Republican</p>
<p class = 'second red'>Male</p>
</div>
We use the following code :
# Select the first div
structured_html %>%
html_nodes('#first')
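Classes work the same way, just with a dot instead of a hash; for instance, a small sketch against the same structured_html (not part of the original exercise):
# Select all elements with class "first" (here: the first p of each div)
structured_html %>%
  html_nodes('.first')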
CSS combinators:
Select direct descendants with the child combinator
By now, you surely know how to select elements by type, class, or ID. However, there are cases where these selectors won’t work, for example, if you only want to extract direct descendants of the top ul element. For that, you will use the child combinator (>) introduced in the video.
Here, your goal is to scrape a list (contained in the languages_html document) of all mentioned computer languages, but without the accompanying information in the sub-bullets:
- SQL
- R
  - Collection
  - Analysis
  - Visualization
- Python
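The underlying markup isn't reproduced here. For the child combinator to skip the sub-bullets, the sub-list needs to sit next to (not inside) the 'R' item, i.e. as a direct child of the outer ul; an assumed sketch of such markup:
# Hypothetical markup; the sub-list is a sibling of the li items
languages_raw_html <- '
<ul id="languages">
  <li>SQL</li>
  <li>R</li>
  <ul>
    <li>Collection</li>
    <li>Analysis</li>
    <li>Visualization</li>
  </ul>
  <li>Python</li>
</ul>'
languages_html <- read_html(languages_raw_html)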
We use the following code:
# Extract only the text of the computer languages (without the sub lists)
languages_html %>%
html_nodes('ul#languages > li') %>%
html_text()
Advanced Selection with XPATH
XPath (short for XML Path Language) is a query language for locating any node in an XML or HTML document via a path expression.
Example: your goal is to extract the precipitation reading from this weather station. Unfortunately, it can't be directly referenced through an ID.
HTML code:
<div id = 'first'>
<h1 class = 'big'>Berlin Weather Station</h1>
<p class = 'first'>Temperature: 20°C</p>
<p class = 'second'>Humidity: 45%</p>
</div>
<div id = 'second'>...</div>
<div id = 'third'>
<p class = 'first'>Sunshine: 5hrs</p>
<p class = 'second'>Precipitation: 0mm</p>
</div>
# Select all p elements
weather_html %>%
html_nodes(xpath = '//p')
# Select p elements with the second class
weather_html %>%
html_nodes(xpath = '//p[@class = "second"]')
# Select p elements that are children of "#third"
weather_html %>%
html_nodes(xpath = '//*[@id = "third"]/p')
# Select p elements with class "second" that are children of "#third"
weather_html %>%
html_nodes(xpath = '//*[@id = "third"]/p[@class = "second"]')
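To finish the stated goal, the selected node just needs its text extracted (a small addition not in the original notes):
# Extract the precipitation reading as text
weather_html %>%
  html_nodes(xpath = '//*[@id = "third"]/p[@class = "second"]') %>%
  html_text()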
The position() function:
position() is a predefined XPath function that is used inside a predicate to match nodes by their position among the nodes selected by the current step.
It is very powerful when combined with comparison operators: together they let you select basically any node from those that match a certain path.
Example:
HTML code (rendered, surrounding markup omitted): two divs, each with a heading and a set of paragraphs. The first holds 'Today's rules' with the paragraphs 'Wear a mask' and 'Wash your hands'; the second holds 'Tomorrow's rules' with the paragraphs 'Wear a mask', 'Wash your hands', and 'Bring hand sanitizer with you'.
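An assumed sketch of the corresponding markup (the h2 heading tag is a guess; only the div/p structure is implied by the queries below):
# Hypothetical markup consistent with the description above
rules_raw_html <- "
<div>
  <h2>Today's rules</h2>
  <p>Wear a mask</p>
  <p>Wash your hands</p>
</div>
<div>
  <h2>Tomorrow's rules</h2>
  <p>Wear a mask</p>
  <p>Wash your hands</p>
  <p>Bring hand sanitizer with you</p>
</div>"
rules_html <- read_html(rules_raw_html)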
# Select every p except the second from every div
rules_html %>%
html_nodes(xpath = '//div/p[position() != 2]') %>%
html_text()
# Select the text of the last three nodes of the second div
rules_html %>%
html_nodes(xpath = '//div[position() = 2]/*[position() >= 2]') %>%
html_text()
The XPath count() function can be used within a predicate to narrow a selection down to those nodes that have a certain number of children. This is especially helpful if your scraper depends on some nodes having a minimum number of children.
Example:
HTML code (rendered, surrounding markup omitted): a 'Tomorrow' heading followed by one div per city; each div holds a heading with the city name and paragraphs with the readings: Berlin (Temperature: 20°C, Humidity: 50%) and Zurich (Temperature: 22°C, Humidity: 60%).
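An assumed sketch of the corresponding markup (the h1/h2 tags are guesses; the h2-inside-div structure is implied by the count(h2) predicate below):
# Hypothetical markup; the real document presumably contains further divs
# that do not match the filter below
forecast_raw_html <- '
<h1>Tomorrow</h1>
<div>
  <h2>Berlin</h2>
  <p>Temperature: 20°C</p>
  <p>Humidity: 50%</p>
</div>
<div>
  <h2>Zurich</h2>
  <p>Temperature: 22°C</p>
  <p>Humidity: 60%</p>
</div>'
forecast_html <- read_html(forecast_raw_html)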
# Select only divs with one header and at least two paragraphs
forecast_html %>%
html_nodes(xpath = '//div[count(h2) = 1 and count(p) > 1]')
XPATH text() function:
Sometimes, you only want to select text that’s a direct descendant of a parent element. In the following example table, however, the name of the role itself is wrapped in an em tag. But its function, e.g. “Voice”, is also contained in the same td element as the em part, which is not optimal for querying the data.
Here’s an excerpt from the HTML code:
<table>
  <tr>
    <th>Actor</th>
    <th>Role</th>
  </tr>
  <tr>
    <td class = "actor">Jayden Carpenter</td>
    <td class = "role"><em>Mickey Mouse</em> (Voice)</td>
  </tr>
  …
</table>
In this exercise, you will try and scrape the table using a known rvest function. By doing so, you will recognize its limits.
The roles_html variable contains the document with the table.
# Extract the data frame from the table using a known function from rvest
roles <- roles_html %>%
html_node(xpath = "//table") %>%
html_table()
# Print the contents of the role data frame
roles
# Extract the actors in the cells having class "actor"
actors <- roles_html %>%
html_nodes(xpath = '//table//td[@class = "actor"]') %>%
html_text()
actors
# Extract the roles in the cells having class "role"
roles <- roles_html %>%
html_nodes(xpath = '//table//td[@class = "role"]/em') %>%
html_text()
roles
# Extract the functions using the appropriate XPATH function
functions <- roles_html %>%
html_nodes(xpath = '//table//td[@class = "role"]/text()') %>%
html_text(trim = TRUE)
functions
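To tie the three vectors together, they could be combined into a data frame with tibble(), just like earlier in these notes (a sketch, not part of the original exercise):
# Combine actors, role names, and role functions into one data frame
cast_df <- tibble(
  actor = actors,
  role = roles,
  detail = functions
)
cast_df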
Scraping Best Practices
Here’s some rvest code that I used to find out the elevation of a beautiful place where I recently spent my vacation.
# Get the HTML document from Wikipedia
wikipedia_page <- read_html('https://en.wikipedia.org/wiki/Varigotti')
# Parse the document and extract the elevation from it
wikipedia_page %>%
  html_nodes('table tr:nth-child(9) > td') %>%
  html_text()
As you have learned in the video, read_html() actually issues an HTTP GET request if provided with a URL, like in this case.
The goal of this exercise is to replicate the same query without read_html(), but with httr methods instead.
Note: Usually rvest does the job, but if you want to customize requests like you’ll be shown later in this chapter, you’ll need to know the httr way.
For a little repetition, you’ll also translate the CSS selector used in html_nodes() into an XPATH query.
library(httr)
# Get the HTML document from Wikipedia using httr
wikipedia_response <- GET('https://en.wikipedia.org/wiki/Varigotti')
# Parse the response into an HTML doc
wikipedia_page <- content(wikipedia_response)
# Check the status code of the response
status_code(wikipedia_response)
# Extract the altitude with XPATH
wikipedia_page %>%
html_nodes(xpath = '//table//tr[position() = 9]/td') %>%
html_text()
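As a nod to the request customization mentioned above, a scraper can also identify itself politely via a custom user agent; a minimal httr sketch (not part of the original exercise, the contact string is a placeholder):
# Send the same GET request, but identify the scraper with a custom user agent
wikipedia_response <- GET(
  'https://en.wikipedia.org/wiki/Varigotti',
  user_agent('Personal learning scraper; contact: you@example.com')
)
status_code(wikipedia_response)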