5 Intro to Data Visualization with ggplot2

https://learn.datacamp.com/courses/introduction-to-data-visualization-with-ggplot2

5.1 Introduction

Data viz: i) a core skill in data science ii) the intersection between design and statistics

To treat a numeric variable as categorical, we use the factor() function:

library(tidyverse)
## ── Attaching packages ───────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.0.3     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ──────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(ggplot2)

# Wrap cyl in factor() so that it is treated as a categorical variable
ggplot(mtcars, aes(factor(cyl), mpg)) +
  geom_point()
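
As a minimal sketch of what factor() changes, using only base R and the built-in mtcars data: the cyl column is stored as a number, and wrapping it in factor() turns its distinct values into discrete levels. The name cyl_factor is only illustrative.

```r
# cyl is stored as a number in mtcars
class(mtcars$cyl)

# factor() converts it to a categorical variable with one level per distinct value
cyl_factor <- factor(mtcars$cyl)
class(cyl_factor)
levels(cyl_factor)
```

ggplot2 places a factor on a discrete axis, so only the levels that occur in the data get a position.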

The grammar of graphics :

There are three essential grammatical elements:

  1. Data (the data set being plotted)
  2. Aesthetics (the scales onto which we map our data)
  3. Geometries (the visual elements used for our data)

Optional layers:

  1. Themes (all non-data ink)
  2. Statistics (representations of our data to aid understanding)
  3. Coordinates (the space on which data will be plotted)
  4. Facets (plotting small multiples)

ggplot(diamonds, aes(carat, price)) +
  geom_point() +
  geom_smooth() 
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Alpha:

geom_point() has an alpha argument that controls the opacity of the points. A value of 1 (the default) means that the points are totally opaque; a value of 0 means the points are totally transparent (and therefore invisible). Values in between specify transparency.

ggplot(diamonds, aes(carat, price, color = clarity)) +
  geom_point(alpha = 0.4) +
  geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

5.2 Aesthetics

Typical visible aesthetics:

  1. x = X axis position
  2. y = Y axis position
  3. fill = Fill color
  4. size = Area or radius of points, thickness of lines
  5. alpha = Transparency
  6. linetype = line dash pattern
  7. label = Text on a plot or axes
  8. shape = Shape

Aesthetic vs attributes:

# A hexadecimal color
my_blue <- "#4ABEFF"

# Change the color mapping to a fill mapping (AESTHETICS)
ggplot(mtcars, aes(wt, mpg, fill = hp)) +
  
  # Set point size and shape (ATTRIBUTES)
  geom_point(color = my_blue, size = 10, shape = 1)

To add text, we use geom_text(). To label points with the row names of a data set, we set its label argument to rownames() of that data set.

ggplot(mtcars, aes(wt, mpg, color = factor(cyl))) +
  # Add text layer with label rownames(mtcars) and color red
  geom_text(label = rownames(mtcars), color = 'red')

Modifying aesthetics:

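  1. identity = leave the data where it is; no position adjustment is applied.

  2. jitter = add a small amount of random noise to each position. Using the position_jitter() function lets you set arguments for the position and keep them consistent across plots and layers.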
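
The difference between the two positions can be sketched as follows. This is an illustration on mtcars rather than the course's own exercise; the object names and the width and seed values are arbitrary choices.

```r
library(ggplot2)

# Define the jitter once; the seed makes the random offsets reproducible
posn_j <- position_jitter(width = 0.2, seed = 136)

# Reuse the same position object so every layer and plot jitters identically
p_jitter <- ggplot(mtcars, aes(factor(cyl), mpg)) +
  geom_point(position = posn_j, alpha = 0.6)

# position = "identity" (the default for points) leaves values where they are
p_identity <- ggplot(mtcars, aes(factor(cyl), mpg)) +
  geom_point(position = "identity", alpha = 0.6)

p_jitter
```

Storing the position in a variable is what keeps the jitter consistent: any other layer or plot that receives posn_j uses the same width and seed.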

We use labs() to set the x- and y-axis labels. It takes strings for each argument.

scale_color_manual() defines properties of the color scale (i.e. axis). The first argument sets the legend title. values is a named vector of colors to use.
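
A small sketch of labs() and scale_color_manual() together, assuming mtcars with cyl converted to a factor. The color values and the names cyl_colors and p_manual are illustrative choices, not part of the course material.

```r
library(ggplot2)

# Named vector: the names must match the levels of the mapped variable
cyl_colors <- c("4" = "#1B9E77", "6" = "#D95F02", "8" = "#7570B3")

p_manual <- ggplot(mtcars, aes(wt, mpg, color = factor(cyl))) +
  geom_point(size = 3) +
  # Axis labels are plain strings
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon") +
  # First argument is the legend title; values supplies the colors
  scale_color_manual("Cylinders", values = cyl_colors)

p_manual
```

Because the vector is named, the colors stay attached to their levels even if the level order changes.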

To implement a custom fill color scale, we use scale_fill_manual().

palette <- c(automatic = "#377EB8", manual = "#E41A1C")

# Convert cyl and am to factors so the named palette matches the fill levels
ggplot(mtcars, aes(factor(cyl), fill = factor(am, labels = c("automatic", "manual")))) +
  geom_bar(position = "dodge") +
  labs(x = "Number of Cylinders", y = "Count") +
  scale_fill_manual("Transmission", values = palette)

You can make univariate plots in ggplot2, but you will need to add a fake y axis by mapping y to zero.

When setting y-axis limits, you can specify the limits as separate arguments or as a single numeric vector. That is, ylim(lo, hi) or ylim(c(lo, hi)).

ggplot(mtcars, aes(mpg, 0)) +
  geom_jitter() +
 ylim(-2,2)

Typically, the dependent variable is mapped onto the y-axis and the independent variable is mapped onto the x-axis.

5.3 Geometries

We should be aware of overplotting, which can arise in several ways, for example when values are aligned on a single axis.

Overplotting 2: Aligned values

This occurs when one axis is continuous and the other is categorical, which can be overcome with some form of jittering.

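# cyl and am are converted to factors so the points can be dodged by transmission type
plt_mpg_vs_cyl_by_am <- ggplot(mtcars, aes(factor(cyl), mpg, color = factor(am)))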

# Default points are shown for comparison
plt_mpg_vs_cyl_by_am + geom_point()

# Now jitter and dodge the point positions
plt_mpg_vs_cyl_by_am + geom_point(position = position_jitterdodge(jitter.width = 0.3, dodge.width = 0.3))

Overplotting 3: Low-precision data

Overplotting 4: Integer data

Positions in histograms:

  1. stack (the default): Bars for different groups are stacked on top of each other.
  2. dodge: Bars for different groups are placed side by side.
  3. fill: Bars for different groups are shown as proportions.
  4. identity: Plot the values as they appear in the dataset.

ggplot(mtcars, aes(mpg, fill = factor(am))) +
  geom_histogram(binwidth = 1, position = "identity", alpha = 0.4)

Bar Plots:

  1. geom_bar() [stat = "count"]: counts the number of cases at each x position
  2. geom_col() [stat = "identity"]: plots actual values

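# cyl and am are converted to factors so the bars are grouped by transmission type
ggplot(mtcars, aes(factor(cyl), fill = factor(am))) +
  # Add a bar layer
  geom_bar()

ggplot(mtcars, aes(factor(cyl), fill = factor(am))) +
  # Change the position to "dodge"
  geom_bar(position = "dodge")

ggplot(mtcars, aes(factor(cyl), fill = factor(am))) +
  # Set the position to "fill"
  geom_bar(position = "fill")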
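
The geom_bar() examples all count rows. For contrast, here is a hedged sketch of geom_col(), which plots values that have already been computed. The summary table mpg_by_cyl is built here purely for illustration with base R's aggregate().

```r
library(ggplot2)

# Precompute one value per group: mean mpg for each cylinder count
mpg_by_cyl <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

# geom_col() takes bar heights straight from the y values (stat = "identity")
p_col <- ggplot(mpg_by_cyl, aes(factor(cyl), mpg)) +
  geom_col() +
  labs(x = "Number of Cylinders", y = "Mean miles per gallon")

p_col
```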

We can customize bar plots further by adjusting the dodging so that the bars partially overlap each other. Instead of position = "dodge", we use position_dodge().

The reason we want to use position_dodge() (and position_jitter()) is to specify how much dodging (or jittering) we want.
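
A minimal sketch of partially overlapping bars, assuming mtcars with cyl and am converted to factors. The width of 0.2 and the alpha of 0.6 are illustrative values.

```r
library(ggplot2)

# A dodge width well below the default bar width (0.9) makes the bars overlap
p_overlap <- ggplot(mtcars, aes(factor(cyl), fill = factor(am))) +
  geom_bar(position = position_dodge(width = 0.2), alpha = 0.6)

p_overlap
```

The transparency is there so that the overlapped part of each bar remains visible.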

5.4 Themes

Themes control all non-data ink, that is, the visual elements that are not part of the data. There are three types of theme elements:

  1. Text : element_text()
  2. Line : element_line()
  3. Rectangle : element_rect()

Moving the legend: the legend is the area of the plot that describes what each visual element of the plot represents.

To change stylistic elements of a plot, call theme() and set plot properties to a new value. For example, the following changes the legend position.

[p + theme(legend.position = new_value)]

Here, the new value can be

  1. "top", "bottom", "left", or "right": place it at that side of the plot.
  2. “none”: don’t draw it.
  3. c(x, y): c(0, 0) means the bottom-left and c(1, 1) means the top-right.

ggplot(mtcars, aes(factor(cyl), fill = factor(am))) +
  # Set the position to "fill"
  geom_bar(position = 'fill') +
  theme(legend.position = c(0.6,0.1))

ggplot(diamonds, aes(carat, price, color = clarity)) +
  geom_point(alpha = 0.4) +
  geom_smooth() +
  theme(
    # For all rectangles, set the fill color to grey92
    rect = element_rect(fill = "grey92"),
    # For the legend key, turn off the outline
    legend.key = element_rect(color = NA)
  )
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

ggplot(diamonds, aes(carat, price, color = clarity)) +
  geom_point(alpha = 0.4) +
  geom_smooth() +
  theme(
    rect = element_rect(fill = "grey92"),
    legend.key = element_rect(color = NA),
    axis.ticks = element_blank(),
    panel.grid = element_blank(),
    # Add major y-axis panel grid lines back
    panel.grid.major.y = element_line(
      # Set the color to white
      color = "white",
      # Set the size to 0.5
      size = 0.5,
      # Set the line type to dotted
      linetype = "dotted"
    )
  )
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Whitespace means all the non-visible margins and spacing in the plot.

To set a single whitespace value, use unit(x, unit), where x is the amount and unit is the unit of measure.

Borders require you to set 4 positions, so use margin(top, right, bottom, left, unit). To remember the margin order, think TRouBLe.
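
Both helpers can be sketched in one theme() call. The plot and the specific sizes below are illustrative assumptions; unit() is re-exported by ggplot2 from the grid package, so no extra library call is needed.

```r
library(ggplot2)

p_units <- ggplot(mtcars, aes(wt, mpg, color = factor(cyl))) +
  geom_point() +
  theme(
    # A single whitespace value: unit(amount, unit of measure)
    axis.ticks.length = unit(2, "lines"),
    legend.key.size = unit(3, "cm"),
    # Four sides in top, right, bottom, left order
    legend.margin = margin(20, 30, 40, 50, "pt")
  )

p_units
```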

ggplot(diamonds, aes(carat, price, color = clarity)) +
  geom_point(alpha = 0.4) +
  geom_smooth() +
  theme(
    # Set the plot margin to (10, 30, 50, 70) millimeters
    plot.margin = margin(10, 30, 50, 70, "mm")
  )
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

In addition to making your own themes, there are several out-of-the-box solutions that may save you lots of time.

  1. theme_gray() is the default.
  2. theme_bw() is useful when you use transparency.
  3. theme_classic() is more traditional.
  4. theme_void() removes everything but the data.

ggplot(diamonds, aes(carat, price, color = clarity)) +
  geom_point(alpha = 0.4) +
  geom_smooth() +
  theme_classic()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

ggplot(diamonds, aes(carat, price, color = clarity)) +
  geom_point(alpha = 0.4) +
  geom_smooth() +
  theme_void()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

ggplot(diamonds, aes(carat, price, color = clarity)) +
  geom_point(alpha = 0.4) +
  geom_smooth() +
  theme_bw()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Exploring ggthemes:

Outside of ggplot2, another source of built-in themes is the ggthemes package.

library(ggthemes)

ggplot(diamonds, aes(carat, price, color = clarity)) +
  geom_point(alpha = 0.4) +
  geom_smooth() +
  theme_fivethirtyeight()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

ggplot(diamonds, aes(carat, price, color = clarity)) +
  geom_point(alpha = 0.4) +
  geom_smooth() +
  theme_tufte()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

ggplot(diamonds, aes(carat, price, color = clarity)) +
  geom_point(alpha = 0.4) +
  geom_smooth() +
  theme_wsj()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Setting themes:

Reusing a theme across many plots helps to provide a consistent style. You have several options for this.

theme_custom <- theme(
  rect = element_rect(fill = "grey92"),
  legend.key = element_rect(color = NA),
  axis.ticks = element_blank(),
  panel.grid = element_blank(),
  panel.grid.major.y = element_line(color = "white", size = 0.5, linetype = "dotted"),
  axis.text = element_text(color = "grey25"),
  plot.title = element_text(face = "italic", size = 16),
  legend.position = c(0.6, 0.1)
)

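# We can combine two themes. A complete theme such as theme_wsj() replaces
# everything added before it, so the custom elements go last.

theme_combine <- theme_wsj() + theme_custom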

ggplot(diamonds, aes(carat, price, color = clarity)) +
  geom_point(alpha = 0.4) +
  geom_smooth() +
  theme_combine
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

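To set a theme as the default, we use theme_set(). It expects a complete theme, so the partial theme_custom is added to a built-in theme first:

theme_set(theme_gray() + theme_custom)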
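
Since theme_set() changes the default for the whole session, it can help to keep the previous theme and restore it afterwards. A minimal sketch, where theme_demo is a hypothetical stand-in for a custom theme:

```r
library(ggplot2)

# A small complete theme, standing in for a custom theme
theme_demo <- theme_gray() + theme(legend.position = "bottom")

# theme_set() installs the new default and invisibly returns the previous one
old_theme <- theme_set(theme_demo)

# Plots drawn now pick up theme_demo without an explicit + theme_demo
p_default <- ggplot(mtcars, aes(wt, mpg, color = factor(cyl))) +
  geom_point()

# Restore the earlier default when finished
theme_set(old_theme)
```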

Example:

ggplot(diamonds, aes(carat, price, color = clarity)) +
  geom_point(alpha = 0.4) +
  geom_smooth() +
  theme_tufte() +
  # Add individual theme elements
  theme(
    # Turn off the legend
    legend.position = "none",
    # Turn off the axis ticks
    axis.ticks = element_blank(),
     # Set the axis title's text color to grey60
    axis.title = element_text(color = "grey60"),
    # Set the axis text's text color to grey60
    axis.text = element_text(color = "grey60"),
    # Set the panel gridlines major y values
    panel.grid.major.y = element_line(
      # Set the color to grey60
      color = "grey60",
      # Set the size to 0.25
      size = 0.25,
      # Set the linetype to dotted
      linetype = "dotted"
    )
  )
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Example: Gapminder dataset (scratch work)

library(RColorBrewer)

# Set the color scale
palette <- brewer.pal(5, "RdYlBu")[-(2:4)]

# gm2007 is a prepared subset of the gapminder data; it is not defined in these notes
ggplot(gm2007, aes(x = lifeExp, y = country, color = lifeExp)) +
  geom_point(size = 4) +
  geom_segment(aes(xend = 30, yend = country), size = 2) +
  geom_text(aes(label = round(lifeExp, 1)), color = "white", size = 1.5) +
  scale_x_continuous("", expand = c(0, 0), limits = c(30, 90), position = "top") +
  scale_color_gradientn(colors = palette) +
  labs(title = "Highest and lowest life expectancies, 2007", caption = "Source: gapminder")

How to define a theme:

ggplot(diamonds, aes(carat, price, color = clarity)) +
  geom_point(alpha = 0.4) +
  geom_smooth() +
  theme_tufte()  +
  theme_classic() +
  theme(axis.line.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.text = element_text(color = "black"),
        axis.title = element_blank(),
        legend.position = "none") +
  geom_vline(xintercept = 3, color = "grey40", linetype = 3) +
  annotate(
    "text",
    x = 3.5, y = 4900,
    label = "The\nrandom\nline",
    vjust = 1, size = 3, color = "grey40"
  ) +
  annotate(
    "curve",
    x = 3.5, y = 4900,
    xend = 3, yend = 5200,
    arrow = arrow(length = unit(0.2, "cm"), type = "closed"),
    color = "grey40"
  ) +
  labs(title = "I MADE A COOL PLOT", caption = "OH YES!")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'