adapted from Cédric Scherer’s rstudio::conf(2022) workshop, Graphic Design with ggplot2
EDS 240: Lecture 1.3
{ggplot2} review
Week 1 | January 6th, 2024
Advantages of {ggplot2}
{ggplot2} is based on the Grammar of Graphics
“A grammar of graphics is a tool that enables us to concisely describe the components of a graphic. Such a grammar allows us to move beyond named graphics (e.g. the “scatterplot”) and gain insight into the deep structure that underlies statistical graphics.”
-from Hadley Wickham’s A layered grammar of graphics in Journal of Computational and Graphical Statistics, vol. 19, no. 1 pp. 3-28, 2010.
“In the grammar of a language, words have different parts of speech, which perform different roles in the sentence. Analogously, the grammar of graphics separates a graphic into different layers”
-from Liz Sander’s post Telling stories with data using the grammar of graphics
{ggplot2} graphic layers
First these:
1. data – in tidy format + define aesthetics (how variables map onto a plot e.g. axes, shape, color, size)
2. geometric objects (aka geoms) – define the type of plot(s)
Then these:
3. statistical transformations – algorithm used to calculate new values for a graph
4. position adjustments – control the fine details of position when geoms might otherwise overlap
5. coordinate system – change what x and y axes mean (e.g. Cartesian (default), polar, flipped)
6. facet – create subplots that each display one subset of the data
Note: You may not apply or customize all of the above layers (or use them in this exact order) for every plot you build
Enhance communication using additional layers
1. labels – add / update titles, axis / legend labels
2. annotations – add textual labels (e.g. to highlight specific data points or trend lines, etc.)
3. scales – update how the aesthetic mappings manifest visually (e.g. color scales, axis ticks, legends)
4. themes – customize the non-data elements of your plot
5. layout – combine multiple plots into the same graphic
Note: You may not apply or customize all of the above layers (or use them in this exact order) for every plot you build
An aside . . .
Art by Allison Horst
What is tidy data?
Artwork by Allison Horst
Untidy data can take many different formats
Artwork by Allison Horst
An example: untidy temperatures
Take this tibble (very similar to a data.frame) of temperature recordings at three stations on three dates:
library(tibble) # provides tribble()

temp_data_wide <- tribble(
  ~date, ~station1, ~station2, ~station3,
  "2023-10-01", 30.1, 29.8, 31.2,
  "2023-11-01", 28.6, 29.1, 33.4,
  "2023-12-01", 29.9, 28.5, 32.3
)

print(temp_data_wide)
# A tibble: 3 × 4
  date       station1 station2 station3
  <chr>         <dbl>    <dbl>    <dbl>
1 2023-10-01     30.1     29.8     31.2
2 2023-11-01     28.6     29.1     33.4
3 2023-12-01     29.9     28.5     32.3
This tibble is in wide or untidy format.
Make tidy temperatures!
1.) What makes temp_data_wide untidy?
2.) Sketch out on paper or talk through what temp_data_wide would look like in long aka tidy format. Why?
An example: untidy temperatures
Multiple observations (temperature recordings) per row
Want more examples of untidy data? Check out these teaching materials from the NCEAS Learning Hub showcasing real-world examples of very untidy data.
An example: tidy temperatures
We can use tidyr::pivot_longer() to “lengthen” our data, aka convert it from wide / untidy to long / tidy:
library(tidyr) # provides pivot_longer()

temp_data_long <- temp_data_wide |>
  pivot_longer(cols = starts_with("station"),
               names_to = "station_id",
               values_to = "temp_c")

print(temp_data_long)
# A tibble: 9 × 3
  date       station_id temp_c
  <chr>      <chr>       <dbl>
1 2023-10-01 station1     30.1
2 2023-10-01 station2     29.8
3 2023-10-01 station3     31.2
4 2023-11-01 station1     28.6
5 2023-11-01 station2     29.1
6 2023-11-01 station3     33.4
7 2023-12-01 station1     29.9
8 2023-12-01 station2     28.5
9 2023-12-01 station3     32.3
Benefits of tidy data
Artwork by Allison Horst
Data viz almost always begins with data wrangling
The {tidyverse} is an “opinionated” set of packages – meaning they share similar philosophies, grammar, and data structures – that are incredibly useful for data wrangling, cleaning, and manipulation (and of course, visualization).
Check out the tidyverse website to learn more about each of these packages
The best resource for learning all things R for Data Science!
Okay, moving on . . .
Let’s make some ggplots using data from {palmerpenguins} (which are already tidy)!
Artwork by Allison Horst
Plot #1
We’ll start by exploring the relationship between penguin bill length and bill depth. For this example, we’ll focus on understanding the following layers of a ggplot (bolded):
Graphic layers:
1. data – in tidy format + define aesthetics (how variables map onto a plot e.g. axes, shape, color, size)
2. geometric objects (aka geoms) – define the type of plot(s)
3. statistical transformations – algorithm used to calculate new values for a graph
4. position adjustments – control the fine details of position when geoms might otherwise overlap
5. coordinate system – change what x and y axes mean (e.g. Cartesian (default), polar, flipped)
6. facet – create subplots that each display one subset of the data
“Enhancing communication” layers:
1. labels – add / update titles, axis / legend labels
2. annotations – add textual labels (e.g. to highlight specific data points or trend lines, etc.)
3. scales – update how the aesthetic mappings manifest visually (e.g. color scales, axis ticks, legends)
4. themes – customize the non-data elements of your plot
5. layout – combine multiple plots into the same graphic
Initialize a plot object
Initialize your plot object using ggplot() – this creates a graph that’s primed to display the penguins data set, but empty since we haven’t told ggplot how to map our data onto the graph yet (in other words: we haven’t told ggplot what variables to display and where, as well as what type of plot to create):
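A minimal sketch of this step, assuming {ggplot2} and {palmerpenguins} are installed (the object name p is only for illustration):

```r
library(ggplot2)
library(palmerpenguins)

# Initialize the plot object with data only; no aesthetics or geoms yet,
# so printing it draws an empty panel
p <- ggplot(data = penguins)
p
```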
data | geometric object | statistical transformation | position adjustment | coordinate system | facet
Initialize a plot object + map aesthetics
The mapping argument defines how variables in your data set are mapped to visual properties (aesthetics) of your plot. Here, we specify which variables map to our x and y axes:
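One way this might look, assuming bill length goes on the x axis and bill depth on the y axis (the two variables named for Plot #1):

```r
library(ggplot2)
library(palmerpenguins)

# Map bill length to x and bill depth to y; axes appear, but no data are
# drawn because no geom has been added yet
p <- ggplot(data = penguins,
            mapping = aes(x = bill_length_mm, y = bill_depth_mm))
p
```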
Omitting argument names
The data and mapping arguments are often not explicitly written in ggplot(), as in the example below (makes for more concise code):
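A sketch of the more concise form, relying on data and mapping being the first two arguments of ggplot():

```r
library(ggplot2)
library(palmerpenguins)

# Same plot object as with named arguments: data is matched to the first
# position and mapping to the second
p <- ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm))
p
```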
Define a geom to represent data
Next, we’ll layer on a geometric object (aka geom) that our plot will use to represent our penguin data. There are many geoms (geom_*()) that are built into {ggplot2} already (and more when you use extension packages). To create a scatterplot:
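A minimal sketch of the scatterplot, assuming geom_point() as the geom:

```r
library(ggplot2)
library(palmerpenguins)

# Add a point layer: one point per penguin
p <- ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point()
p
```

Rows with missing bill measurements are dropped with a warning when the plot is drawn.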
Use color to differentiate species
If we’d like to represent species using another aesthetic (e.g. color, shape, size), we need to modify our plot’s aesthetic (i.e. inside aes()) – any time we want to modify the appearance of our plotted data based on a variable in our data set, we do so within aes(). This process is known as scaling. A legend will automatically be added to indicate which values (in this case, colors) correspond to which level of our variable (in this case, species):
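For example, a sketch that maps color to species inside aes():

```r
library(ggplot2)
library(palmerpenguins)

# color = species sits inside aes(), so each species gets its own color
# and a legend is added automatically
p <- ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm,
                          color = species)) +
  geom_point()
p
```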
labels | annotations | scales | themes | layout
We can also map our own colors
Here, we use scale_color_manual() to update the colors of our data points. Colors are assigned in order: the levels in our data (i.e. Adelie, Chinstrap, Gentoo) are matched to the aesthetic values in the order supplied ("darkorange", "purple", "cyan4"):
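A sketch using the three colors named above:

```r
library(ggplot2)
library(palmerpenguins)

# Values are matched to factor levels in order:
# Adelie -> darkorange, Chinstrap -> purple, Gentoo -> cyan4
p <- ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm,
                          color = species)) +
  geom_point() +
  scale_color_manual(values = c("darkorange", "purple", "cyan4"))
p
```

Supplying a named vector (e.g. c(Adelie = "darkorange", ...)) is an alternative that does not depend on level order.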
Use color to describe a continuous variable
In the previous example, we mapped color to a categorical variable (species). We can also map color to continuous variables (e.g. body_mass_g):
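A sketch of a continuous color mapping. The scale_color_gradient() call and its two hex colors are illustrative assumptions; omitting the scale gives the default blue gradient:

```r
library(ggplot2)
library(palmerpenguins)

# body_mass_g is numeric, so color is drawn from a continuous gradient
# and the legend becomes a color bar
p <- ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm,
                          color = body_mass_g)) +
  geom_point() +
  scale_color_gradient(low = "#132B43", high = "#F7DD4C") # hypothetical colors
p
```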
What if we just want to color all points the same?
Do so within the corresponding geom_*() and outside of the aes() function! Color is no longer being mapped to a variable.
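A sketch of setting (rather than mapping) color; "purple" is an arbitrary choice for illustration:

```r
library(ggplot2)
library(palmerpenguins)

# color is set outside aes(), so every point is purple and no legend is drawn
p <- ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(color = "purple")
p
```

Writing aes(color = "purple") instead would treat "purple" as a one-level variable and produce a legend with default colors, which is a common mistake.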
Defining data & mappings in geom_*()
You can also define the data and mapping layers within a geom_*() (i.e. locally) rather than in ggplot() (i.e. globally) – this is helpful if you plan to have multiple geoms with different mappings (e.g. you’re plotting data from multiple data frames). You must include the data and mapping argument names if you specify them out of order (in geom_*() functions, mapping comes first and data second; see documentation):
library(tidyverse)
library(palmerpenguins)

# create a separate penguins_summary df ----
penguins_summary <- penguins |>
  drop_na() |>
  group_by(species) |>
  summarize(
    mean_bill_length_mm = mean(bill_length_mm),
    mean_bill_depth_mm = mean(bill_depth_mm)
  )

# create ggplot with layers from different dfs ----
ggplot() +
  geom_point(data = penguins,
             mapping = aes(x = bill_length_mm, y = bill_depth_mm, color = species),
             alpha = 0.5) +
  geom_point(data = penguins_summary,
             mapping = aes(x = mean_bill_length_mm, y = mean_bill_depth_mm, fill = species),
             size = 5, shape = 24)
Mapping color at a local level
You may also separately map color at a local (i.e. within a specific geom) rather than global (i.e. within ggplot()) level:
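A sketch of a local color mapping, where x and y stay global and color is mapped only inside geom_point():

```r
library(ggplot2)
library(palmerpenguins)

# x and y are global (shared by all layers); color applies to this layer only
p <- ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(aes(color = species))
p
```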
Why map locally?
Here, we use geom_smooth() to add a best fit line (based on a linear model, using method = "lm") to our plot:
Global mappings are passed down to each subsequent geom layer. Therefore, the color = species mapping is also passed to geom_smooth(), resulting in a best fit line for each species.
Local mappings (e.g. within geom_point()) only apply to that particular layer. Therefore, the color = species mapping is only applied to geom_point(), and geom_smooth() fits a best fit line to the entire data set.
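The two cases can be sketched side by side (the names p_global and p_local are only for illustration):

```r
library(ggplot2)
library(palmerpenguins)

# Global color mapping: geom_smooth() inherits color = species,
# so one line is fit per species
p_global <- ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm,
                                 color = species)) +
  geom_point() +
  geom_smooth(method = "lm")

# Local color mapping: only the points are colored by species,
# so geom_smooth() fits a single line to all penguins
p_local <- ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(aes(color = species)) +
  geom_smooth(method = "lm")

p_global
p_local
```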
Piping directly into a ggplot
Lastly, you can also pipe (using %>% or |>) directly from a data frame into a ggplot() call (we’ll use more of this in future lessons) – this is useful if you need to do a bit of data wrangling first, but don’t want to create a whole new data frame. When doing so, omit the data argument from ggplot():
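A sketch of piping into ggplot(). The filter() step is a hypothetical bit of wrangling added for illustration:

```r
library(ggplot2)
library(dplyr)
library(palmerpenguins)

# The piped data frame fills ggplot()'s data argument, so only the
# mapping is written inside ggplot()
p <- penguins |>
  filter(species == "Gentoo") |> # hypothetical wrangling step
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point()
p
```

Note the switch from the pipe to + once ggplot() is called: layers are added, not piped.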
Plot #2
In this next example, we’ll explore penguin species counts. For this example, we’ll focus on understanding the following layers of a ggplot (bolded):
Graphic layers:
1. data – in tidy format + define aesthetics (how variables map onto a plot e.g. axes, shape, color, size)
2. geometric objects (aka geoms) – define the type of plot(s)
3. statistical transformations – algorithm used to calculate new values for a graph
4. position adjustments – control the fine details of position when geoms might otherwise overlap
5. coordinate system – change what x and y axes mean (e.g. Cartesian (default), polar, flipped)
6. facet – create subplots that each display one subset of the data
“Enhancing communication” layers:
1. labels – add / update titles, axis / legend labels
2. annotations – add textual labels (e.g. to highlight specific data points or trend lines, etc.)
3. scales – update how the aesthetic mappings manifest visually (e.g. color scales, axis ticks, legends)
4. themes – customize the non-data elements of your plot
5. layout – combine multiple plots into the same graphic
Initialize + map aesthetics + define geom
Similar to our first scatterplot, we start by initializing our plot object with data, mapping our aesthetics, and defining a geometric object:
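A minimal sketch, assuming species on the x axis and geom_bar() as the geom:

```r
library(ggplot2)
library(palmerpenguins)

# Only x is mapped; geom_bar() computes the bar heights (counts) itself
p <- ggplot(penguins, aes(x = species)) +
  geom_bar()
p
```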
What is a statistical transformation?
Some geoms, like scatterplots, plot the raw values of your data set. Other geoms, like bar charts, histograms, boxplots, smoothers, etc. calculate new values to plot.
Each point on our scatterplot represents a raw observation value (one point = one penguin)
The default stat for geom_bar() is “count”
Every geom has a default stat – meaning you can typically use geoms without worrying about the underlying statistical transformation.
The default statistical transformation used in geom_bar() is count, which first groups our categorical variable (species), then calculates a count for each unique level (Adelie, Chinstrap, Gentoo).
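To make the transformation visible, this sketch writes the default stat explicitly and then uses layer_data() to inspect the values ggplot computed (the column selection is just for readability):

```r
library(ggplot2)
library(palmerpenguins)

# stat = "count" is the default, so this is identical to geom_bar()
p <- ggplot(penguins, aes(x = species)) +
  geom_bar(stat = "count")
p

# Inspect the computed values: one row per species, with its count
computed <- layer_data(p)
print(computed[, c("x", "count")])
```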
We can override the default stat
Let’s say we have a data frame with precalculated count values (e.g. penguins_summary) that we’d like to plot using geom_bar(). We can change stat = "count" (default) to stat = "identity" to generate bar heights based on the “identity” of values in the n column of penguins_summary.
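A sketch under the assumption that penguins_summary is rebuilt here with dplyr::count(), which produces the n column mentioned above (this replaces the earlier penguins_summary of mean bill measurements):

```r
library(ggplot2)
library(dplyr)
library(palmerpenguins)

# Precompute one row per species with its count in column n
penguins_summary <- penguins |>
  count(species)

# stat = "identity" plots the y values as-is instead of counting rows
p <- ggplot(penguins_summary, aes(x = species, y = n)) +
  geom_bar(stat = "identity")
p
```

geom_col() is a shorthand for geom_bar(stat = "identity").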
We can override the default stat mapping
Now let’s say we’d like to display the same bar chart with y-axis values as proportions, rather than counts. We can override the default mapping of transformed variables to aesthetics with:
stat = "count" – counts the number of occurrences in each category (here, species)
y = after_stat(prop) – tells ggplot to calculate the proportion of each species count relative to the total count (delayed until after counts have been computed)
group = 1 – ensures that proportions are calculated across all species. The default behavior of geom_bar() is to group by the x variable to separately count the number of rows in each level (Adelie, Chinstrap, Gentoo), but we actually need to know how many total penguins there are in the data set in order to calculate proportions
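Put together, the three pieces above might look like this (stat = "count" is the default for geom_bar() and is written out only for clarity):

```r
library(ggplot2)
library(palmerpenguins)

# after_stat(prop) is evaluated after counting; group = 1 makes the
# denominator the total number of penguins rather than each species' own count
p <- ggplot(penguins, aes(x = species, y = after_stat(prop), group = 1)) +
  geom_bar(stat = "count")
p
```

Without group = 1, each species forms its own group and every bar would have a proportion of 1.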
What is a position adjustment?
Position adjustments apply minor tweaks to the position of elements to resolve overlapping geoms. For example, let’s say we would like to visualize penguin counts by species (bar height) and by island (color) using our bar chart from earlier. We could add the fill aesthetic:
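A sketch adding fill = island to the earlier bar chart:

```r
library(ggplot2)
library(palmerpenguins)

# Each species bar is split into colored segments, one per island;
# the segments are stacked because "stack" is geom_bar()'s default position
p <- ggplot(penguins, aes(x = species, fill = island)) +
  geom_bar()
p
```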
The default position for geom_bar() is “stack”
Every geom has a default position. The default position used in geom_bar() is stack, which stacks bars on top of one another, based on the fill value (here, that’s island):
Alternative position adjustments for geom_bar()
Below are a few position options available for use with geom_bar():
position = "fill" creates a set of stacked bars but makes each set the same height (easier to compare proportions across groups)
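A sketch of position = "fill". The "dodge" variant is included as a second built-in option; it is not described in the text above:

```r
library(ggplot2)
library(palmerpenguins)

# "fill": stacked bars rescaled so every species bar has a total height of 1
p_fill <- ggplot(penguins, aes(x = species, fill = island)) +
  geom_bar(position = "fill")

# "dodge": bars for each island placed side by side within a species
p_dodge <- ggplot(penguins, aes(x = species, fill = island)) +
  geom_bar(position = "dodge")

p_fill
p_dodge
```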
Alternatively, use position = position_*()
Instead of position = "X", you can use functions to update and further adjust your geom’s positions. Here, we’ll use position_dodge2() to also ensure the widths of each of our bars are equal:
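One way to get equal bar widths, assuming the preserve = "single" argument of position_dodge2() is the intended setting:

```r
library(ggplot2)
library(palmerpenguins)

# preserve = "single" keeps every individual bar the same width, even when
# a species occurs on fewer islands than the others
p <- ggplot(penguins, aes(x = species, fill = island)) +
  geom_bar(position = position_dodge2(preserve = "single"))
p
```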
What is a coordinate system?
A coordinate system uses one or more numbers (coordinates) to uniquely determine the position of points or other geometric elements. By default, ggplots are constructed in a Cartesian coordinate system, consisting of a horizontal x-axis and vertical y-axis.
Changing coordinate systems
Depending on the type of data, axis label length, etc., it may make sense to change this coordinate system. One option for our bar plot:
coord_flip() switches the x and y axes.
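A sketch applying coord_flip() to the species bar chart:

```r
library(ggplot2)
library(palmerpenguins)

# Species now run along the vertical axis and counts along the horizontal
# axis, which helps when category labels are long
p <- ggplot(penguins, aes(x = species)) +
  geom_bar() +
  coord_flip()
p
```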
Use pre-made themes to update plot appearance
{ggplot2} comes with a number of complete themes, which control all non-data display. See two examples below:
displays x and y axis lines and no gridlines
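The theme names are not given here, so this sketch assumes theme_classic(), whose {ggplot2} documentation describes it as having x and y axis lines and no gridlines:

```r
library(ggplot2)
library(palmerpenguins)

# theme_classic() is an assumption based on the description above;
# any other complete theme (e.g. theme_minimal()) is applied the same way
p <- ggplot(penguins, aes(x = species)) +
  geom_bar() +
  theme_classic()
p
```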
Further customize plot appearance using theme()
Further modify nearly any non-data element of your plot using theme().
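A sketch of theme() layered on top of a complete theme. The base theme, the element chosen (axis.title), and its size and color are all hypothetical values for illustration:

```r
library(ggplot2)
library(palmerpenguins)

# theme() comes after the complete theme so its settings are not overwritten
p <- ggplot(penguins, aes(x = species)) +
  geom_bar() +
  theme_light() +
  theme(axis.title = element_text(size = 17, color = "purple")) # hypothetical values
p
```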
Plot #3
In this next example, we’ll explore penguin flipper lengths. For this example, we’ll focus on understanding the following layers of a ggplot (bolded):
Graphic layers:
1. data – in tidy format + define aesthetics (how variables map onto a plot e.g. axes, shape, color, size)
2. geometric objects (aka geoms) – define the type of plot(s)
3. statistical transformations – algorithm used to calculate new values for a graph
4. position adjustments – control the fine details of position when geoms might otherwise overlap
5. coordinate system – change what x and y axes mean (e.g. Cartesian (default), polar, flipped)
6. facet – create subplots that each display one subset of the data
“Enhancing communication” layers:
1. labels – add / update titles, axis / legend labels
2. annotations – add textual labels (e.g. to highlight specific data points or trend lines, etc.)
3. scales – update how the aesthetic mappings manifest visually (e.g. color scales, axis ticks, legends)
4. themes – customize the non-data elements of your plot
5. layout – combine multiple plots into the same graphic
Initialize + map aesthetics + define geom
We’ll again start by initializing our plot object with data, mapping our aesthetics, and defining a geometric object. Note that the default statistical transformation for geom_histogram() is stat = "bin":
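A minimal sketch, assuming flipper_length_mm on the x axis:

```r
library(ggplot2)
library(palmerpenguins)

# stat = "bin" (the default) divides flipper lengths into bins and counts
# the penguins in each; ggplot picks 30 bins unless bins or binwidth is set
p <- ggplot(penguins, aes(x = flipper_length_mm)) +
  geom_histogram()
p
```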
Use color to differentiate species
Just like in our scatterplot (Plot #1), we’ll modify our plot’s aesthetics (i.e. inside aes()) to color our histogram bins according to the species variable. Unlike our scatterplot (which uses the color aesthetic), we’ll use the fill aesthetic to fill the bars with color (rather than outline them with color). We’ll also manually define our fill scale:
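A sketch of this step, reusing the three colors from Plot #1:

```r
library(ggplot2)
library(palmerpenguins)

# fill colors the inside of each bar; color would only change its outline
p <- ggplot(penguins, aes(x = flipper_length_mm, fill = species)) +
  geom_histogram() +
  scale_fill_manual(values = c("darkorange", "purple", "cyan4"))
p
```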
Update the default position to "identity"
Let’s update the position of our binned bars from "stack" to "identity" and also increase the transparency (using alpha) so that we can see overlapping bars:
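A sketch of the updated histogram; alpha = 0.5 matches the value used in the facet example later on:

```r
library(ggplot2)
library(palmerpenguins)

# position = "identity" draws each species' bars from the baseline, so they
# overlap instead of stacking; alpha = 0.5 makes the overlap visible
p <- ggplot(penguins, aes(x = flipper_length_mm, fill = species)) +
  geom_histogram(position = "identity", alpha = 0.5) +
  scale_fill_manual(values = c("darkorange", "purple", "cyan4"))
p
```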
Update / add plot labels
Update axis and legend titles and add a plot title using labs():
Create subplots using facets
Sometimes (particularly during the data exploration phase) it’s helpful to create subplots (i.e. separate panels) of your data. Here we use facet_wrap() to separate our data by the species variable. By default, it creates a 1 x 3 matrix of plots. We can manually specify how many rows or columns we’d like using nrow or ncol:
ggplot(penguins, aes(x = flipper_length_mm, fill = species)) +
  geom_histogram(position = "identity", alpha = 0.5) +
  scale_fill_manual(values = c("darkorange", "purple", "cyan4")) +
  labs(x = "Flipper length (mm)", y = "Frequency", fill = "Species",
       title = "Penguin Flipper Lengths") +
  facet_wrap(~species, ncol = 1)
Building a data viz is an iterative process!
We’ll spend the next two weeks learning how to build some fundamental chart types and talking about important considerations when choosing a graphic form for presenting your data. Then, we’ll move into graphic design theory and the tools and packages in the {ggplot2} ecosystem that make it possible.
Visualization by Cédric Scherer, from his blog post, The Evolution of a ggplot (Ep.1) – create your own ggplot evolution gif using the {camcorder} package
See you next week!