EDS 240 – lecture2.2-distributions-slides

EDS 240: Lecture 2.2

Visualizing distributions

Week 2 | January 13th, 2024

Visualizing data distribution?

Visualizing the spread of a numeric variable

“Core” distribution chart types

Histograms

Density plots

Ridgeline plots

Box plots

Violin plots

Examples show the distribution of penguin body masses (g) for Adelie, Chinstrap & Gentoo penguins.

The data: bottom temperatures at Mohawk Reef

The Santa Barbara Coastal Long Term Ecological Research (SBC LTER) site was established in 2000 to understand the ecology of coastal kelp forest ecosystems. A number of coastal rocky reef sites are outfitted with instrumentation that collects long-term monitoring data.

A photo of kelp fronds rising towards the ocean's surface.

The Santa Barbara Coastal Long Term Ecological Research site's logo. A creek running down from green mountains to coastal waters meets with ocean waves. Bull kelp floats beneath the surface of the ocean.

We’ll be exploring bottom temperatures recorded at Mohawk Reef, a near-shore rocky reef and one of the Santa Barbara Coastal (SBC) LTER research sites.

Data wrangling

Data are imported directly from the EDI Data Portal. Explore the metadata package online to learn more about these data.

##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
##                                    setup                                 ----
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

#..........................load packages.........................
library(tidyverse)
library(chron)
library(naniar)
library(ggridges)
library(gghighlight)
library(ggbeeswarm)
library(see)
library(palmerpenguins) # for some minimal examples

#..........................import data...........................
mko <- read_csv("https://portal.edirepository.org/nis/dataviewer?packageid=knb-lter-sbc.2007.17&entityid=02629ecc08a536972dec021f662428aa")

##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
##                                wrangle data                              ----
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

mko_clean <- mko |>

  # keep only necessary columns ----
  select(year, month, day, decimal_time, Temp_bot, Temp_top, Temp_mid) |>

  # create datetime column (not totally necessary for our plots, but it can be helpful to know how to do this!) ----
  unite(date, year, month, day, sep = "-", remove = FALSE) |>
  mutate(time = chron::times(decimal_time)) |>
  unite(date_time, date, time, sep = " ") |>

  # coerce data types ----
  mutate(date_time = as_datetime(date_time, format = "%Y-%m-%d %H:%M:%S", tz = "GMT"), 
         year = as.factor(year),
         month = as.factor(month),
         day = as.numeric(day)) |>

  # add month name by indexing the built-in `month.name` vector ----
  mutate(month_name = month.name[month]) |> 

  # replace 9999s with NAs ----
  naniar::replace_with_na(replace = list(Temp_bot = 9999, 
                                         Temp_top = 9999, 
                                         Temp_mid = 9999)) |>

  # select/reorder desired columns ----
  select(date_time, year, month, day, month_name, Temp_bot, Temp_mid, Temp_top)

##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
##                            explore missing data                          ----
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

#..........counts & percentages of missing data by year..........
see_NAs <- mko_clean |> 
  group_by(year) |> 
  naniar::miss_var_summary() |>
  filter(variable == "Temp_bot")

#...................visualize missing Temp_bot...................
bottom <- mko_clean |> select(Temp_bot)
missing_temps <- naniar::vis_miss(bottom)
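Two of the wrangling steps above are less obvious than the rest: turning a fraction-of-day value into a clock time, and swapping the 9999 placeholder for NA. Here is a minimal base R sketch of both on a toy data frame. The values are made up for illustration, and the sketch assumes (as the use of chron::times() suggests) that decimal_time stores the time of day as a fraction of 24 hours:

```r
# Toy data (hypothetical values, not the SBC LTER file)
toy <- data.frame(
  year = c(2020, 2020, 2020),
  month = c(1, 1, 1),
  day = c(15, 15, 15),
  decimal_time = c(0, 0.25, 0.5),  # assumed: fraction of a day (0.25 = 06:00)
  Temp_bot = c(14.2, 9999, 14.6)   # 9999 marks a missing reading
)

# Midnight of each date, plus the fraction of the day converted to seconds
midnight <- ISOdatetime(toy$year, toy$month, toy$day,
                        hour = 0, min = 0, sec = 0, tz = "GMT")
toy$date_time <- midnight + toy$decimal_time * 24 * 60 * 60

# Replace the 9999 placeholder with NA
toy$Temp_bot[toy$Temp_bot == 9999] <- NA

format(toy$date_time, "%H:%M:%S")
toy$Temp_bot
```

Replacing placeholders before plotting matters: a handful of 9999s left in Temp_bot would stretch the x-axis of every distribution plot that follows.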

Histograms - ggplot2::geom_histogram()

What are they?

Histograms are used to represent the distribution of a numeric variable(s), which is cut into several bins. The number of observations per bin is represented by the height of the bar.

Need:

a numeric variable with lots of values
meaningful differences between values

Important considerations:

bin width (30 bins by default)
too few / too many bins
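To see why the number of bins matters, it can help to do the binning by hand. The sketch below uses simulated values and a hypothetical helper, count_bins(), that splits the data range into equal-width bins. It is a simplified stand-in for what a histogram does, not a reproduction of how geom_histogram() positions its bins:

```r
set.seed(240)
temps <- rnorm(1000, mean = 15, sd = 2)  # simulated values, for illustration only

# Hypothetical helper: count observations in `n_bins` equal-width bins
count_bins <- function(x, n_bins) {
  breaks <- seq(min(x), max(x), length.out = n_bins + 1)
  as.vector(table(cut(x, breaks = breaks, include.lowest = TRUE)))
}

count_bins(temps, 4)   # coarse: only the rough location of the peak survives
count_bins(temps, 30)  # finer: the shape of the distribution shows in the counts
```

Every observation lands in exactly one bin either way, so the counts always sum to the sample size; only the resolution of the shape changes.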

Histograms - avoid plotting too many groups

Twelve groups (month_name) is too many groups – especially when the range of temperature values for each of our groups largely overlap:

mko_clean |> 
  mutate(month_name = factor(month_name, levels = month.name)) |> 
  ggplot(aes(x = Temp_bot, fill = month_name)) +
  geom_histogram(position = "identity", alpha = 0.5)

Histograms - adjustments

Small multiples
Fewer groups
Adjust colors
Modify bin widths

If you want to plot all groups, consider splitting them into small multiples. If so, does color add any valuable information? Remove if not:

mko_clean |> 
  mutate(month_name = factor(month_name, levels = month.name)) |> 
  ggplot(aes(x = Temp_bot)) +
  geom_histogram() +
  facet_wrap(~month_name)

Let’s instead compare just three months: April (generally the coldest month), October (generally a hot month), June (somewhere in between):

mko_clean |> 
  mutate(month_name = factor(month_name, levels = month.name)) |> 
  filter(month_name %in% c("April", "June", "October")) |> 
  ggplot(aes(x = Temp_bot, fill = month_name)) + # piping data into ggplot, so don't need to define `data` arg
  geom_histogram(position = "identity", alpha = 0.5)

Use fill to fill bars with a specified color(s) and color to outline bars with a specified color(s):

mko_clean |> 
  mutate(month_name = factor(month_name, levels = month.name)) |> 
  filter(month_name %in% c("April", "June", "October")) |> 
  ggplot(aes(x = Temp_bot, fill = month_name)) + 
  geom_histogram(position = "identity", alpha = 0.5,  color = "black") +
  scale_fill_manual(values = c("#2C5374", "#ADD8E6", "#8B3A3A"))

Modify binwidth (30 bins by default) – does a bin width of 1 (degree Celsius) actually make sense? Consider scale of interest. Also be mindful when using bins – too few bins will result in loss of distribution shape.

mko_clean |> 
  filter(month_name %in% c("April", "June", "October")) |> 
  mutate(month_name = factor(month_name, levels = month.name)) |> 
  ggplot(aes(x = Temp_bot, fill = month_name)) +
  geom_histogram(position = "identity", alpha = 0.5, binwidth = 1) +
  scale_fill_manual(values = c("#2C5374", "#ADD8E6", "#8B3A3A"))

Density plots - ggplot2::geom_density()

What are they?

A smoothed version of a histogram. Density plots represent the distribution of a numeric variable(s) using a kernel density estimate (KDE) to show the variable's probability density function. The y-axis represents the estimated density, i.e. the relative likelihood of a value occurring. The area under each curve is equal to 1. Use a density plot when you are most concerned with the shape of the distribution.

Need:

a numeric variable with lots of values

Important considerations:

useful when you want to visualize the shape of your data (not affected by bin number)
does not indicate sample size
can be misleading with small data sets
bandwidth, which affects the level of smoothing
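The claim that the area under a density curve equals 1 can be checked numerically. This sketch uses base R's density() function, which computes a Gaussian KDE, on simulated values, and approximates the area with the trapezoid rule:

```r
set.seed(240)
x <- rnorm(500, mean = 15, sd = 2)  # simulated values, for illustration only

kde <- density(x)  # Gaussian kernel, default bandwidth rule

# Trapezoid-rule approximation of the area under the estimated curve
area <- sum(diff(kde$x) * (head(kde$y, -1) + tail(kde$y, -1)) / 2)
area  # very close to 1
```

Because the curve is normalized this way, its height says nothing about how many observations went into it, which is why density plots do not indicate sample size.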

Density plots - avoid plotting too many groups

Similar to the histogram, twelve groups (month_name) is too many groups! Consider small multiples (using facet_wrap()) if you want to keep all groups.

mko_clean |> 
  mutate(month_name = factor(x = month_name, levels = month.name)) |> 
  ggplot(aes(x = Temp_bot, fill = month_name)) +
  geom_density(alpha = 0.5)

Density plots - adjustments

Small multiples
Fewer groups
Modify bandwidths

If you want to plot all groups, consider splitting them into small multiples. If so, does color add any valuable information? Remove if not:

mko_clean |> 
  mutate(month_name = factor(month_name, levels = month.name)) |> 
  ggplot(aes(x = Temp_bot)) +
  geom_density(fill = "gray30") +
  facet_wrap(~month_name)

Let’s instead compare three months: April (generally the coldest month), October (generally a hot month), June (somewhere in between):

mko_clean |> 
  filter(month_name %in% c("April", "June", "October")) |> 
  ggplot(aes(x = Temp_bot, fill = month_name)) +
  geom_density(alpha = 0.5) + 
  scale_fill_manual(values = c("#2C5374", "#ADD8E6", "#8B3A3A"))

Modify the bandwidth with the adjust argument, a multiplier applied to the default bandwidth (default adjust = 1). Values below 1 reduce the amount of smoothing; values above 1 increase it:

mko_clean |> 
  filter(month_name %in% c("April", "June", "October")) |> 
  ggplot(aes(x = Temp_bot, fill = month_name)) +
  geom_density(alpha = 0.5, adjust = 0.5) + 
  scale_fill_manual(values = c("#2C5374", "#ADD8E6", "#8B3A3A"))
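The multiplier behavior of adjust is easy to confirm with base R's density(), which takes an argument of the same name. A small sketch on simulated values:

```r
set.seed(240)
x <- rnorm(500, mean = 15, sd = 2)  # simulated values, for illustration only

default_kde <- density(x)                # adjust = 1
rough_kde   <- density(x, adjust = 0.5)  # half the default bandwidth
smooth_kde  <- density(x, adjust = 2)    # twice the default bandwidth

# The bandwidth actually used by each estimate
c(default = default_kde$bw, rough = rough_kde$bw, smooth = smooth_kde$bw)
```

A smaller bandwidth lets the curve follow individual clusters of points more closely, which can reveal real structure but also exaggerates noise in small samples.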

An important distinction

Histograms show us the counts (frequency) of values in each range (bin), represented by the height of the bars.

Density plots show the proportion of values in each range (the area under the curve equals 1; peaks indicate where more values are concentrated, but they do not tell us anything about the number of observations).

We’ll use some dummy data to demonstrate how this differs visually:

dummy_data <- data.frame(value = c(rnorm(n = 100, mean = 5),
                                   rnorm(n = 200, mean = 10)),
                         group = rep(c("A", "B"),
                                     times = c(100, 200)))

Here, we have two groups (A, B) of values which are normally distributed, but with different means. Group A also has a smaller sample size (100) than group B (200).

An important distinction

It’s easy to see that group B has a larger sample size than group A when looking at our histogram. Additionally, we can get a good sense of our data distribution. But what happens when you reduce the number of bins (e.g. set bins = 4)?

ggplot(dummy_data, aes(x = value, fill = group)) +
  geom_histogram(position = "identity", alpha = 0.7) +
  geom_rug(aes(color = group), alpha = 0.75)

We lose information about sample size in our density plot (note that both curves are ~the same height, despite group B having 2x as many observations). However, they’re great for visualizing the shape of our distributions since they are unaffected by the number of bins.

ggplot(dummy_data, aes(x = value, fill = group)) +
  geom_density(alpha = 0.7) +
  geom_rug(aes(color = group), alpha = 0.75)
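The same contrast can be expressed in numbers rather than pictures. In this sketch we rebuild the dummy data (with a seed added for reproducibility), then compute what each chart type encodes using base R's hist() and density(). The helper area_under_kde() is a hypothetical name introduced here:

```r
set.seed(240)
dummy_data <- data.frame(value = c(rnorm(n = 100, mean = 5),
                                   rnorm(n = 200, mean = 10)),
                         group = rep(c("A", "B"), times = c(100, 200)))
by_group <- split(dummy_data$value, dummy_data$group)

# Histogram logic: bar heights are counts, so they add up to the sample size
total_counts <- sapply(by_group, function(v) sum(hist(v, plot = FALSE)$counts))

# Density logic: each curve is normalized, so its area is ~1 for any sample size
area_under_kde <- function(v) {
  kde <- density(v)
  sum(diff(kde$x) * (head(kde$y, -1) + tail(kde$y, -1)) / 2)
}
kde_areas <- sapply(by_group, area_under_kde)

total_counts  # A = 100, B = 200
kde_areas     # both close to 1
```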

Combining geoms - histogram & density plot

We can overlay histogram and density plots to check that smoothing assumptions of the density curve align with the actual data distribution. This requires rescaling the histogram to match the density curve scale. Adding y = after_stat(density) within the aes() function rescales the histogram counts so that bar areas integrate to 1:

ggplot(mko_clean, aes(x = Temp_bot, y = after_stat(density))) + # scale down hist to match density curve
  geom_histogram(fill = "gray", color = "black", alpha = 0.75) +
  geom_density(linewidth = 1) # `linewidth` replaces `size` for lines as of ggplot2 3.4.0

after_stat(density) tells ggplot to use the computed density stat for the y aesthetic
by default, a histogram computes counts/frequency for the bins, but with after_stat(density), y values for the histogram bins will represent the density rather than counts
after_stat() delays evaluation of the y aesthetic until after the stat has computed its values (here, the binning done for geom_histogram()), which happens after the aesthetic mappings inside ggplot() are first evaluated

See https://stackoverflow.com/questions/46734555/ggplot2-histogram-why-do-y-density-and-stat-density-differ
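The rescaling itself is simple arithmetic: each bin's density is its count divided by (total count × bin width). Here is a sketch of that calculation using base R's hist() on simulated values; hist() is used only as a convenient source of bin counts, and its bins will not match the ones geom_histogram() chooses:

```r
set.seed(240)
x <- rnorm(1000, mean = 15, sd = 2)  # simulated values, for illustration only

h <- hist(x, plot = FALSE)   # bin the data without drawing anything
bin_width <- diff(h$breaks)  # width of each bin

# Rescale counts so that bar areas (height * width) add up to 1
density_by_hand <- h$counts / (sum(h$counts) * bin_width)

sum(density_by_hand * bin_width)  # total bar area
```

Once the bars are on this scale, they share a y-axis with the density curve, which is what makes the overlay meaningful.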

Insights:

The density curve can help identify patterns that might be harder to see in the histogram due to bin size or alignment.
If the bins are too wide, important details in the data distribution might be missed in the histogram, but the density curve can reveal them.
Conversely, if the curve is overly smoothed (due to a large adjust), it might miss sharp features present in the histogram.

Scaled density plots for comparing groups to a whole

In a normal density plot, the area under the curve(s) is equal to 1. In a scaled density plot, the area under the curve reflects the number of observations for each group.

We can use scaled density plots to compare individual group distributions to the total distribution. Demonstrated here, using the penguins data set:

# use `after_stat(count)` to plot density of observations ----
ggplot(penguins, aes(x = body_mass_g, y = after_stat(count))) +
 
  # plot full distribution curve with label "all penguins"; remove 'species' col so that this doesn't get faceted later on ----
  geom_density(data = select(penguins, -species), 
               aes(fill = "all penguins"), color = "transparent") +
  
  # plot second curve with label "species" ----
  geom_density(aes(fill = "species"), color = "transparent") +
  
  # facet wrap by species ----
  facet_wrap(~species, nrow = 1) +
  
  # update colors, x-axis label, legend position ----
  scale_fill_manual(values = c("grey","green4"), name = NULL) +
  labs(x = "Body Mass (g)") +
  theme(legend.position = "top")
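The computed count variable is the density multiplied by the number of observations, so the area under each scaled curve is roughly the group's sample size. A base R sketch with two simulated groups of different sizes (made-up values, not the penguins data; scaled_area() is a hypothetical helper):

```r
set.seed(240)
small_group <- rnorm(50,  mean = 3700, sd = 450)  # simulated body masses (g)
large_group <- rnorm(150, mean = 5000, sd = 500)

# Hypothetical helper: area under a KDE rescaled by the number of observations
scaled_area <- function(v) {
  kde <- density(v)
  count_curve <- kde$y * length(v)  # density * n
  sum(diff(kde$x) * (head(count_curve, -1) + tail(count_curve, -1)) / 2)
}

scaled_area(small_group)  # close to 50
scaled_area(large_group)  # close to 150
```

This is why a species curve can sit visibly inside the "all penguins" curve: both are on a count scale, so a part can never exceed the whole.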

Ridgeline plots - {ggridges}

What are they?

Ridgeline plots show the distribution of a numeric variable for multiple groups.

Need:

a numeric variable with lots of values

Important considerations:

work well when you have multiple (> 3) groups
work well when there is a clear pattern in the result (e.g. if there is an obvious ranking in groups) and / or when visualizing changes in distributions over time or space

Ridgeline plots - good for multiple groups

The {ggridges} package has a number of different geoms for creating ridgeline plots that work well for data sets with larger group numbers (e.g. months). Two great geoms to explore (to start):

geom_density_ridges() to create a basic ridgeline plot:

ggplot(mko_clean, aes(x = Temp_bot, y = month_name)) +
  ggridges::geom_density_ridges()

geom_density_ridges_gradient() to fill with a color gradient:

ggplot(mko_clean, aes(x = Temp_bot, y = month_name, fill = after_stat(x))) +
  ggridges::geom_density_ridges_gradient() +
  scale_fill_gradientn(colors = c("#2C5374","#849BB4", "#D9E7EC", "#EF8080", "#8B3A3A"))

Ridgeline plots - adjustments

Group order
Overlap & tails
Quantiles
Jitter raw data

Order by month (ideal, since months have an inherent order; alternatively, do in data wrangling step, e.g. mutate(month_name = factor(month_name, levels = rev(month.name)))):

ggplot(mko_clean, aes(x = Temp_bot, y = month_name, fill = after_stat(x))) +
  ggridges::geom_density_ridges_gradient() +
  scale_y_discrete(limits = rev(month.name)) +
  scale_fill_gradientn(colors = c("#2C5374","#849BB4", "#D9E7EC", "#EF8080", "#8B3A3A"))

Order by mean or median (makes more sense when you have unordered groups):

mko_clean |> 
  mutate(month_name = fct_reorder(month_name, Temp_bot, .fun = mean)) |> 
  ggplot(aes(x = Temp_bot, y = month_name, fill = after_stat(x))) + # data are piped in, so don't pass `mko_clean` again
  ggridges::geom_density_ridges_gradient() +
  scale_fill_gradientn(colors = c("#2C5374","#849BB4", "#D9E7EC", "#EF8080", "#8B3A3A"))
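Reordering a factor by a summary statistic can also be sketched in base R with reorder(), which plays a similar role to fct_reorder(). The toy values below are made up. Note the na.rm = TRUE: without it, any group containing an NA gets an NA mean and cannot be ranked sensibly, something to keep in mind for columns like Temp_bot that contain missing values:

```r
# Toy data (hypothetical values); group "b" contains a missing value
toy <- data.frame(
  group = c("a", "a", "b", "b", "c", "c"),
  value = c(10, 12, 3, NA, 6, 8)
)

# Order factor levels by the group mean, ignoring NAs
ordered_group <- reorder(factor(toy$group), toy$value, FUN = mean, na.rm = TRUE)

levels(ordered_group)  # "b" "c" "a" (means 3, 7, 11)
```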

rel_min_height sets the threshold for the relative height of density curves – any density values below this threshold are treated as 0. scale controls the extent to which the different densities overlap:

ggplot(mko_clean, aes(x = Temp_bot, y = month_name, fill = after_stat(x))) +
  ggridges::geom_density_ridges_gradient(rel_min_height = 0.01, scale = 3) +
  scale_y_discrete(limits = rev(month.name)) +
  scale_fill_gradientn(colors = c("#2C5374","#849BB4", "#D9E7EC", "#EF8080", "#8B3A3A"))
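The idea behind a relative height cutoff can be illustrated on a single curve. This is a simplified sketch on simulated values, not the {ggridges} implementation: here the cutoff is taken relative to the maximum of one curve, whereas a ridgeline plot has many curves to consider:

```r
set.seed(240)
x <- rnorm(500, mean = 15, sd = 2)  # simulated values, for illustration only
kde <- density(x)

rel_min_height <- 0.01

# Keep only the parts of the curve at least 1% as tall as its highest point
keep <- kde$y >= rel_min_height * max(kde$y)

range(kde$x)        # full extent, including the long, nearly flat tails
range(kde$x[keep])  # trimmed extent
```

Trimming the tails removes the thin lines that otherwise trail across the plot and clutter neighboring ridges.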

Include a median line by using stat_density_ridges() with quantile_lines = TRUE and setting the number of quantiles to 2:

ggplot(mko_clean, aes(x = Temp_bot, y = month_name)) +
  ggridges::stat_density_ridges(rel_min_height = 0.01, scale = 3,
                                quantile_lines = TRUE, quantiles = 2) +
  scale_y_discrete(limits = rev(month.name))
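Why does quantiles = 2 produce a median line? Splitting data into 2 equal-sized groups takes one cut point, at the 50th percentile. The sketch below makes that explicit with a hypothetical helper, quantile_cuts(), and base R's quantile() on simulated values:

```r
# Hypothetical helper: probabilities that separate `n` equal-sized quantile groups
quantile_cuts <- function(n) seq(0, 1, length.out = n + 1)[-c(1, n + 1)]

quantile_cuts(2)  # 0.5 -> a single line at the median
quantile_cuts(4)  # 0.25 0.50 0.75 -> lines at the quartiles

set.seed(240)
x <- rnorm(500, mean = 15, sd = 2)  # simulated values, for illustration only
quantile(x, probs = quantile_cuts(2))  # same value as median(x)
```

By the same logic, setting the number of quantiles to 4 would draw three lines per ridge, one at each quartile.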

Visualize the raw data underlying the density ridges. Our temperature data set is too large for this (>473,000 rows), so we’ll use the {palmerpenguins} penguins data set to demo:

Jittered points:

ggplot(penguins, aes(x = body_mass_g, y = species)) +
  ggridges::geom_density_ridges(jittered_points = TRUE, 
                                alpha = 0.5, point_size = 0.5)

Raincloud plot:

ggplot(penguins, aes(x = body_mass_g, y = species)) +
  ggridges::geom_density_ridges(jittered_points = TRUE, alpha = 0.5, 
                                point_size = 0.5, scale = 0.6,
                                position = "raincloud")

Box plots - ggplot2::geom_boxplot()

What are they?

Box plots summarize the distribution of a numeric variable for one or several groups.

Need:

a numeric variable, often with multiple groups

Important considerations:

box plots summarize data, meaning we can’t see the underlying shape of the distribution or sample size
add jittered points on top, or if large sample size, consider a violin plot

Box plots - good for multiple groups

Box plots are great for a few to multiple groups (too many boxes just results in a lot of information to synthesize, as a viewer). If your x-axis text is long, consider flipping your axes to make them less crunched:

ggplot(mko_clean, aes(x = month_name, y = Temp_bot)) +
  geom_boxplot() +
  scale_x_discrete(limits = rev(month.name)) +
  coord_flip()

Box plots - adjustments

Outliers
Highlight a group(s)
Jitter raw data
Dodged groups
Overlay beeswarm

You can modify outlier aesthetics inside geom_boxplot():

ggplot(mko_clean, aes(x = month_name, y = Temp_bot)) +
  geom_boxplot(outlier.color = "purple", outlier.shape = 1, outlier.size = 5) +
  scale_x_discrete(limits = rev(month.name)) +
  coord_flip()
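Which points count as outliers? By default, geom_boxplot() draws whiskers out to the most extreme values within 1.5 × IQR of the box edges and plots anything beyond that individually. Here is a sketch of that rule in base R, on made-up values with one extreme point:

```r
values <- c(12, 13, 13, 14, 14, 15, 15, 16, 22)  # made-up values

q <- quantile(values, probs = c(0.25, 0.5, 0.75))  # box edges and median
iqr <- q[[3]] - q[[1]]                             # interquartile range

# Points beyond these fences are drawn as individual outliers
lower_fence <- q[[1]] - 1.5 * iqr
upper_fence <- q[[3]] + 1.5 * iqr

outliers <- values[values < lower_fence | values > upper_fence]
outliers  # 22
```

Keep in mind that "outlier" here is purely a drawing convention based on the IQR; it does not by itself mean a data point is erroneous.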

Highlight a group of interest – one easy way to do so is by using the {gghighlight} package. Here, we specify a specific month ("October") to highlight:

mko_clean |> 
  ggplot(aes(x = month_name, y = Temp_bot, fill = month_name)) +
  geom_boxplot() +
  scale_x_discrete(limits = rev(month.name)) +
  gghighlight::gghighlight(month_name == "October") +
  coord_flip() +
  theme(legend.position = "none")

Since box plots hide sample size, consider overlaying raw data points using geom_jitter() (since our temperature data is too large (almost 500k rows), we’ll use the penguins data set to demo):

NOTE: Be sure to remove outliers from the box plot layer, since plotting the raw data will result in those data points being drawn a second time:

ggplot(penguins, aes(x = species, y = body_mass_g)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(alpha = 0.5, width = 0.2)

You may have data where you want to include an additional grouping variable – for example, let’s say we want to plot penguin body masses by species and year. We’ll need to at least dodge our overlaid points so that they sit on top of the correct box. Preferably, we both jitter and dodge our points:

ggplot(penguins, aes(x = species, y = body_mass_g, color = as.factor(year))) +
  geom_boxplot(outlier.shape = NA) +
  geom_point(alpha = 0.5, 
             position = position_jitterdodge(jitter.width = 0.2))

Similar to overlaying the raw jittered data points, we can combine our box plot with a beeswarm plot using {ggbeeswarm}. Beeswarm plots visualize the density of data at each point, as well as arrange points that would normally overlap so that they fall next to one another instead. Consider using a standalone beeswarm plot here as well! We’ll again use the penguins data set to demo:

ggplot(penguins, aes(x = species, y = body_mass_g)) +
  geom_boxplot(outlier.shape = NA) +
  ggbeeswarm::geom_beeswarm(size = 1)

Violin plots - ggplot2::geom_violin()

What are they?

Violin plots visualize the distribution of a numeric variable for one or several groups, where the shape of the violin represents the density estimate of the variable (i.e. the more data points in a specific range, the larger the violin is for that range). They provide more information about the underlying distribution than a box plot.

Need:

a numeric variable, often with multiple groups

Important considerations:

ordering groups by median value (when it makes sense) can make it easier to understand
consider showing raw data when comparing groups with very different sample sizes (e.g. half violin plot)

Violin plots - good for multiple groups with lots of data

Violin plots are great for a few to multiple groups, and are often a better choice than box plots when you have a very large data set (and overlaying jittered points looks busy or downright unreasonable). If your x-axis text is long, consider flipping your axes to make them less crunched:

ggplot(mko_clean, aes(x = month_name, y = Temp_bot)) +
  geom_violin() +
  scale_x_discrete(limits = rev(month.name)) +
  coord_flip()

Combining geoms - adjustments

Overlay boxplot
Half-violin half-dot plot

Overlaying a box plot inside a violin plot can be helpful in providing your audience with summary stats in a compact form:

ggplot(mko_clean, aes(x = month_name, y = Temp_bot)) +
  geom_violin() +
  geom_boxplot(width = 0.1, color = "gray", alpha = 0.5, 
               outlier.color = "red") +
  scale_x_discrete(limits = rev(month.name)) +
  coord_flip()

The {see} package provides geom_violindot(), which is useful for simultaneously visualizing distribution and sample size. Because it can quickly get overcrowded with large sample sizes (like Temp_bot), we’ll use penguins to demo here:

ggplot(penguins, aes(x = species, y = bill_length_mm, fill = species)) +
  see::geom_violindot(size_dots = 5, alpha = 0.5) +
  theme(legend.position = "none")

Take a Break

~ This is the end of Lesson 2 (of 3) ~
