Visualizing the spread of a numeric variable

EDS 240: Lecture 2.2
Visualizing distributions
Week 2 | January 14th, 2026
Visualizing data distribution?
Visualizing the spread of a numeric variable

“Core” distribution chart types
Histograms

Density plots

Ridgeline plots

Box plots

Violin plots

Examples show the distribution of penguin body masses (g) for Adelie, Chinstrap & Gentoo penguins.
The data: bottom temperatures at Mohawk Reef
The Santa Barbara Coastal Long Term Ecological Research (SBC LTER) site was established in 2000 to understand the ecology of coastal kelp forest ecosystems. A number of coastal rocky reef sites are outfitted with instrumentation that collects long-term monitoring data.


We’ll be exploring bottom temperatures recorded at Mohawk Reef, a near-shore rocky reef and one of the Santa Barbara Coastal (SBC) LTER research sites.
Data wrangling
Download data from the EDI Data Portal & explore the full metadata package to learn more about these data.
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
## setup ----
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#..........................load packages.........................
library(tidyverse)
library(chron)
library(naniar)
library(ggridges)
library(gghighlight)
library(ggbeeswarm)
library(palmerpenguins) # for some minimal examples
#..........................import data...........................
mko <- read_csv(here::here("week2", "data", "mohawk_mooring_mko_20250117.csv"))
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
## wrangle data ----
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
mko_clean <- mko |>
  # keep only necessary columns ----
  select(year, month, day, decimal_time, Temp_bot, Temp_top, Temp_mid) |>
  # create datetime column (not totally necessary for our plots, but it can be helpful to know how to do this!) ----
  unite(date, year, month, day, sep = "-", remove = FALSE) |>
  mutate(time = chron::times(decimal_time)) |>
  unite(date_time, date, time, sep = " ") |>
  # coerce data types ----
  mutate(date_time = as_datetime(date_time, "%Y-%m-%d %H:%M:%S", tz = "GMT"),
         year = as.factor(year),
         month = as.factor(month),
         day = as.numeric(day)) |>
  # add month name by indexing the built-in `month.name` vector ----
  mutate(month_name = month.name[month]) |>
  # replace 9999s with NAs ----
  naniar::replace_with_na(replace = list(Temp_bot = 9999,
                                         Temp_top = 9999,
                                         Temp_mid = 9999)) |>
  # select/reorder desired columns ----
  select(date_time, year, month, day, month_name, Temp_bot, Temp_mid, Temp_top)
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
## explore missing data ----
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Missing data can have unexpected effects on your analyses
# It's important to explore your data for missing values so that you can decide if and how to handle them
# Data loggers can be prone to missingness (e.g. full memory, dead batteries, replacement logger)
# We can use {naniar} to explore the frequency and patterns in missing data
# Below is a short example of some naniar tools for doing so
# Check out The Missing Book (https://tmb.njtierney.com/) by Nick Tierney and Allison Horst for some great guidance
#..........counts & percentages of missing data by year..........
see_NAs <- mko_clean |>
  group_by(year) |>
  naniar::miss_var_summary() |>
  filter(variable == "Temp_bot")
#...................visualize missing Temp_bot...................
bottom <- mko_clean |> select(Temp_bot)
missing_temps <- naniar::vis_miss(bottom)
Histograms vs. Density Plots
Both of these plots show the distribution of a numeric variable (ocean bottom temperature in °C).
Histogram

Density plot

What can you glean from each of these?
Histogram

Numeric variable is divided into several bins. The height of each bar represents the number of observations in that bin.
Density plot

A smoothed version of a histogram, which uses a kernel density estimate (KDE) to show the variable’s probability density function. The y-axis represents the estimated density, i.e. the relative likelihood of a value occurring. The area under each curve integrates to 1.
Check out this cool interactive tool, by Matthew Conlen, for a visual explanation of KDE.
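As a rough check of the "area equals 1" property, here is a minimal base R sketch. The values are simulated purely for illustration (they are not the Mohawk Reef data), and the trapezoid rule is just one simple way to approximate the area under the estimated curve.

```r
set.seed(240) # reproducible simulated values (illustrative only)
x <- rnorm(500, mean = 15, sd = 2)

# kernel density estimate, evaluated on a grid of points
kde <- density(x)

# approximate the area under the curve with the trapezoid rule
area <- sum(diff(kde$x) * (head(kde$y, -1) + tail(kde$y, -1)) / 2)
round(area, 2)
```

The result should land very close to 1 regardless of how many values are simulated, which is why density curves say nothing about sample size.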
An important distinction
Histograms show the counts (frequency) of values in each range (bin), represented by the height of the bars.
Density plots show the relative proportion of values across the range of a variable. The total area under the curve equals 1, and peaks indicate where values are more concentrated. Density plots do not show the absolute number of observations.
We’ll use some dummy data to demonstrate how this differs visually:
Here, we have two groups (A, B) of values which are normally distributed, but with different means. Group A also has a smaller sample size (100) than group B (200).
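One way such dummy data could be generated is sketched below. The object name `dummy_data`, the means (5 and 7), and the seed are assumptions chosen for illustration; only the sample sizes (100 and 200) come from the description above.

```r
library(tidyverse)

set.seed(240) # make the simulated values reproducible

# two normally distributed groups with different means and sample sizes
dummy_data <- tibble(
  value = c(rnorm(n = 100, mean = 5), rnorm(n = 200, mean = 7)),
  group = rep(c("A", "B"), times = c(100, 200))
)

# confirm the group sizes
count(dummy_data, group)
```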
We can see that group B has a larger sample size than group A when looking at our histogram. Additionally, we can get a good sense of our data distribution. But what happens when you reduce the number of bins (e.g. set bins = 4)?
We lose information about sample size in our density plot (note that both curves are ~the same height, despite group B having 2x as many observations). However, they’re great for visualizing the shape of our distributions since they are unaffected by the number of bins.
Rug plots are added as an alternative way to visualize the data distribution and also as an indicator of sample size.
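A hedged sketch of the two plots described above. It rebuilds the dummy data so that it runs on its own; the means, seed, transparency values, and object names are assumptions rather than the lecture's exact settings.

```r
library(tidyverse)

set.seed(240)
# simulated groups: A (n = 100) and B (n = 200), with assumed means
dummy_data <- tibble(
  value = c(rnorm(100, mean = 5), rnorm(200, mean = 7)),
  group = rep(c("A", "B"), times = c(100, 200))
)

# histogram: bar heights are counts, so group B's bars are taller overall
hist_plot <- ggplot(dummy_data, aes(x = value, fill = group)) +
  geom_histogram(position = "identity", alpha = 0.7, bins = 30) +
  geom_rug(aes(color = group), alpha = 0.75)

# density plot: each curve has area 1, so sample size is no longer visible
dens_plot <- ggplot(dummy_data, aes(x = value, fill = group)) +
  geom_density(alpha = 0.7) +
  geom_rug(aes(color = group), alpha = 0.75)
```

Try changing `bins = 30` to `bins = 4` in `hist_plot` to see how much of the distribution shape disappears.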
Considerations
Use a histogram or density plot when you want to learn about the distribution of a numeric variable that has lots of values (observations) with meaningful differences between those values. It’s also important to keep the following considerations in mind:


Avoid plotting too many groups at once
Histograms and density plots don’t work well when you have too many groups to plot at once. Twelve groups (month_name) is too many, especially when the ranges of temperature values for our groups largely overlap:
Consider faceting (small multiples)
If you want to plot all groups, consider splitting them into small multiples. If so, does color add any valuable information? Remove if not:
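A minimal sketch of small multiples. To keep it runnable without the data file, it uses `temps_sim`, a simulated stand-in for `mko_clean` (the values and the seasonal pattern are hypothetical). With the real data, the same `ggplot()` call should work on `mko_clean` once `month_name` is converted to an ordered factor.

```r
library(tidyverse)

set.seed(240)
# hypothetical monthly means: coldest in April, warmest in October
monthly_means <- 14 + 2.5 * sin((1:12 - 7) * pi / 6)

# simulated stand-in for `mko_clean` (200 made-up values per month)
temps_sim <- tibble(
  month_name = factor(rep(month.name, each = 200), levels = month.name),
  Temp_bot = rnorm(2400, mean = rep(monthly_means, each = 200), sd = 1)
)

# one panel per month; a single fill color, since color adds no information here
facet_plot <- ggplot(temps_sim, aes(x = Temp_bot)) +
  geom_density(fill = "gray30") +
  facet_wrap(~month_name)
```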
Consider plotting fewer groups
Do you need all (12) groups, or can you share the most relevant data using fewer groups? Let’s compare just three months: April (generally the coldest month), October (generally a hot month), and June (somewhere in between):
mko_clean |>
  mutate(month_name = factor(month_name, levels = month.name)) |>
  filter(month_name %in% c("April", "June", "October")) |>
  ggplot(aes(x = Temp_bot, fill = month_name)) +
  geom_histogram(position = "identity", alpha = 0.5, color = "black") +
  scale_fill_manual(values = c("#2C5374", "#ADD8E6", "#8B3A3A"))
mko_clean |>
  filter(month_name %in% c("April", "June", "October")) |>
  ggplot(aes(x = Temp_bot, fill = month_name)) +
  geom_density(alpha = 0.5) +
  scale_fill_manual(values = c("#2C5374", "#ADD8E6", "#8B3A3A"))
Why are the months still in chronological order, despite not reordering them using mutate(), as we do for our histogram?
Modify bin / bandwidths
Modify binwidth (30 bins by default) – does a bin width of 1 (degree Celsius) actually make sense? Consider scale of interest. Also be mindful when using bins – too few bins will result in loss of distribution shape.
mko_clean |>
  filter(month_name %in% c("April", "June", "October")) |>
  mutate(month_name = factor(month_name, levels = month.name)) |>
  ggplot(aes(x = Temp_bot, fill = month_name)) +
  geom_histogram(position = "identity", alpha = 0.5, binwidth = 1) +
  scale_fill_manual(values = c("#2C5374", "#ADD8E6", "#8B3A3A"))
Modify bandwidth by declaring a multiplier of the default bandwidth adjustment (default adjust = 1). A small bandwidth leads to undersmoothing, a large bandwidth leads to oversmoothing. Goal: accurately visualize the true underlying data distribution shape while reducing noise:
geom_density() relies on density(), which uses the nrd0 bandwidth selector (Silverman’s rule-of-thumb-style estimator). This is a reliable starting point, but you may consider other selectors based on your data. Check out this article to read more about bandwidth selection.
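A small sketch of how `adjust` scales the default bandwidth. It uses penguin body masses from {palmerpenguins} as a self-contained stand-in for the temperature data, and the `adjust` values (0.25 and 3) are arbitrary choices meant to exaggerate the effect.

```r
library(tidyverse)
library(palmerpenguins)

penguins_complete <- drop_na(penguins, body_mass_g)
body_mass <- penguins_complete$body_mass_g

# default bandwidth chosen by the nrd0 selector
default_bw <- bw.nrd0(body_mass)

# `adjust` multiplies the default bandwidth: < 1 undersmooths, > 1 oversmooths
undersmoothed <- ggplot(penguins_complete, aes(x = body_mass_g)) +
  geom_density(adjust = 0.25)

oversmoothed <- ggplot(penguins_complete, aes(x = body_mass_g)) +
  geom_density(adjust = 3)
```

Comparing the two plots side by side with the default (`adjust = 1`) is a quick way to judge whether bumps in the curve reflect real structure or noise.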
Overlay histogram & density plots as a sanity check
Overlay histogram and density plots to check that smoothing assumptions of the density curve align with the actual data distribution. This requires rescaling the histogram to match the density curve scale. Adding y = after_stat(density) within the aes() function rescales the histogram counts so that bar areas integrate to 1:
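The overlay could look something like the sketch below. It again uses penguin body masses as a self-contained stand-in for the temperature data, and the bin count and colors are assumptions.

```r
library(tidyverse)
library(palmerpenguins)

penguins_complete <- drop_na(penguins, body_mass_g)

# mapping y to `after_stat(density)` puts both layers on the density scale
overlay_plot <- ggplot(penguins_complete,
                       aes(x = body_mass_g, y = after_stat(density))) +
  geom_histogram(bins = 30, fill = "gray23", alpha = 0.6) +
  geom_density(linewidth = 1)
```

Because the histogram is still sensitive to the number of bins, it is worth trying a few values of `bins` before concluding that the density curve over- or undersmooths.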
What should you carefully consider when checking the smoothing assumptions of your density curve against a histogram?
Check out this great blog post on the after_stat() function, by June Choe
If you have multiple to many groups, consider these alternatives:
Ridgeline plots, box plots, and violin plots are better suited for visualizing the distribution of a numeric variable with many (e.g. >3) groups.
Ridgeline plot

Box plot

Violin plot

Appropriately ordering groups is important for improving readability. If a natural order exists (e.g. months of the year), use it. If not, order groups by a meaningful summary statistic, such as the median (e.g. ordering penguin species by median body weight).
Let’s build them!
Ridgeline plots for comparing distribution shapes across many ordered groups
Ridgeline plots are most effective when groups have a meaningful order, such as an inherent ranking, or when you want to visualize how distributions change over time (e.g. months, years) or space (e.g. longitude, elevation). The {ggridges} package has various geoms for creating ridgeline plots, including:
geom_density_ridges():
mko_clean |>
  mutate(month_name = factor(month_name, levels = rev(month.name))) |> # alt, within ggplot: `scale_y_discrete(limits = rev(month.name))`
  ggplot(aes(x = Temp_bot, y = month_name)) +
  # `rel_min_height` sets threshold for relative height of density curves (any values below threshold treated as 0); `scale` controls extent to which different densities overlap
  ggridges::geom_density_ridges(rel_min_height = 0.01, scale = 3)
geom_density_ridges_gradient():
mko_clean |>
  mutate(month_name = factor(month_name, levels = rev(month.name))) |> # alt, within ggplot: `scale_y_discrete(limits = rev(month.name))`
  # `fill = after_stat(x)` tells ggplot to compute the `x` values (representing `Temp_bot`) after the statistical transformation (density estimation) and map those computed `x` values to the `fill` aesthetic. As a result, the gradient fill of each density curve will reflect the temperature values along the x-axis.
  ggplot(aes(x = Temp_bot, y = month_name, fill = after_stat(x))) +
  # `rel_min_height` sets threshold for relative height of density curves (any values below threshold treated as 0); `scale` controls extent to which different densities overlap
  ggridges::geom_density_ridges_gradient(rel_min_height = 0.01, scale = 3) +
  scale_fill_gradientn(colors = c("#2C5374","#849BB4", "#D9E7EC", "#EF8080", "#8B3A3A"))
Box plots for comparing distribution summaries across multiple groups
Box plots summarize data, meaning they don’t show the underlying shape of the distribution or sample size (though jittered points can be added, if appropriate). They provide a compact summary of a dataset’s center, spread and indications of skewness, and allow many groups to be compared side-by-side while remaining readable.
Image source: A complete guide to box plots, by Mike Yi (Atlassian)
Box plots for comparing distribution summaries across groups
If your x-axis text is long, consider flipping your axes to make them less crunched:
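A sketch of a flipped box plot with groups reordered inside the ggplot call. It uses `temps_sim`, a simulated stand-in for `mko_clean` (hypothetical values), so that it runs without the data file.

```r
library(tidyverse)

set.seed(240)
# hypothetical monthly means: coldest in April, warmest in October
monthly_means <- 14 + 2.5 * sin((1:12 - 7) * pi / 6)

# simulated stand-in for `mko_clean` (200 made-up values per month)
temps_sim <- tibble(
  month_name = rep(month.name, each = 200),
  Temp_bot = rnorm(2400, mean = rep(monthly_means, each = 200), sd = 1)
)

box_plot <- ggplot(temps_sim, aes(x = month_name, y = Temp_bot)) +
  geom_boxplot() +
  scale_x_discrete(limits = rev(month.name)) + # reorder groups within ggplot
  coord_flip() # month names now read along the vertical axis
```

Reversing the limits means January ends up at the top of the flipped plot, which tends to read more naturally.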
What are the tradeoffs between reordering groups within a ggplot (as above) vs. during the data wrangling stage (e.g. as we did for our histogram and density and ridgeline plots)?
Highlight group(s) of interest to focus attention
The {gghighlight} package makes this super easy:
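A hedged sketch using the penguins data as a self-contained stand-in. The highlighted species is an arbitrary choice, and `use_group_by = FALSE` is set here on the assumption that a simple row-wise filter is what we want for this predicate.

```r
library(tidyverse)
library(gghighlight)
library(palmerpenguins)

highlight_plot <- penguins |>
  drop_na(body_mass_g) |>
  ggplot(aes(x = species, y = body_mass_g, fill = species)) +
  geom_boxplot() +
  # keep Gentoo in color; the other boxes are greyed out
  gghighlight(species == "Gentoo", use_group_by = FALSE) +
  theme(legend.position = "none")
```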
Overlay jittered data, if appropriate
Since box plots hide sample size, consider overlaying raw data points using geom_jitter(). It’s important that you hide the outliers drawn by the box plot layer (e.g. with outlier.shape = NA), since overlaying raw data means those data points would otherwise be plotted a second time.
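A minimal sketch with the penguins data; the jitter width and transparency are assumptions chosen for readability.

```r
library(tidyverse)
library(palmerpenguins)

jitter_plot <- penguins |>
  drop_na(body_mass_g) |>
  ggplot(aes(x = species, y = body_mass_g)) +
  geom_boxplot(outlier.shape = NA) + # hide outliers so they are not drawn twice
  geom_jitter(alpha = 0.5, width = 0.2) # raw observations, spread horizontally
```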
Overlaying raw data does not work when you have many observations:
Beeswarm plots as an alternative
Similar to overlaying raw jittered data points, we can combine our box plot with a beeswarm plot using the {ggbeeswarm} package. Beeswarm plots visualize the density of data at each point, as well as arrange points that would normally overlap so that they fall next to one another instead. Consider using a standalone beeswarm plot here as well! We’ll again use the penguins data set to demo:
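A sketch of the box plot plus beeswarm combination described above; the point size is an assumption.

```r
library(tidyverse)
library(ggbeeswarm)
library(palmerpenguins)

beeswarm_plot <- penguins |>
  drop_na(body_mass_g) |>
  ggplot(aes(x = species, y = body_mass_g)) +
  geom_boxplot(outlier.shape = NA) + # outliers are already shown as raw points
  ggbeeswarm::geom_beeswarm(size = 1) # overlapping points are placed side by side
```

Dropping the `geom_boxplot()` layer gives the standalone beeswarm version.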
Dodge when you have an additional grouping variable
You may have data where you want to include an additional grouping variable – for example, let’s say we want to plot penguin body masses by species and year. We’ll need to at least dodge our overlaid points so that they sit on top of the correct box. Preferably, we both jitter and dodge our points:
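One way to jitter and dodge at the same time is `position_jitterdodge()`, sketched below with assumed jitter width and transparency values.

```r
library(tidyverse)
library(palmerpenguins)

penguins_by_year <- penguins |>
  drop_na(body_mass_g) |>
  mutate(year = as.factor(year)) # treat year as a discrete grouping variable

dodge_plot <- ggplot(penguins_by_year,
                     aes(x = species, y = body_mass_g, color = year)) +
  geom_boxplot(outlier.shape = NA) +
  # dodge points by year so they sit over the matching box, then jitter them
  geom_point(alpha = 0.5,
             position = position_jitterdodge(jitter.width = 0.2))
```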
Violin plots for comparing distribution shapes across multiple groups
Violin plots show the shape of a distribution by visualizing the kernel density estimate (KDE) of a variable (i.e. ranges with more data points appear wider). They’re similar to density plots, but make it much easier to compare distribution shapes across multiple groups.
Image source: A complete guide to violin plots, by Mike Yi (Atlassian)
Overlay your violin plot with other geoms to add context
Be sure to order your groups appropriately (e.g. by natural order, or by median value) and consider overlaying another chart type (e.g. box plot for many data points, rug or beeswarm for low-medium number of data points) for additional context. Rotate your axes if your x-axis text is long:
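A sketch that combines these suggestions using the penguins data: species are ordered by median body mass with `forcats::fct_reorder()`, a narrow box plot is overlaid, and the axes are flipped. The box width is an assumption.

```r
library(tidyverse)
library(palmerpenguins)

penguins_ordered <- penguins |>
  drop_na(body_mass_g) |>
  # order species by median body mass
  mutate(species = fct_reorder(species, body_mass_g, .fun = median))

violin_plot <- ggplot(penguins_ordered, aes(x = species, y = body_mass_g)) +
  geom_violin() +
  geom_boxplot(width = 0.2, outlier.shape = NA) + # summary statistics for context
  coord_flip()
```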
Take a Break