EDS 240: Lecture 5.1

What makes a good data viz?

Week 5 | February 3rd, 2025

Looking forward . . .

Choosing the right graphic form is just the first step! It’s important to consider how you can enhance your visualization by:

applying pre-made and custom color palettes

updating fonts

adding annotations

fine-tuning themes

centering our primary message

We’ll start by familiarizing ourselves with a general set of rules and best practices for making “good” data viz.

Good data visualization design considers:

data-ink ratio (less is more, within reason)
how to reduce eye movement and improve readability / interpretability (e.g. through alternative legend positions, direct annotations)
putting things in context
how to draw the main attention to the most important info
consistent use of colors, spacing, typefaces, weights
typeface / font choices and how they affect both readability and emotions and perceptions
using visual hierarchy to guide the reader
color choices (incl. palette types, emotions, readability)
how to tell an interesting story
how to center the people and communities represented in your data
accessibility through colorblind-friendly palettes & alt text

Always consider the above during your design process, though not every point will be necessary for every visualization.

We’re going to talk about these first few points, to start.

Simplify plots to reduce eye movement & improve readability / interpretability

Data-Ink ratio: remove non-data ink

The Data-Ink ratio was introduced by Edward Tufte (1983), who argues that non-data ink (i.e. ink used for everything except the presentation of the data itself) should be removed wherever possible.

\[ \text{Data-ink ratio} = \frac{\text{Data-ink}}{\text{Total ink used to print the graphic}} \]

Do so by starting with a complete theme (e.g. theme_classic(), theme_void()), then adding / removing elements using theme().
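As a rough sketch of this workflow (using the built-in mtcars data as a stand-in, not the plot shown in lecture), one might start from a complete theme and then adjust individual elements:

```r
library(ggplot2)

# Stand-in data (an assumption): built-in mtcars, not the lecture's plot
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  theme_classic() +                                         # start from a complete theme
  theme(
    axis.line.y = element_blank(),                          # remove non-data ink
    axis.ticks = element_blank(),
    panel.grid.major.y = element_line(color = "grey90")     # add back a light guide
  )

p
```

Which elements to drop or restore is a judgment call that depends on the audience; the choices here are only illustrative.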

Low Data-Ink ratio

High Data-Ink ratio

Maximizing the Data-Ink ratio isn’t always best

Eliminating lots of non-data ink may render visualizations difficult to read
- Inbar et al. (2007) found that students preferred a more maximalist visualization design over the minimalist version proposed by Tufte

Design choices depend on audience and purpose – how you choose to maximize your data-ink ratio will depend largely on who your visualization is for and the purpose it’s meant to serve (e.g. a scientific publication may have specific requirements for the design / aesthetics of a visualization, while an infographic-style visualization may leave space for more creative liberties)

A general rule of thumb: aim to maximize the data-ink ratio while not sacrificing overall readability, design, aesthetics.

Remove redundant legend information

Ask yourself, “Does this legend provide additional information that I can’t get elsewhere?” If not, remove the legend using:

plot +
  theme(
    legend.position = "none"
  )

Doing so increases the data-ink ratio and reduces overall eye movement.

Add direct labels & minimize rotated text

We can use a combination of coord_flip() (or remap aesthetics), geom_text(), labs(), and theme() to further eliminate non-data ink and reduce overall eye movement.

A visualization like this might not be appropriate for all audiences / contexts (e.g. scientific journal) – but the takeaways remain clear, despite the removal of axes / text / legend.
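A minimal sketch of how these pieces might fit together, assuming a small made-up data frame (`counts` and its values are hypothetical). Here the aesthetics are remapped (categories on the y-axis) rather than using coord_flip():

```r
library(ggplot2)

# Hypothetical toy data, for illustration only
counts <- data.frame(
  group = c("Group A", "Group B", "Group C"),
  n = c(12, 30, 21)
)

p <- ggplot(counts, aes(x = n, y = reorder(group, n))) +      # categories on y: no rotated labels
  geom_col() +
  geom_text(aes(label = n), hjust = 1.2, color = "white") +   # label the bars directly
  labs(title = "Counts by group", x = NULL, y = NULL) +       # title replaces axis titles
  theme_minimal() +
  theme(
    axis.text.x = element_blank(),   # value axis is redundant once bars are labeled
    panel.grid = element_blank()
  )

p
```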

Move the legend (positioning)

Reduce eye movement by updating the legend position (e.g. move it onto the plot panel):

plot + 
  theme(
    legend.position = "inside", # place the legend on the plot panel (ggplot2 >= 3.5.0)
    legend.position.inside = c(0.85, 0.15) # x, y; you'll need to adjust these values for your plot; values range from 0 - 1
  )

Original plot:

Updated legend position:

Also note the redundant species mapping (color and shape) – sometimes redundancy is important for accessibility!

Move the legend (incorporate into title text)

Reduce eye movement and excess ink by including legend info in the plot (sub)title (here, using the {ggtext} package; minimal code example, below):

plot +
  labs(subtitle = "Some subtitle text where <span style='color:red;'>**these words**</span> are bolded and red") +
  theme(plot.subtitle = ggtext::element_markdown())

Original plot:

Legend as styled title text:

Move the legend (use direct labels)

Reduce eye movement and excess ink by including legend info as direct labels on the plot (here, using the {geomtextpath} package; minimal code example, below):

library(ggplot2)
library(palmerpenguins) # provides the penguins data set

ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm, color = species, shape = species)) +
  geom_point(size = 3, alpha = 0.8) +
  geomtextpath::geom_labelsmooth(aes(label = species), method = "lm", size = 5)

Original plot:

Legend as direct labels:

Use annotations to improve readability / interpretability

Is the y-axis necessary for this plot? What’s the author’s goal? How do annotations help achieve that goal?

Is white space always your friend?, by Neil Richards

Use annotations to improve readability / interpretability

“The key thing we do is to add a title to the chart, as an entry point and to explain what is going on. Text and other annotations add enormous value for non-chart people.”

-John Burn-Murdoch, Financial Times

Consider ways to provide additional context for your data

Plot groups against the whole when faceting

Facets (aka small multiples) allow us to more easily view individual groups. Here, the author plots individual groups (male vs. female passenger distributions on the Titanic) against the data set total (distribution of all passengers):

The area under each curve corresponds to the total number of male and female passengers with known age (468 (M) and 288 (F)).

The colored areas show the density estimates of the ages of M and F passengers, and the gray areas show the overall passenger age distribution.
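The same idea can be sketched with built-in data (iris is used here as a stand-in for the Titanic data, which is an assumption). Dropping the faceting variable from the background layer's data makes that layer repeat in every facet:

```r
library(ggplot2)

# Background data: the same observations with the faceting variable removed,
# so this layer is drawn in every facet
iris_all <- iris[, names(iris) != "Species"]

p <- ggplot(iris, aes(x = Sepal.Length, y = after_stat(count))) +
  geom_density(data = iris_all, fill = "gray80", color = NA) +    # whole data set
  geom_density(aes(fill = Species), color = NA, alpha = 0.8) +    # individual group
  facet_wrap(~Species) +
  theme(legend.position = "none")   # facet labels already identify each group

p
```

Mapping y to after_stat(count) scales each curve by its number of observations, so the areas are comparable between a group and the whole.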

Add benchmark values

Add vertical (geom_vline()) or horizontal (geom_hline()) lines at important values:

A minimal code example:

plot + 
  geom_vline(xintercept = 11) +
  geom_vline(xintercept = 16) +
  geom_vline(xintercept = 21)

Add 1:1 line, if relevant

For data where the relevant comparison is the x = y line (e.g. scatter plots of paired data), plot the 1:1 line.
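A minimal sketch, assuming simulated paired measurements (the `paired` data frame is hypothetical). geom_abline() draws the x = y line, and coord_fixed() with shared limits keeps it at 45 degrees:

```r
library(ggplot2)

# Simulated paired data, for illustration only
set.seed(1)
paired <- data.frame(before = runif(50, min = 0, max = 10))
paired$after <- paired$before + rnorm(50, sd = 1)

# Use the same limits on both axes so distances from the line are comparable
lims <- range(c(paired$before, paired$after))

p <- ggplot(paired, aes(x = before, y = after)) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "gray50") +  # 1:1 line
  geom_point() +
  coord_fixed(xlim = lims, ylim = lims)

p
```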

Below, the author compares gene expression levels in a mutant virus to the non-mutated (wild-type) variant. He presents three (increasingly better) versions of the same plot:

Bad

Better

Best

Watch this TED talk!

I highly recommend watching this awesome TED talk, The beauty of data visualization, by David McCandless. David does a fabulous job of showing the importance of providing context in your data visualizations.

Draw attention to important information / values

Use color to highlight groups / values

Highlight data by coloring groups of interest either manually or by using helpful packages, like {gghighlight} (we saw an example of this in lecture 2.3):
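One way the manual approach might look, sketched with the built-in iris data as a stand-in (an assumption; the lecture example uses different data). The group of interest gets a saturated color and everything else is muted:

```r
library(ggplot2)

p <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(aes(color = Species == "setosa"), size = 2) +   # TRUE for the group of interest
  scale_color_manual(
    values = c("TRUE" = "darkorange", "FALSE" = "gray75"),   # highlight vs. muted
    guide = "none"                                           # title carries the legend info
  ) +
  labs(title = "Highlighted group: setosa")

p
```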

Use annotations to highlight groups / values

Or add annotations to your plots to call attention to data of interest (here, shown using the {ggforce} package; minimal code example, below):

plot + 
  ggforce::geom_mark_ellipse(aes(filter = species == "Gentoo", label = "Gentoo penguins", 
                                 description = "This species tends to have..."))

What doesn’t work so well in data visualization?

Good data visualization design generally avoids…

information overload (e.g. too many colors / shapes / fonts, groups, variables)
dual axes (can easily mislead audiences)
pie charts (humans struggle to compare the sizes of angles)
3D plots (distort perception and are generally distracting)

Our job is to make it as easy as possible for our readers to understand our data without having to do mental gymnastics. The chart types described above (more often than not) ask too much of our readers in their quest to understand the information being presented.

There may be circumstances where the above are executed well…but more often than not, you’re safest avoiding them.

Information overload is no fun . . .

It can be nearly impossible to process many different variables, colors, shapes, etc. on the same visualization (and realistically, most people won’t want to take the time to even try):

Source: Stack Exchange

Source: Unknown, but borrowed from EDS 221

Source: Unknown, but borrowed from EDS 221

Reduce information overload whenever possible

Consider some of the approaches we’ve already discussed:

highlighting the most important groups / values
faceting (small multiples)
creating separate visualizations
cohesive and intuitive color scheme

Or some that we haven’t covered:

create interactive tables and / or visualizations using htmlwidgets (e.g. {leaflet} maps, {plotly} charts, {DT} data tables)
create reactive outputs using tools like {shiny}
- check out the EDS 296 (Intro to Shiny) materials as a starting point!

Dual y-axes can deliberately mislead readers

The scales of dual-axis charts are arbitrary and therefore can (deliberately) mislead readers about the relationship between the two data series. Let’s take this example using real World Bank data for German GDP and global GDP between 2004 and 2016:

While both GDPs may appear to increase at about the same rate, they actually don’t – global GDP increased by 80% until 2014, while the German GDP increased by 40%.

Alternatives to dual y-axes: side-by-side charts

Separate your data series into side-by-side charts – this allows us to create two different axes for two different charts.
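A minimal sketch of this alternative, assuming made-up values in long format (`two_series` is hypothetical and does not contain real GDP figures). Faceting with free y scales gives each series its own axis:

```r
library(ggplot2)

# Hypothetical values (not real GDP data): one row per year and series
two_series <- data.frame(
  year = rep(2004:2008, times = 2),
  series = rep(c("Series A", "Series B"), each = 5),
  value = c(2.8, 2.9, 3.0, 3.4, 3.7, 44, 47, 51, 58, 63)
)

p <- ggplot(two_series, aes(x = year, y = value)) +
  geom_line() +
  facet_wrap(~series, scales = "free_y")   # one panel, and one y-axis, per series

p
```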

Alternatives to dual y-axes: indexed charts

Indexed charts show the relative change (percentage increase or decrease) of a data series over time. Consider adding labels or tooltips (e.g. using {plotly}) to include important absolute numbers.
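To make the indexing step concrete, here is a sketch that rescales each series so its first year equals 100 (the data are made up, and the choice of base year is an assumption):

```r
library(ggplot2)

# Hypothetical values (not real GDP data), sorted by year within each series
two_series <- data.frame(
  year = rep(2004:2008, times = 2),
  series = rep(c("Series A", "Series B"), each = 5),
  value = c(2.8, 2.9, 3.0, 3.4, 3.7, 44, 47, 51, 58, 63)
)

# Index each series to its own first value (base year = 100)
two_series$index <- ave(
  two_series$value, two_series$series,
  FUN = function(v) v / v[1] * 100
)

p <- ggplot(two_series, aes(x = year, y = index, color = series)) +
  geom_hline(yintercept = 100, color = "gray70") +   # baseline for the base year
  geom_line() +
  labs(x = NULL, y = "Index (2004 = 100)")

p
```

Because both series now share one relative scale, a single y-axis is enough to compare their growth.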

Alternatives to dual y-axes: prioritize & label

Consider prioritizing and plotting the more important of the two data series. Then use annotations to add information about the omitted variable. This option may not work well for all data sets, but can be effective for dual-axis charts that present both absolute and relative numbers of the same measure.

Alternatives to dual y-axes: connected scatterplot

A connected scatterplot places one variable on the y-axis and the other on the x-axis (here, replacing time). Be mindful that these plots are generally less intuitive for a reader, who may need more time to decipher patterns.
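A rough sketch with made-up values (`wide`, `a`, and `b` are hypothetical names). geom_path() connects observations in row order, so the data should be sorted by time, and direct year labels restore the time dimension:

```r
library(ggplot2)

# Hypothetical values in wide format: one row per year, sorted by year
wide <- data.frame(
  year = 2004:2008,
  a = c(2.8, 2.9, 3.0, 3.4, 3.7),
  b = c(44, 47, 51, 58, 63)
)

p <- ggplot(wide, aes(x = a, y = b)) +
  geom_path(color = "gray50") +                          # connects points in row (time) order
  geom_point() +
  geom_text(aes(label = year), vjust = -0.8, size = 3)   # label each point with its year

p
```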

The problem with pie charts . . .

. . . is actually a problem with humans – we’re not so great at comparing angles. We’re bad at comparing angles within a single pie chart if they’re all similar:

The problem with pie charts . . .

. . . is actually a problem with humans – we’re not so great at comparing angles. And we’re even worse at comparing angles across multiple pie charts:

Sometimes, pie charts can be a good option

ABC Enterprise Sales. Source: How to Use Charts and Graphs Effectively, by MindTools

IF you decide a pie chart is the right option, consider:

are the main takeaways clear (e.g. proportions different enough)?
avoiding lots of wedges
aggregating if there are many tiny ones
emphasizing most important wedge
labeling directly on the chart
comparing to a bar chart version to see which is a better version

Pie chart alternative: treemap

As an alternative to a pie chart, consider treemaps. Treemaps display hierarchical data as a set of nested rectangles – simpler versions can be used to display parts of a whole using rectangles (which are easier for us to estimate than angles).

Source: From Data to Viz

Source: {treemapify} pkgdown site
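A minimal parts-of-a-whole treemap might be sketched with {treemapify} as follows, assuming the package is installed and using made-up shares (`parts` is hypothetical):

```r
library(ggplot2)
library(treemapify)

# Hypothetical parts-of-a-whole data, for illustration only
parts <- data.frame(
  category = c("A", "B", "C", "D"),
  share = c(45, 30, 15, 10)
)

p <- ggplot(parts, aes(area = share, fill = category, label = category)) +
  geom_treemap() +                                         # rectangle area encodes share
  geom_treemap_text(colour = "white", place = "centre") +  # label rectangles directly
  theme(legend.position = "none")                          # direct labels make the legend redundant

p
```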

3D charts distort perspective

Occlusion: When we see one object occlude (aka obstruct) another on a 2D surface, our brain perceives the object being hidden as farther away:

3D charts distort perspective

Perspective distortion: When we view objects in 3D, the objects farther away appear smaller, but our brain perceives them to be of larger size than in the picture:

Avoid gratuitous 3D

Consider how the gray and blue areas visually compare in the 3D version. What about gray and orange? Now, how do your interpretations change when inspecting the 2D version?

The pie chart on the right is an example of using 3D purely for decorative purposes. Here, the third dimension doesn’t actually convey any additional data. Claus Wilke calls this gratuitous 3D, and you should always avoid it.

Avoid 3D position scales

A plot with three genuine position scales (x, y, and z) to represent mtcars data (viewed from four different perspectives):

Alternative (a) to 3D position scales

If we primarily care about fuel efficiency as the response variable, plot it twice (once against displacement and once against power):
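One possible way to build this pair of plots, sketched with base R reshaping and facets (the layout and labels are assumptions, not necessarily how the original figure was made):

```r
library(ggplot2)

# Stack the two predictors into long format so each gets its own panel
long <- rbind(
  data.frame(mpg = mtcars$mpg, predictor = "displacement (cu. in.)", value = mtcars$disp),
  data.frame(mpg = mtcars$mpg, predictor = "power (hp)", value = mtcars$hp)
)

p <- ggplot(long, aes(x = value, y = mpg)) +
  geom_point() +
  facet_wrap(~predictor, scales = "free_x", strip.position = "bottom") +  # strips act as x-axis titles
  labs(x = NULL, y = "fuel efficiency (mpg)") +
  theme(strip.placement = "outside", strip.background = element_blank())

p
```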

Alternative (b) to 3D position scales

If we are more interested in how displacement and power relate to each other, with fuel efficiency as a secondary variable of interest, create a bubble chart (plot power vs. displacement and map fuel efficiency onto the size of the dots). Be mindful that three variables (even in a 2D space) are still challenging for readers to quickly comprehend.
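A bubble chart along these lines could be sketched as follows (the size range and transparency are arbitrary choices):

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = disp, y = hp, size = mpg)) +
  geom_point(alpha = 0.6) +                                         # transparency helps with overlap
  scale_size_area(max_size = 6, name = "fuel efficiency (mpg)") +   # point area proportional to mpg
  labs(x = "displacement (cu. in.)", y = "power (hp)")

p
```

scale_size_area() is used so that point area, rather than radius, scales with the value.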

Are there ever opportunities to bend / break the rules & guidelines for creating “good” data viz?

Breaking the rules is sometimes okay

Data visualization is both a science and an art. Following these rules / best practices can help us avoid common pitfalls and objectively difficult-to-interpret data visualizations.

However, there are arguments for bending (or breaking) the rules every now and again. Consider the following posts:

Why you sometimes need to break the rules in data viz, by Rosamund Pearce
Master the rules - then break them, by Dieuwertje van Dijk
Does Data Visualization Have Rules? Or Is It All Just “It Depends”?, by Nick Desbarats

Breaking the rules is sometimes okay

Award-winning data visualization by Simon Scarr (left), and a copy / remake of that visualization which follows the rules, created by Andy Cotgreave (right).

Image & caption source: Master the rules - then break them

Let’s consider some example data visualizations together

CO₂ in conference rooms

Clearing the Air, by Christopher Ingraham, writing for The Washington Post

Take some time to discuss the following:

where are your eyes drawn first, second, etc.?
what are the main messages / takeaways?
where has the author chosen to simplify this visualization (i.e. reduce extraneous elements)? does it make it easier / more challenging to interpret?
what would you change about this visualization?

Annotations adapted from @chezVoila

Palmer penguin classification

Perfectly Proportional Penguins, by Cara Thompson as part of TidyTuesday (code)

Take some time to discuss the following:

where are your eyes drawn first, second, etc.?
what are the main messages / takeaways?
where has the author chosen to simplify this visualization (i.e. reduce extraneous elements)? does it make it easier / more challenging to interpret?
what would you change about this visualization?

Glimmers of hope in large carnivore recovery

Fig. 3, by Ingeman et al. 2022: Glimmers of hope and critical cases. Distribution of large carnivore species across categories of current IUCN status (x-axis) and population trend (y-axis). Improvements in status are indicated by gold and declines by blue, with bubble size indicating the number of status category changes. The majority of species have not undergone any changes in status (shown in light gray). Note: No change in status may indicate lack of recent assessment, insufficient data, or, in the case of species designated Least Concern, effective conservation efforts.

Take some time to discuss the following:

where are your eyes drawn first, second, etc.?
what are the main messages / takeaways?
where has the author chosen to simplify this visualization (i.e. reduce extraneous elements)? does it make it easier / more challenging to interpret?
what would you change about this visualization?

Take a Break

~ This is the end of Lesson 1 (of 3) ~
