Artwork by Allison Horst
EDS 240: Discussion 1
Data Wrangling
Week 1 | January 7th, 2024
What do we mean by “data wrangling”?
“Data wrangling, sometimes referred to as data munging, is the process of transforming and mapping data from one ‘raw’ data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics.”
Wrangling includes (but is not limited to):
Wrangling is a critical first step in building any sort of data visualization!
You may have heard something like, “Data scientists spend 80% of their time preparing their data for analysis and / or visualization.” And while that may not be totally accurate for all data scientists or all projects, you will spend lots of time wrestling with data.
{ggplot2} plays best with tidy data
Artwork by Allison Horst
Reminder: tidy data is not a {ggplot2}-specific concept. It’s a broadly standardized way of organizing data.
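To make the idea of tidy data concrete, here is a minimal sketch of reshaping a small "wide" table into tidy (long) form with `tidyr::pivot_longer()`. The data frame, its column names, and its values are made up for illustration; they are not from the course data.

```r
library(tidyr)

# Hypothetical "wide" data: the year variable is stored in the column names,
# so this table is not tidy
wide <- data.frame(
  site = c("A", "B"),
  `2021` = c(10, 20),
  `2022` = c(12, 18),
  check.names = FALSE
)

# Tidy ("long") form: each variable is a column, each observation is a row
long <- pivot_longer(
  wide,
  cols = c(`2021`, `2022`),
  names_to = "year",
  values_to = "value"
)

long
```

In the long version, `year` and `value` are ordinary columns, which is the shape that {ggplot2} aesthetics (e.g. `aes(x = year, y = value)`) expect.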
The {tidyverse} provides lots of helpful tools
The data science workflow, as described by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund in R for Data Science (2e), with added {tidyverse} packages as they fit within this workflow.
Note that a number of non-tidyverse packages are also incredibly helpful (e.g. {janitor}, {naniar})!
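As a rough sketch of what these two packages can offer, the example below uses `janitor::clean_names()` to standardize column names and `naniar::miss_var_summary()` to count missing values per column. The data frame, its column names, and its values are hypothetical and only meant to resemble a messy download.

```r
library(janitor)
library(naniar)

# Hypothetical messy table; column names and values are made up for illustration
messy <- data.frame(
  `Job Start Date` = c("2023-01-05", "2023-02-10", NA),
  `TotalBaseWaterVolume` = c(120000, NA, 95000),
  check.names = FALSE
)

# clean_names() standardizes column names (snake_case by default)
cleaned <- clean_names(messy)
names(cleaned)

# miss_var_summary() reports the number and percent of missing values per column
missing_summary <- miss_var_summary(cleaned)
missing_summary
```

Running a check like this early on can help you decide which columns need attention before you start plotting.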
Let’s wrangle some fracking data
Since launching in 2011, FracFocus has become the largest registry of hydraulic fracturing chemical disclosures in the US. The database, available to explore online and download in bulk, contains 210,000+ such disclosures from fracking operators; it details the location, timing, and water volume of each fracking job, plus the names and amounts of chemicals used. The project is managed by the Ground Water Protection Council, “a nonprofit 501(c)6 organization whose members consist of state ground water regulatory agencies”. As seen in: The latest installment of the New York Times’ Uncharted Water series.
-Jeremy Singer-Vine on Data is Plural (2023.09.27 edition)
Interested in reading more about fracking? Check out this communications piece from USGS.
Download fracking data from Google Drive!
You should already have downloaded these data from Google Drive.
I happened to snag these data back in November 2023 when they were still quite messy. Since then, it seems that FracFocus has done a bit more pre-processing of these data, meaning the data you download from their online portal are already a whole lot cleaner. This is great(!), but it also defeats the purpose of this exercise.
Open up the Week 1 Discussion: Exercise for instructions / next steps.