Background
By now, you may have heard / read something like, “Data scientists spend 80% of their time preparing their data for analysis and / or visualization.” And while that may not be totally accurate for all data scientists or all projects, you will spend lots of time wrestling with data. You’ll spend this week’s discussion cleaning up a messy data set on hydraulic fracturing (aka fracking), with the goal of (re)familiarizing yourselves with some commonly-used tidyverse functions.
This week’s data comes courtesy of Jeremy Singer-Vine’s Data is Plural weekly newsletter of useful / curious data sets (the 2023.09.27 edition). Singer-Vine’s description:
Since launching in 2011, FracFocus has become the largest registry of hydraulic fracturing chemical disclosures in the US. The database, available to explore online and download in bulk, contains 210,000+ such disclosures from fracking operators; it details the location, timing, and water volume of each fracking job, plus the names and amounts of chemicals used. The project is managed by the Ground Water Protection Council, “a nonprofit 501(c)6 organization whose members consist of state ground water regulatory agencies”. As seen in: The latest installment of the New York Times’ Uncharted Water series.
Interested in reading more about fracking? Check out this communications piece from USGS to start.
Exercise
It’s up to you to organize your own week1-discussion.qmd
file (i.e. there is no template). You may (should) discuss and work through today’s exercise with a partner (or two!).
Your goal is to transform our raw fracking data into a cleaned / wrangled data frame that looks like this (only the first 6 rows are shown here):
Load the
{tidyverse}
,{janitor}
, and{usdata}
, read inregistryupload_1.csv
, and perform some basic data exploration. Use the{here}
package to read in your data and check out things like the dimensions, structure, types of variables, unique observations, etc. Read through the metadata, which accompanied ourregistryupload_1.csv
file when downloaded from FracFocus.Consider what data cleaning you might need to perform. Discuss any curious data formatting (e.g. data types, inconsistent observation names, missing values, etc.) that you discovered during your data exploration step, above. What might you need to address in your next data wrangling steps?
Take a moment to share some of your findings with the rest of the class!
- Clean / wrangle the data! Using the packages loaded in step 1, perform the following operations (NOTE: Each step below corresponds to a single line of code):
- convert all column names from CamelCase to snake_case (for ease of readability and to adhere to the Tidyverse style guide recommendation)
- remove
"AM"
from thejob_start_date
column observations - convert observations in the
job_start_date
column from character strings to datetime objects and save them to a new column nameddatetime_start
- add a column named
year
that contains only the year fromdatetime_start
- keep only the necessary columns (
datetime_start
,year
,state_name
,well_name
,total_base_water_volume
) - convert all observations in the
state_name
column to Title Case - remove any rows where the state name ends in
?
- convert any state abbreviations to full names in the
state_name
column (Hint: use a combination of themutate()
,ifelse()
&usdata::abbr2state()
functions) - correct any misspelled state names in the
state_name
column (Note: figure out how to update the names of a few misspelled states, the feel free to copy the complete code for this step from the expandable section, below) - remove any rows that do not have a true US state name in the
state_name
column - add a column named
state_abb
with just US state abbreviations - move the
state_abb
column so that it sits immediately after thestate_name
column - rename
total_base_water_volume
astotal_base_water_volume_gal
for clarity - remove any observations that don’t have a measurement for
total_base_water_volume_gal
(i.e. NA) - reorder rows from earliest to most recent
datetime_start
values
mutate(state_name = case_when(
state_name == "Colordao" ~ "Colorado",
state_name == "Loiusiana" ~ "Louisiana",
state_name == "Louisianna" ~ "Louisiana",
state_name == "Lousiana" ~ "Louisiana",
state_name == "New Mexcio" ~ "New Mexico",
state_name == "Norh Dakota" ~ "North Dakota",
state_name == "Norht Dakota" ~ "North Dakota",
state_name == "North Dakota" ~ "North Dakota",
state_name == "North Dakata" ~ "North Dakota",
state_name == "North Dakotta" ~ "North Dakota",
state_name == "Noth Dakota" ~ "North Dakota",
state_name == "Pennslvania" ~ "Pennsylvania",
state_name == "Pennsylavania" ~ "Pennsylvania",
state_name == "Pennsylvanya" ~ "Pennsylvania",
state_name == "Penssylvania" ~ "Pennsylvania",
state_name == "Texasa" ~ "Texas",
state_name == "Texs" ~ "Texas",
state_name == "West Viginia" ~ "West Virginia",
state_name == "Wyominng" ~ "Wyoming",
TRUE ~ state_name # copy over rest of state names from as-is
))