It’s up to you to organize your own week1-discussion.qmd
file (i.e. there is no template). You may (should) discuss and work through today’s exercise with a partner (or two!).
Your goal is to transform our raw fracking data into a cleaned / wrangled data frame that looks like this (only the first 6 rows are shown here):
Load the
{tidyverse}
,{janitor}
, and{usdata}
, read inregistryupload_1.csv
, and perform some basic data exploration. Use the{here}
package to read in your data and check out things like the dimensions, structure, types of variables, unique observations, etc. Read through the metadata, which accompanied ourregistryupload_1.csv
file when downloaded from FracFocus.Consider what data cleaning you might need to perform. Discuss any curious data formatting (e.g. data types, inconsistent observation names, missing values, etc.) that you discovered during your data exploration step, above. What might you need to address in your next data wrangling steps?
Take a moment to share some of your findings with the rest of the class!
- Clean / wrangle the data! Using the packages loaded in step 1, perform the following operations (NOTE: Each step below corresponds to a single line of code):
- convert all column names from CamelCase to snake_case (for ease of readability and to adhere to the Tidyverse style guide recommendation)
- remove
"AM"
from thejob_start_date
column observations - convert observations in the
job_start_date
column from character strings to datetime objects and save them to a new column nameddatetime_start
- add a column named
year
that contains only the year fromdatetime_start
- keep only the necessary columns (
datetime_start
,year
,state_name
,well_name
,total_base_water_volume
) - convert all observations in the
state_name
column to Title Case - remove any rows where the state name ends in
?
- convert any state abbreviations to full names in the
state_name
column (Hint: use a combination of themutate()
,ifelse()
&usdata::abbr2state()
functions) - correct any misspelled state names in the
state_name
column (Note: figure out how to update the names of a few misspelled states, the feel free to copy the complete code for this step from the expandable section, below) - remove any rows that do not have a true US state name in the
state_name
column - add a column named
state_abb
with just US state abbreviations - move the
state_abb
column so that it sits immediately after thestate_name
column - rename
total_base_water_volume
astotal_base_water_volume_gal
for clarity - remove any observations that don’t have a measurement for
total_base_water_volume_gal
(i.e. NA) - reorder rows from earliest to most recent
datetime_start
values
mutate(state_name = case_when(
state_name == "Colordao" ~ "Colorado",
state_name == "Loiusiana" ~ "Louisiana",
state_name == "Louisianna" ~ "Louisiana",
state_name == "Lousiana" ~ "Louisiana",
state_name == "New Mexcio" ~ "New Mexico",
state_name == "Norh Dakota" ~ "North Dakota",
state_name == "Norht Dakota" ~ "North Dakota",
state_name == "North Dakota" ~ "North Dakota",
state_name == "North Dakata" ~ "North Dakota",
state_name == "North Dakotta" ~ "North Dakota",
state_name == "Noth Dakota" ~ "North Dakota",
state_name == "Pennslvania" ~ "Pennsylvania",
state_name == "Pennsylavania" ~ "Pennsylvania",
state_name == "Pennsylvanya" ~ "Pennsylvania",
state_name == "Penssylvania" ~ "Pennsylvania",
state_name == "Texasa" ~ "Texas",
state_name == "Texs" ~ "Texas",
state_name == "West Viginia" ~ "West Virginia",
state_name == "Wyominng" ~ "Wyoming",
TRUE ~ state_name # copy over rest of state names from as-is
))