Week 1 Discussion: Exercise

Note

It’s up to you to organize your own week1-discussion.qmd file (i.e. there is no template). You may (should) discuss and work through today’s exercise with a partner (or two!).

Your goal is to transform the raw fracking data into a wrangled data frame that looks like this (only the first 6 rows are shown here):

Load the {tidyverse}, {janitor}, and {usdata} packages, read in registryupload_1.csv, and perform some basic data exploration. Use the {here} package to read in your data and check out things like the dimensions, structure, types of variables, unique observations, etc. Read through the metadata, which accompanied our registryupload_1.csv file when downloaded from FracFocus.
Consider what data cleaning you might need to perform. Discuss any curious data formatting (e.g. data types, inconsistent observation names, missing values, etc.) that you discovered during your data exploration step, above. What might you need to address in your next data wrangling steps?

Pause for class discussion

Take a moment to share some of your findings with the rest of the class.

Clean / wrangle the data! Using the packages loaded in step 1, perform the following operations (NOTE: Each step below corresponds to a single line of code):
1. convert all column names from CamelCase to snake_case (for ease of readability and to adhere to the Tidyverse style guide recommendation)
2. remove "AM" from the job_start_date column observations
3. convert observations in the job_start_date column from character strings to datetime objects and save them to a new column named datetime_start
4. add a column named year that contains only the year from datetime_start
5. keep only the necessary columns (datetime_start, year, state_name, well_name, total_base_water_volume)
6. convert all observations in the state_name column to Title Case
7. remove any rows where the state name ends in ?
8. convert any state abbreviations to full names in the state_name column (Hint: use a combination of the mutate(), ifelse() & usdata::abbr2state() functions)
9. correct any misspelled state names in the state_name column (Note: figure out how to update the names of a few misspelled states, the feel free to copy the complete code for this step from the expandable section, below)
10. remove any rows that do not have a true US state name in the state_name column
11. add a column named state_abb with just US state abbreviations
12. move the state_abb column so that it sits immediately after the state_name column
13. rename total_base_water_volume as total_base_water_volume_gal for clarity
14. remove any observations that don’t have a measurement for total_base_water_volume_gal (i.e. NA)
15. reorder rows from earliest to most recent datetime_start values

Expand for code to fix misspelled state names

mutate(state_name = case_when(
    state_name == "Colordao" ~ "Colorado",
    state_name == "Loiusiana" ~ "Louisiana",
    state_name == "Louisianna" ~ "Louisiana",
    state_name == "Lousiana" ~ "Louisiana",
    state_name == "New Mexcio" ~ "New Mexico",
    state_name == "Norh Dakota" ~ "North Dakota",
    state_name == "Norht Dakota" ~ "North Dakota",
    state_name == "North  Dakota" ~ "North Dakota",
    state_name == "North Dakata" ~ "North Dakota",
    state_name == "North Dakotta" ~ "North Dakota",
    state_name == "Noth Dakota" ~ "North Dakota",
    state_name == "Pennslvania" ~ "Pennsylvania",
    state_name == "Pennsylavania" ~ "Pennsylvania",
    state_name == "Pennsylvanya" ~ "Pennsylvania",
    state_name == "Penssylvania" ~ "Pennsylvania",
    state_name == "Texasa" ~ "Texas",
    state_name == "Texs" ~ "Texas", 
    state_name == "West Viginia" ~ "West Virginia",
    state_name == "Wyominng" ~ "Wyoming", 
    TRUE ~ state_name # copy over rest of state names from as-is
  ))