# knitr options
knitr::opts_chunk$set(
  collapse = TRUE,
  warning = FALSE,
  message = FALSE,
  echo = TRUE,
  comment = "#>"
)

Introduction

Welcome to the DASH R package! This vignette, the first of several in the package, describes how to import and quality control (QC) on-the-ground (hereafter, OTG) habitat data collected using the DASH protocol. The initial data collection forms used to record the OTG data were generated using ArcGIS Survey123, and we focus on importing data from those forms. After data import, we describe how functions included in the DASH R package can be used to QC the data and identify potential errors, including errors that may have occurred in the field. Finally, we describe initial data cleaning and joining (i.e., joining channel-unit-scale information to the wood, undercut, etc. information collected for that unit) in preparation for attaching the OTG data to stream centerline data to make it spatial. Later, the spatial OTG data can be connected to drone-generated orthomosaics.

We'll start by describing the import procedure for OTG data; we'll then go into QC of that data and how to resolve some of the identified errors. Begin by loading the necessary libraries (R packages) needed for this vignette and analysis:

# load necessary libraries
library(DASH)
library(tibble)
library(janitor)
library(dplyr)
library(magrittr)
library(readr)   # used below to write QC results out with write_csv()

Data Import

Each DASH OTG survey conducted for a site using the ArcGIS Survey123 data collection forms typically consists of 7 comma-delimited (.csv) files: surveyPoint_0.csv (survey and site information), CU_1.csv (channel unit data), Wood_2.csv (wood data), Jam_3.csv (jam data), Undercut_4.csv (undercut data), Discharge_5.csv (discharge data), and DischargeMeasurements_6.csv (discharge measurement data).

Each of the .csv files can be joined or related to the others using the GlobalID and ParentGlobalID columns in each. For example, the ParentGlobalID in CU_1.csv can be used to relate it back to a GlobalID in surveyPoint_0.csv, whereas the GlobalID in CU_1.csv is a unique identifier for each channel unit. Similarly, the ParentGlobalID in Wood_2.csv can be used to join it back to a GlobalID in CU_1.csv. Note that the surveyPoint_0.csv file does not contain a ParentGlobalID column because it does not have a "parent" (i.e., it is the parent).
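To make those relationships concrete, here is a minimal sketch of joining wood records back to their parent channel units. It assumes the data have already been imported into the otg_raw list of tibbles created later in this vignette, and cu_wood is just a hypothetical object name:

# sketch: relate wood records back to their parent channel units
# (assumes otg_raw has been created as shown later in this vignette)
cu_wood = otg_raw$wood %>%
  left_join(otg_raw$cu,
            by = c("ParentGlobalID" = "GlobalID"),
            suffix = c("_wood", "_cu"))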

To start, set otg_path to the path of the directory containing the ArcGIS Survey123 folders of interest. otg_path should contain only folders (no files), and each of those folders should contain only the files listed above (and no folders).

# As an example, if the Survey123 site folders were in a "/1_formatted_csvs/" folder on your Desktop
otg_path = "C:/Users/username/Desktop/1_formatted_csvs/"

# For internal Biomark folks, set nas_prefix based on your operating system and then set otg_path to the NAS
if(.Platform$OS.type != 'unix') {
  nas_prefix = "S:"
} else {
  nas_prefix = "~/../../Volumes/ABS"
}

# a theoretical path on the NAS
otg_path = paste0(nas_prefix, "/data/habitat/DASH/OTG/1_formatted_csvs/")
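Before importing anything, a quick sanity check of that directory structure can save headaches. A minimal base R check might look like the following:

# list the Survey123 site folders within otg_path (one folder per site)
list.dirs(otg_path, full.names = FALSE, recursive = FALSE)

# any stray files sitting directly in otg_path? (this should return character(0))
setdiff(list.files(otg_path),
        list.dirs(otg_path, full.names = FALSE, recursive = FALSE))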

read_otg_csv(): Reading in One Type of OTG Data at a Time

The DASH R package includes a function read_otg_csv() which can be used to import one type (otg_type) of OTG data at a time. read_otg_csv() includes two arguments, path and otg_type. Here, we provide examples of how read_otg_csv() can be used to import channel unit (CU) or wood data to create two objects, otg_raw_cu and otg_raw_wood, respectively. In our case, we'll set path = otg_path which we created above, and otg_type is set using the name of the files containing the OTG data of interest.

read_otg_csv() performs some QC during the data import process. First, read_otg_csv() checks that the column names and specifications in the file being imported match those expected, as defined in get_otg_col_specs(). If the column names or specifications do not match, read_otg_csv() will provide a warning saying so, but will attempt to proceed. The function also performs one additional check, ensuring that the file being imported contains the expected number of columns; if not, it will provide a "fatal error" and stop the import process. Throughout, you will see which file is being imported and whether the import was successful. Finally, read_otg_csv() will provide a message if no records are found in a file, which can be totally fine, for example, if a site contained no woody debris, jams, undercuts, etc. In this case, read_otg_csv() will just provide a friendly message and proceed.

# read in just one type of OTG data at a time
otg_raw_cu = read_otg_csv(path = otg_path,
                          otg_type = "CU_1.csv")

otg_raw_wood = read_otg_csv(path = otg_path,
                            otg_type = "Wood_2.csv")

During the initial data import process, you may identify files that are not formatted correctly for import, and read_otg_csv() will, rightly so, "choke" during import and indicate so. For example, a .csv file may have inadvertently been saved or exported as tab-delimited, or personnel may have unknowingly added or removed a column. For this reason, we suggest storing all of the Survey123 site folders in two separate directories. I have been storing two versions, saved separately in folders "/0_raw_csvs/" and "/1_formatted_csvs/". I keep the "raw", untouched data in "/0_raw_csvs/", which I never change. The "/1_formatted_csvs/" folder is where I make the formatting changes identified during the import process. It is good practice to always archive the rawest form of your data, never changed after data collection; hence why we import data from the "/1_formatted_csvs/" folder.
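As a sketch of that two-directory setup (the paths here are hypothetical), the working copy can be created once from the raw data, and only the copy is ever edited:

# hypothetical paths for the raw and formatted (working) copies
raw_path = "C:/Users/username/Desktop/0_raw_csvs/"
fmt_path = "C:/Users/username/Desktop/1_formatted_csvs/"

# create the working directory and copy the untouched site folders into it
dir.create(fmt_path, showWarnings = FALSE)
file.copy(from = list.dirs(raw_path, recursive = FALSE),
          to = fmt_path,
          recursive = TRUE)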

Please note that at any time you may access the help menu and documentation for the package or functions using the following:

# for the package
help(package = "DASH")

# for a function
?read_otg_csv

read_otg_csv_wrapper(): Read in All OTG Data At Once

It would be a bit tedious to import each of our 7 OTG data types (otg_type) separately and later join them together. To remedy this, we've included the function read_otg_csv_wrapper() in the DASH R package, which can be used to import some or all of the OTG data types at once. The result is a list of tibbles (i.e., simple data frames), each containing data for one otg_type. Rather than the argument otg_type, read_otg_csv_wrapper() has the argument otg_type_list, where the user can provide a vector of OTG data types to import.

Let's instead use read_otg_csv_wrapper() to import just our channel unit and wood data, similar to above, except now creating a list of tibbles, rather than two separate objects.

# read in some (channel unit and wood) data types at once using read_otg_csv_wrapper()
otg_raw = read_otg_csv_wrapper(path = otg_path,
                               otg_type_list = c("CU_1.csv",
                                                 "Wood_2.csv"))

Or better yet, let's read in all OTG data types at once, which, by the way, is the default setting for read_otg_csv_wrapper(). Note that the function includes an argument otg_type_names which can be used to rename each of the tibbles during import, which is much tidier than using the file names to name each tibble (as we did above without otg_type_names). If otg_type_names is provided by the user, it must be the same length as otg_type_list.

# "loop over" all data types using read_otg_csv_wrapper()
otg_raw = read_otg_csv_wrapper(path = otg_path,
                               otg_type_list = c("surveyPoint_0.csv",
                                                 "CU_1.csv",
                                                 "Wood_2.csv",
                                                 "Jam_3.csv",
                                                 "Undercut_4.csv",
                                                 "Discharge_5.csv",
                                                 "DischargeMeasurements_6.csv"),
                               otg_type_names = c("survey",
                                                  "cu",
                                                  "wood",
                                                  "jam",
                                                  "undercut",
                                                  "discharge",
                                                  "discharge_measurements"))

Congratulations! You now (hopefully) have successfully imported your ArcGIS Survey123 OTG data using the read_otg_csv() or read_otg_csv_wrapper() (preferred) functions. Either that, or you're muttering obscenities under your breath, in which case you're welcome to either submit an Issue on the DASH website or contact the author at the e-mail address provided above.

Example Data

We've included an example "raw" OTG dataset with the package called otg_raw, which was imported using read_otg_csv_wrapper(). otg_raw includes data for just a handful of sites; nrow(otg_raw$survey) tells you exactly how many. Let's explore otg_raw a bit; you can use $ to extract elements of a list.

# a summary of the cu data (just the first 10 columns; exclude [,1:10] to see all)
summary(otg_raw$cu[,1:10])

# look at first 5 records of wood data using head()
head(otg_raw$wood)
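You might also find it handy to tabulate the channel unit data, for example by channel unit type; this is an optional check using janitor::tabyl():

# how many channel units of each type were recorded?
tabyl(otg_raw$cu, `Channel Unit Type`)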

Excellent, now let's save our data to an (imaginary) directory for later use. Next, we'll perform quality control (QC) on that OTG data.

# save to the otg_path
save(otg_raw,
     file = paste0(otg_path, "otg_raw.rda"))

Data QC

It's now time to perform QC on the imported raw data to identify errors that may have occurred in the field during data collection. The goal is to identify as many potential errors in the data as possible (though perhaps not all!) using functions provided by the DASH R package, and to save or export a list of the potential errors. The identified errors can then be reviewed by field personnel, or someone closely familiar with the data, and remedied where possible.

Please note that during the QC process or while working with the data, you may identify potential errors in the data not identified by functions provided in DASH. In this case, please take the time to submit an Issue on the site and we will try to add the functionality to "flag" or identify those potential issues in the future. We are always looking to improve the QC process.

Let's start by loading the otg_raw object that we previously saved, using the load() function:

load(paste0(otg_path, "otg_raw.rda"))

Recognize that you don't need to load otg_raw if it's already in your environment. But it is good practice to save some of your data objects at particular "checkpoints" during an analysis (e.g., after finishing the data import process). load() is then useful in case you had to step away for a while (i.e., close your script, analysis, or project) and come back later.

To QC the OTG data, DASH includes 7 QC functions, one corresponding to each otg_type (e.g., qc_cu() for channel unit data, qc_wood() for wood data); the help file for each function describes the QC checks that it performs.

In addition to the above, each QC function also 1) identifies whether the column names in qc_df match those expected, as defined in get_otg_col_specs(), and 2) checks whether there are any NA (missing) values in important columns defined using the cols_to_check_nas argument (where applicable).

QC One otg_type at a Time

We can QC a single otg_type at a time using any of the above QC functions. As an example, let's use qc_cu() and qc_wood() to QC the otg_type = "CU_1.csv" and otg_type = "Wood_2.csv" data, respectively. The output from all of the QC functions has the same format and so can easily be joined, which we also demonstrate. Many of the QC functions include an argument cols_to_check_nas, which can be used to define columns where you want to flag NA values; this is useful for columns where you expect every record to have a value. Each QC function with the cols_to_check_nas argument has a default that will be used if the user does not define the argument.

# QC just the channel unit data using "qc_cu()", define cols_to_check_nas
init_qc_cu = qc_cu(qc_df = otg_raw$cu,
                   cols_to_check_nas = c("Channel Unit Type",
                                         "Channel Unit Number",
                                         "Channel Segment Number",
                                         "Maximum Depth (m)",
                                         "ParentGlobalID"))

# QC just the wood data using "qc_wood()", use default cols_to_check_nas
init_qc_wood = qc_wood(qc_df = otg_raw$wood)

# for example, bind the above results together
init_qc_some = init_qc_cu %>%
  add_column(source = "CU",
             .before = 1) %>%
  bind_rows(init_qc_wood %>%
              add_column(source = "Wood",
                         .before = 1))

qc_wrapper(): QC All OTG Data At Once

For the sake of convenience, we also include a qc_wrapper() function that "loops over" some or all of the 7 QC functions, similar to the read_otg_csv_wrapper() function we used earlier during data import. In addition, qc_wrapper() combines the results from each of the QCs into a single data frame.

Similar to above, let's QC just the otg_type = "CU_1.csv" and otg_type = "Wood_2.csv" data, except using the qc_wrapper() function. We also demonstrate how to use qc_wrapper() to QC every otg_type at once, saving the results to an object init_qc. The qc_wrapper() function provides a series of messages with periodic updates while performing the QC; the function includes the argument redirect_output, which if set to FALSE will write those messages to the console. However, if redirect_output = TRUE, those messages can instead be written to a file using the additional argument redirect_output_path.

# QC just channel unit and wood data
init_qc_some = qc_wrapper(survey_df = otg_raw$survey,
                          cu_df = otg_raw$cu,
                          wood_df = otg_raw$wood,
                          redirect_output = FALSE)

# QC all OTG data types at once
init_qc = qc_wrapper(survey_df = otg_raw$survey,
                     cu_df = otg_raw$cu,
                     wood_df = otg_raw$wood,
                     jam_df = otg_raw$jam,
                     undercut_df = otg_raw$undercut,
                     discharge_df = otg_raw$discharge,
                     discharge_meas_df = otg_raw$discharge_measurements,
                     redirect_output = FALSE)
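If you would rather capture those QC messages in a file (e.g., to archive alongside the data), the same call can redirect its output. The file name below is just an example, and it is worth confirming in ?qc_wrapper exactly what redirect_output_path expects (a file name versus a directory):

# write the QC messages to a file instead of the console (example file name)
init_qc = qc_wrapper(survey_df = otg_raw$survey,
                     cu_df = otg_raw$cu,
                     wood_df = otg_raw$wood,
                     jam_df = otg_raw$jam,
                     undercut_df = otg_raw$undercut,
                     discharge_df = otg_raw$discharge,
                     discharge_meas_df = otg_raw$discharge_measurements,
                     redirect_output = TRUE,
                     redirect_output_path = paste0(otg_path, "qc_messages.txt"))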

Save QC Results

At this point, it would be a good idea to write your initial QC results out to a file. I'd suggest writing the QC results to the "/1_formatted_csvs/" directory containing the data that was QC'd. Below, we provide options to either 1) add the results to your otg_raw object and re-save that object, or 2) write init_qc out to its own .csv file. If you want to get fancy, you can also append today's date using the format() function to record when the QC was performed.

# add QC results to otg_raw, and re-save
otg_raw$init_qc = init_qc
save(otg_raw,
     file = paste0(otg_path, "otg_raw.rda"))

# write QC results to .csv
write_csv(init_qc,
          paste0(otg_path, "init_qc.csv"))

# append today's date
write_csv(init_qc,
          paste0(otg_path,
                 "init_qc_",
                 format(Sys.Date(), format = "%Y%m%d"),
                 ".csv"))

Nice work! You just completed an initial QC of your OTG data and exported the results. Either that, or you're again muttering obscenities at the authors, at which point you're welcome to contact them at the e-mail addresses provided above or submit an Issue on the repo website. Let's move on to examining those QC results and, in some cases, resolving them.

Examine and Resolve QC Errors

At this point in the DASH workflow, someone very familiar with the data, perhaps a field technician or coordinator, should review the QC errors, attempt to resolve those where possible, and ideally, make notes about each error regarding how they can or cannot be resolved. Notes on how QC errors were resolved, or not, can be useful towards improving data validation (e.g., during field data collection) or quality control steps in the future.

Similar to above, it is generally best practice not to modify the "raw" data; in our case, do not make changes to the data in the "/1_formatted_csvs/" directory. Instead, it is a good idea to make another copy of your data, which we suggest placing in a "/2_qcd_csvs/" directory. This is where we will make changes to the data to address issues identified during the QC. Retaining the "/1_formatted_csvs/" data "as is" preserves the integrity of the raw data and allows us to fairly easily identify any changes made to the data during the QC process. In addition, by documenting the changes that have been made to the data, we can identify common errors and perhaps add data validation during the data collection process to prevent those same errors in the future.

Note: We also previously saved a version of the data in a "/0_raw_csvs/" directory, which is the most raw version of the data. The data in the "/0_raw_csvs/" and "/1_formatted_csvs/" directories differ only in that file formatting issues identified during the data import process were resolved in the "/1_formatted_csvs/" version; no data values have actually changed.
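As a sketch of that step (with hypothetical paths), the QC'd working copy can be created the same way as before, so that hand edits only ever touch the "/2_qcd_csvs/" version:

# hypothetical sketch: create the "/2_qcd_csvs/" working copy from the formatted data
qcd_path = paste0(nas_prefix, "/data/habitat/DASH/OTG/2_qcd_csvs/")
dir.create(qcd_path, showWarnings = FALSE)

# copy each formatted site folder; hand edits are then made only to this copy
file.copy(from = list.dirs(otg_path, recursive = FALSE),
          to = qcd_path,
          recursive = TRUE)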

Examine QC

Of course, a user could examine the initial QC results init_qc by opening and reviewing a .csv file, which we showed how to write above. However, you might also find it useful to explore the QC results a little further using R. The following section is entirely optional, but provides some examples and guidance about how init_qc could be examined using R.

We can perform an initial review of our QC results using the following:

# use janitor::tabyl() to examine the "source" of the errors
tabyl(init_qc,
      source)

# simply print out the results
init_qc %>%
  head(5) %>%
  print()
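It can also be helpful to tally the error messages themselves, to see which issues come up most often:

# tally the QC flags by error message to see which issues are most common
init_qc %>%
  tabyl(error_message) %>%
  arrange(desc(n))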

What follows is an example of how one could examine a particular type of error. In this case, let's look at the "CU" data and at those errors where error_message contains "Column Maximum Depth". We also join the source data (otg_raw$cu) back to the error messages to look a little closer. The function dplyr::select() can be used to choose the columns to include in the results.

# examine channel unit errors where the error_message contains "Column Maximum Depth"
init_qc %>%
  filter(source == "CU") %>%
  filter(grepl("Column Maximum Depth", error_message)) %>%
  left_join(otg_raw$cu) %>%
  select(source:error_message, `Channel Unit Type`, `Maximum Depth (m)`)

During our initial review of DASH data, we commonly identified the following two errors: 1) the ocular substrate estimates do not sum to 100, due to a math or recording error in the field, and/or 2) the sum of the fish cover estimates does not fall within an expected range (note: fish cover estimates can legitimately sum to greater than 100). We've included two functions within the package, rescale_values() and fix_fish_cover(), that can be used to resolve these common errors, respectively. We cover those functions in the Common Errors section below.

Examine the ocular substrate errors:

# ocular estimate errors where sum_ocular does not equal 100
init_qc %>%
  filter(source == "CU") %>%
  filter(grepl("Ocular estimates", error_message)) %>%
  left_join(otg_raw$cu) %>%
  select(source:GlobalID,
         `Channel Unit Type`,
         `Sand/Fines 2mm`:`Boulder 256mm`) %>%
  mutate(sum_ocular = rowSums(.[5:8], na.rm = T)) %>%
  head(5)

Examine the fish cover errors:

# the sum of fish cover values fall outside the expected range
init_qc %>%
  filter(source == "CU") %>%
  filter(grepl("Cover values", error_message)) %>%
  left_join(otg_raw$cu) %>%
  select(source:GlobalID,
         `Overhanging Cover`:`Total No Cover`) %>%
  mutate(sum_cover = rowSums(.[4:8], na.rm = T)) %>%
  head(5)

Here's one last example of exploring a somewhat common error. In this case, let's look at all errors in the undercut data filter(source == "Undercut"), join the source data left_join(otg_raw$undercut), and then reduce the number of columns included in the results select(path_nm:error_message, `Undercut Number`:`Width 75% (m)`). Note that in this case we need to place backticks around the variable (column) names "Undercut Number" and "Width 75% (m)" because they contain spaces or special characters (e.g., %, parentheses).

# undercut errors; note many of these are flagged because of strange "Location" values
init_qc %>%
  filter(source == "Undercut") %>%
  left_join(otg_raw$undercut) %>%
  select(path_nm:error_message,
         `Undercut Number`:`Width 75% (m)`)

Resolve QC Errors

Time to resolve the errors identified during the QC process. We discuss suggested routes for resolving both uncommon and common errors. The common errors we've identified to date relate to ocular substrate estimates not summing to 100 and fish cover estimates not summing to within an expected range.

Uncommon Errors

We suggest working on the uncommon errors first. Remember that as you review and work through the QC results, make any modifications to the .csv files within the "/2_qcd_csvs/" directory.
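If it helps to have a working list, one approach is to filter the initial QC results down to just the flags that we will not fix programmatically below. The grepl() patterns mirror those used earlier, and the output file name is just an example:

# pull out the QC flags other than the two "common" error types
uncommon_errors = init_qc %>%
  filter(!grepl("Ocular estimates", error_message),
         !grepl("Cover values", error_message))

# write them out (example file name) for review alongside the "/2_qcd_csvs/" data
write_csv(uncommon_errors,
          paste0(otg_path, "uncommon_errors.csv"))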

Common Errors

After resolving the uncommon errors to the best of your ability, let's resolve the common errors using a couple functions provided in the DASH package. We'll first want to import our data from the "/2_qcd_csvs/" directory that contains data that has been modified to address uncommon errors identified during the QC process.

First, let's set a new path qcd_path to that data and import that version of the OTG data.

# As an example, a theoretical path to the Biomark NAS
qcd_path = paste0(nas_prefix, "/data/habitat/DASH/OTG/2_qcd_csvs/")

# "loop over" all data types using read_otg_csv_wrapper()
otg_qcd = read_otg_csv_wrapper(path = qcd_path,
                               otg_type_list = c("surveyPoint_0.csv",
                                                 "CU_1.csv",
                                                 "Wood_2.csv",
                                                 "Jam_3.csv",
                                                 "Undercut_4.csv",
                                                 "Discharge_5.csv",
                                                 "DischargeMeasurements_6.csv"),
                               otg_type_names = c("survey",
                                                  "cu",
                                                  "wood",
                                                  "jam",
                                                  "undercut",
                                                  "discharge",
                                                  "discharge_measurements"))

Fix ocular substrate estimates

The package includes the function DASH::rescale_values(), which can be used to rescale a set of values to sum to a given sum_to = value. In the example below, we rescale any case where the values in col_names sum to between min_value = 90 and max_value = 110 so that they sum to sum_to = 100. Presumably, errors like this occur because a simple math or recording error was made in the field. Note that the defaults for min_value and max_value are actually 80 and 120. Also note that rescale_values() could theoretically be applied to any set of numeric columns in your OTG (or any other!) data.

# ocular substrate fixes
otg_qcd$cu = rescale_values(data_df = otg_qcd$cu,
                            col_names = c("Sand/Fines 2mm",
                                          "Gravel 2-64mm",
                                          "Cobble 64-256mm",
                                          "Boulder 256mm"),
                            min_value = 90,
                            max_value = 110,
                            sum_to = 100)
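As a quick, optional check that the rescaling behaved as expected, you can re-compute the ocular substrate sums and confirm that the previously flagged cases now sum to 100 (sum_ocular below is just a temporary column name):

# re-compute ocular substrate sums after rescaling
otg_qcd$cu %>%
  mutate(sum_ocular = rowSums(select(., `Sand/Fines 2mm`:`Boulder 256mm`), na.rm = TRUE)) %>%
  tabyl(sum_ocular)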

Fix fish cover estimates

We also include the DASH::fix_fish_cover() function, which is a more specialized case of rescale_values() for fish cover estimates. In this case, fix_fish_cover() will resolve cases where fish cover estimates were accidentally recorded as proportions, or where the fish cover estimates in the columns cover_cols sum to 90. Recall that fish cover estimates can sum to greater than 100, which is okay (e.g., canopy and aquatic vegetation may both cover a portion of a channel unit). Remember that you can always use ?fix_fish_cover to access the help menu for a function.

# fish cover fixes
otg_qcd$cu = fix_fish_cover(cu_df = otg_qcd$cu,
                            cover_cols = c("Overhanging Cover",
                                           "Aquatic Vegetation",
                                           "Woody Debris Cover",
                                           "Artificial Cover",
                                           "Total No Cover"))

We can now save our otg_qcd object, which contains both the changes we made by hand in the .csv files in the "/2_qcd_csvs/" directory and the changes we made using the rescale_values() and fix_fish_cover() functions. Note, however, that the changes made using these two functions are not reflected in your .csv files. If you would like assistance overwriting your .csvs to reflect these changes, please contact the vignette authors. Care should be taken when overwriting files.

save(otg_qcd,
     file = paste0(qcd_path, "otg_qcd.rda"))

Final QC

Now, it is probably good practice to re-run qc_wrapper() on otg_qcd to ensure you have resolved errors to the best of your ability.

# re-load otg_qcd, if needed
load(paste0(qcd_path, "otg_qcd.rda"))
# re-run qc_wrapper() on otg_qcd
qc_final = qc_wrapper(survey_df = otg_qcd$survey,
                      cu_df = otg_qcd$cu,
                      wood_df = otg_qcd$wood,
                      jam_df = otg_qcd$jam,
                      undercut_df = otg_qcd$undercut,
                      discharge_df = otg_qcd$discharge,
                      disch_meas_df = otg_qcd$discharge_measurements)
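As a last sanity check, it can be reassuring to compare the remaining flags against the initial QC results (this assumes init_qc is still in your environment):

# how many flags remain, by source?
tabyl(qc_final, source)

# compare the total number of flags before and after resolving errors
nrow(init_qc)
nrow(qc_final)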

And finally, let's save our "final" QC results in the directory with the QC'd data.

# write QC results to .csv
write_csv(qc_final,
          paste0(qcd_path, "qc_final.csv"))

# alternatively, it can be useful to save as a .rds, more on this later...
saveRDS(qc_final, file = paste0(qcd_path, "qc_final.rds"))

# one other option, append to otg_qcd list and overwrite
otg_qcd$qc_results = qc_final
save(otg_qcd,
     file = paste0(qcd_path, "otg_qcd.rda"))

END IMPORT AND QC VIGNETTE

Holy Guacamole! That deserves a beer!
