knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Introduction

The purpose of factcuratoR is to provide functions that help standardize and curate variety testing data sets supporting the WAVE project.

The goal of the WAVE program curation is to generate trial data and trial metadata that conform to the controlled vocabulary codebooks.

The codebooks specify the variables required for each file (each as a single column in a data frame, matrix, tibble, etc.) and the accepted values for each variable. See the Introduction to codebooks vignette for more information on the structure and formatting of the codebooks.
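To make the idea concrete, a codebook can be pictured as a small table with one row per variable. The sketch below is purely illustrative: the column names (variable, required, accepted_values) and their values are assumptions made for this example and do not reflect the actual layout of codebooks_all_db.csv.

```r
# Hypothetical, simplified codebook: one row per variable.
# Column names and values are illustrative assumptions only.
toy_codebook <- data.frame(
  variable = c("location", "year", "rep_temp"),
  required = c(TRUE, TRUE, FALSE),
  accepted_values = c("Aberdeen; Soda Springs", "2000 to 2030", "1 to 10"),
  stringsAsFactors = FALSE
)

# Variables that every file of this type must contain
required_vars <- toy_codebook$variable[toy_codebook$required]
print(required_vars)
```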

Overview of validation functions

The validation functions check that the data conform to the controlled vocabulary codebooks. There are functions to validate column names (validate_colnames()) and column contents (confront_data()).

Finally, standardize_cols_by_cb() enables curators to standardize the files (select and order columns) according to the standards established in the codebooks.
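The idea of selecting and ordering columns according to a codebook can be sketched in base R. This is a conceptual illustration only, not the implementation of standardize_cols_by_cb(); the vector codebook_order and the toy data are hypothetical.

```r
# Toy data with columns in arbitrary order plus one non-codebook column
toy_data <- data.frame(
  rep_temp = 1:2,
  variety = c("AAC Wildfire", "AAC Wildfire"),
  location = c("Aberdeen", "Soda Springs"),
  stringsAsFactors = FALSE
)

# Hypothetical column order taken from a codebook
codebook_order <- c("location", "year", "variety")

# Keep only the codebook columns that exist in the data, in codebook order
keep <- codebook_order[codebook_order %in% names(toy_data)]
standardized <- toy_data[, keep, drop = FALSE]
print(names(standardized))
```

In this sketch, rep_temp is dropped because it is not listed in the hypothetical codebook, and year is skipped because it is absent from the data.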

First, load factcuratoR and point to the folder containing the main codebook, which must be named codebooks_all_db.csv for the validation functions to find it.

library(factcuratoR)

rlang::check_installed("here")

codebook_folder <- here::here(
  "tests/testthat/test_controlled_vocab")

knitroutputfolder <- here::here("inst/extdata/intro_validation", "output")

Create some test data

test_data <- data.frame(location = c("Aberdeen", "Soda Springs", NA, "location_x"),
                        year = c(rep(2020, 3), NA),
                        variety = c("variety_1", "AAC Wildfire", "", NA),
                        rep_temp = 1:4,
                        sourcefile = "test")

Validate trial data

It may be easiest to assign the most cleaned-up version of the data to a new variable (e.g., df_validate below) so that the calls to the validation functions don't need to be updated every time a more cleaned-up version of the data is available.

df_validate <- test_data

Check column names

The function validate_colnames() compares the variable names in the data against those in the codebook. The argument codebook_name = "trial_data" indicates which codebook in codebooks_all_db.csv should be used for this step.

colname_valid <- validate_colnames(df_validate, 
                                   codebook_name = "trial_data", 
                                   db_folder = codebook_folder)

The main goal of validating column names is to get the value in the required column of row 3 to be zero. A value of zero indicates that all the required columns are present in the data and completely filled out. The second row in the summary highlights columns that are not in the codebook.
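The comparison behind a column-name check can be sketched with set operations in base R. This is only a conceptual illustration under assumed inputs, not the output format or implementation of validate_colnames(); the vector codebook_cols is hypothetical and, for simplicity, every codebook column is treated as required.

```r
# Toy data: missing "variety" and carrying an extra "rep_temp" column
toy_data <- data.frame(location = "Aberdeen", year = 2020, rep_temp = 1,
                       stringsAsFactors = FALSE)

# Hypothetical codebook columns, all assumed required in this sketch
codebook_cols <- c("location", "year", "variety")

# Required columns absent from the data (the count to drive to zero)
missing_required <- setdiff(codebook_cols, names(toy_data))

# Data columns that the codebook does not know about
not_in_codebook <- setdiff(names(toy_data), codebook_cols)

print(missing_required)
print(not_in_codebook)
```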

If a required column appears to be missing in the initial check, it may be present in the data set under a different name. This is often the case for many variables; collaborators each have their own names for common variables. When this happens, creating a file to rename columns en masse is the best solution. See facthelpeR for functions to help with renaming columns. Determining whether a variable is captured by the existing controlled vocabulary is a human decision based on your own knowledge and, when needed, consultation with other project participants. When in doubt, ask.
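Renaming many columns at once from a lookup table might look like the base R sketch below. The key table and its column names (old, new) are hypothetical, and this does not use or reproduce the facthelpeR functions mentioned above.

```r
# Toy data using a collaborator's own column names
toy_data <- data.frame(Loc = "Aberdeen", Yr = 2020, variety = "AAC Wildfire",
                       stringsAsFactors = FALSE)

# Hypothetical rename key; in practice this could be read from a CSV file
rename_key <- data.frame(
  old = c("Loc", "Yr"),
  new = c("location", "year"),
  stringsAsFactors = FALSE
)

# Replace only the names found in the key; leave all others untouched
idx <- match(names(toy_data), rename_key$old)
names(toy_data)[!is.na(idx)] <- rename_key$new[idx[!is.na(idx)]]
print(names(toy_data))
```

Keeping the key in a file rather than in code makes the renaming decisions easy to review with other project participants.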

If a required variable is truly missing, then you will need to find the data. Looking at annual reports is a good starting point, and you may have to eventually request this information from the collaborator if you cannot find it elsewhere.

Sometimes there are also new variables not captured by the existing vocabulary. In that case, we need to decide whether the variable should be added to the database. The answer is usually "yes", but the decision generally calls for a discussion with project participants. If the decision is to add a new variable, add it to the controlled vocabulary using the formatting described in the Introduction to codebooks vignette.

Full report of validating column names

knitr::kable(colname_valid)

To interact with the variable names that still need to be fixed

# dplyr supplies %>% and filter(), plus relocate() and select() used later
library(dplyr)
colname_valid_check <- colname_valid %>% 
  filter(comment == "not present in codebook: trial_data")

col_info <- find_col_info(df_validate, 
                       cols_check = colname_valid_check$colname_data, 
                       by_col = sourcefile)

knitr::kable(col_info)

Check column contents

The function confront_data() is a wrapper around validate::confront() that checks that column contents match controlled vocabularies or are in the accepted value range.

Note: Setting the argument blends = TRUE will also check for blends in the variety column.

colcontent_valid <- confront_data(df_validate, 
                                  df_type = "trial_data",
                                  db_folder = codebook_folder)

The summary output for validating column contents reports the number of columns that have fails (that is, a column does not match controlled vocabularies or is not within the accepted range) and NA values. A column will return error = TRUE if it is not present in the data.

The goal of validating column contents is to achieve zero fails for all columns and, for required columns, zero NA values (so that each observation has a value).
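Counting fails and NA values per column can be sketched in base R as follows. This is a stand-in for the idea only, not the implementation of confront_data() or validate::confront(); the accepted-value lists are assumptions made for the example, and this sketch counts an empty string as a fail, which may differ from how the package treats it.

```r
# Toy data mirroring the kinds of problems in the test data above
toy_data <- data.frame(
  location = c("Aberdeen", "Soda Springs", NA, "location_x"),
  variety = c("variety_1", "AAC Wildfire", "", NA),
  stringsAsFactors = FALSE
)

# Hypothetical controlled vocabularies for each column
accepted <- list(
  location = c("Aberdeen", "Soda Springs"),
  variety = c("AAC Wildfire")
)

# A fail is a non-missing value outside the vocabulary; NAs are counted separately
check_column <- function(x, vocab) {
  c(fails = sum(!is.na(x) & !(x %in% vocab)),
    nNA = sum(is.na(x)))
}

# One row per column, with its fail and NA counts
content_summary <- t(mapply(check_column, toy_data[names(accepted)], accepted))
print(content_summary)
```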

These problems can be resolved either by correcting or standardizing the contents of the raw data or by adding new controlled vocabulary to the codebooks. For example, if "variety_1" is a real variety name, then it should be added to the cultivar codebook.

Full report of validating column contents

colcontent_summary <- colcontent_valid[["summary"]]
knitr::kable(colcontent_summary)

To check for validation fails interactively

var <- c("variety")
# Use the same data frame that was passed to confront_data()
colcontent_violate <- 
  validate::violating(df_validate, colcontent_valid[[2]][var]) %>% 
  relocate(matches(var))

knitr::kable(colcontent_violate)

Validate metadata

The steps for validating the metadata are the same as for the trial data, except that the codebook argument is set to "trials_metadata" (codebook_name in validate_colnames() and df_type in confront_data()).

metadata_validate <- test_data

Check variable names

metadata_colname_valid <- 
  validate_colnames(
    metadata_validate, 
    "trials_metadata",
    db_folder = codebook_folder) %>%
    select(comment, colname_data, colname_codebook, required, col_num) 

knitr::kable(metadata_colname_valid)

Check variable values

metadata_colcontent_valid <- confront_data(metadata_validate, 
                                           df_type = "trials_metadata",
                                           db_folder = codebook_folder)

metadata_colcontent_summary <- metadata_colcontent_valid[["summary"]]

knitr::kable(metadata_colcontent_summary)

To check for any fails interactively

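metadata_var <- c("location")
# Use the metadata objects throughout (not test_data or var from the trial data section)
metadata_colcontent_violate <- 
  validate::violating(metadata_validate, metadata_colcontent_valid[[2]][metadata_var]) %>% 
  relocate(matches(metadata_var))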

knitr::kable(metadata_colcontent_violate)


IdahoAgStats/factcuratoR documentation built on Nov. 15, 2024, 11:11 a.m.