knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Introduction

facthelpeR is a set of functions that are used to read in and do a first pass at cleaning up data.

Load facthelpeR

library(facthelpeR)
library(here)

List the file structure of the data

auxiliary_files is a folder that contains helper files that are used in the curation process. Generally, the *_aux.csvs are manually edited to match controlled vocabularies. They live in a separate folder than the output so that they don't get written over accidentally.

main_folder <- "inst/extdata/OSU_2019_example"
data_folder <- here::here(main_folder, "data-raw")
auxiliary_files <- here::here(main_folder, "auxiliary_files") 
knitroutput_folder <- here::here(main_folder, "output")

Manual edits

If there are any mistakes in numeric values or one-off typos, sometimes it is faster to fix it in the raw files. Note any manual changes here in the .Rmd for reproducibility.

Read in raw data

Generate a list of all of the files and sheets

If the header is not in the first row (or spans multiple rows), write.csv(sheet_info, ...) and manually fill in the number of rows to skip in each sheet. This data.frame will be read into read_multsheets()

files_csv <- list.files(data_folder, recursive = TRUE, pattern = ".csv")

sheet_info_csv <- data.frame(filename = files_csv) %>% 
  filter(!str_detect(filename, "curation_files")) %>%
  mutate(list_names = basename(filename)) %>%
  mutate(list_names = str_remove(list_names, ".csv")) 

knitr::kable(sheet_info_csv)

How to handle sheets in .xls files

This section is just to demonstrate how to handle excel files with multiple sheets. The data won't be processed in the rest of this vignette.

files_xls <- list.files(data_folder, recursive = TRUE, pattern = ".xls")

sheet_info_xls <- list_sheetnames(data_folder, files_xls) 

knitr::kable(sheet_info_xls)

Use read.multsheets()

data_raw1 <- read.multsheets(data_folder, sheet_info_csv, na = c("NA"), col_names = TRUE)
data_raw2 <- imap(data_raw1, function(x, y){ x %>% mutate(sourcefile = y)})


knitr::kable(head(data_raw2[[1]]))
knitr::kable(head(data_raw2[[2]]))
#save(data_raw2,
 #    file = here::here(knitroutput_folder, "data_raw2.RData"))

Remove duplicate columns

Use facthelpeR::rm_dup_col() because it can detect and remove duplicates when: - column names are not exact, but column contents are - if the only difference between two column contents are that one has some NAs, then that column is removed and the other is retained

data_raw3 <- imap(data_raw2, function(x,y){
    print(y)
    rm_dup_col(x, FALSE) %>% 
    select(where(~!all(is.na(.))))
})

Curate column names:

Get column names

colnames <- summarize_variables(data_raw3, reg_ex = NULL)
colnames_summary <- colnames %>%
  group_by(colname) %>%
  summarize(across(everything(), ~ paste(na.omit(unique(.x)), collapse = "; ")))

write.csv(colnames_summary, here::here(knitroutput_folder, "colnames_summary.csv"), row.names = FALSE)

knitr::kable(colnames_summary)

Rename columns using auxiliary files

First, understand what each column means (description and units) by:

  1. checking through annual reports

  2. checking with Julia

  3. lastly, asking the PI

Then, fill out the auxiliary file with the appropriate controlled vocabulary.

If there isn't a corresponding controlled vocabulary, then need to:

  1. decide if the column can be captured by current controlled vocabulary. If so, standardize to the controlled vocabulary

  2. decide whether we need to keep the variable at all (can fill in auxiliary file with "rm_col" in the controlled_vocab column and rename.col() will remove them if rm_col argument = TRUE)

  3. decide whether a new controlled vocabulary should be added. If so, add to controlled vocabulary using the correct formatting, which is described in the README (I usually get Julia's opinion on this, especially before adding new vocabularies)

colnames_curation <- 
  read.csv(here::here(auxiliary_files, "colnames_summary_aux.csv")) %>%
  # If a column name still needs to be checked/confirmed, leave the controlled_vocab 
  # column blank in the "colnames_summary_aux.csv" file.  
  # Before the renaming step, just set the unknowns to the 
  # original name.
  mutate(controlled_vocab = 
           ifelse(controlled_vocab == "", colname, controlled_vocab)) 

knitr::kable(colnames_curation)

data_raw_rename1 <- rename_col(data_raw3, 
                              rename_df = colnames_curation, 
                              rename_col = controlled_vocab, 
                              old_col = colname, 
                              rm_col = TRUE)

knitr::kable(head(data_raw_rename1[[1]]))
knitr::kable(head(data_raw_rename1[[2]]))

save(data_raw_rename1,
     file = here::here(knitroutput_folder, "data_raw_rename1.RData"))

Some functions that may help explore the columns within the list of data

There are often instances where column names are not used consistently between files or there are similar column names within one dataset (test vs. testwt). The following functions may help to summarize and explore the data.

List the number of columns that match a given regex per dataset.

cols_summary <- summarize_ncol(data_raw_rename1, c("entry", "length"))
knitr::kable(cols_summary)

List the proportion of files that contain a given

cols_prop <- summarize_colnames(data_raw_rename1, c("entry", "length"))
knitr::kable(cols_prop)

Pull all the columns that match a given regex. This is useful for checking to see whether column names have been used consistently within and across datasets.

cols <- select_colsfromlist(data_raw_rename1,  c("entry", "length"))
knitr::kable(cols)

Bind all data.frames

data_all1 <- bind_rows(data_raw_rename1)
save(data_all1,
     file = here::here(knitroutput_folder,"data_all1.RData"))

Curate column contents

See the factcuratoR package for validating columns and column contents



IdahoAgStats/facthelpeR documentation built on Jan. 29, 2024, 9:22 a.m.