knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
facthelpeR is a set of functions that are used to read in and do a first pass at cleaning up data.
Load facthelpeR
library(facthelpeR) library(here)
auxiliary_files is a folder that contains helper files that are used in the curation process. Generally, the *_aux.csvs are manually edited to match controlled vocabularies. They live in a separate folder than the output so that they don't get written over accidentally.
main_folder <- "inst/extdata/OSU_2019_example" data_folder <- here::here(main_folder, "data-raw") auxiliary_files <- here::here(main_folder, "auxiliary_files") knitroutput_folder <- here::here(main_folder, "output")
If there are any mistakes in numeric values or one-off typos, sometimes it is faster to fix it in the raw files. Note any manual changes here in the .Rmd for reproducibility.
If the header is not in the first row (or spans multiple rows),
write.csv(sheet_info, ...)
and manually fill in the
number of rows to skip in each sheet. This data.frame will be read into
read_multsheets()
files_csv <- list.files(data_folder, recursive = TRUE, pattern = ".csv") sheet_info_csv <- data.frame(filename = files_csv) %>% filter(!str_detect(filename, "curation_files")) %>% mutate(list_names = basename(filename)) %>% mutate(list_names = str_remove(list_names, ".csv")) knitr::kable(sheet_info_csv)
This section is just to demonstrate how to handle excel files with multiple sheets. The data won't be processed in the rest of this vignette.
files_xls <- list.files(data_folder, recursive = TRUE, pattern = ".xls") sheet_info_xls <- list_sheetnames(data_folder, files_xls) knitr::kable(sheet_info_xls)
data_raw1 <- read.multsheets(data_folder, sheet_info_csv, na = c("NA"), col_names = TRUE) data_raw2 <- imap(data_raw1, function(x, y){ x %>% mutate(sourcefile = y)}) knitr::kable(head(data_raw2[[1]])) knitr::kable(head(data_raw2[[2]])) #save(data_raw2, # file = here::here(knitroutput_folder, "data_raw2.RData"))
Use facthelpeR::rm_dup_col()
because it can detect and remove duplicates when:
- column names are not exact, but column contents are
- if the only difference between two column contents are that one has some NAs,
then that column is removed and the other is retained
data_raw3 <- imap(data_raw2, function(x,y){ print(y) rm_dup_col(x, FALSE) %>% select(where(~!all(is.na(.)))) })
colnames <- summarize_variables(data_raw3, reg_ex = NULL) colnames_summary <- colnames %>% group_by(colname) %>% summarize(across(everything(), ~ paste(na.omit(unique(.x)), collapse = "; "))) write.csv(colnames_summary, here::here(knitroutput_folder, "colnames_summary.csv"), row.names = FALSE) knitr::kable(colnames_summary)
First, understand what each column means (description and units) by:
checking through annual reports
checking with Julia
lastly, asking the PI
Then, fill out the auxiliary file with the appropriate controlled vocabulary.
If there isn't a corresponding controlled vocabulary, then need to:
decide if the column can be captured by current controlled vocabulary. If so, standardize to the controlled vocabulary
decide whether we need to keep the variable at all (can fill in auxiliary file with "rm_col" in the controlled_vocab column and rename.col() will remove them if rm_col argument = TRUE)
decide whether a new controlled vocabulary should be added. If so, add to controlled vocabulary using the correct formatting, which is described in the README (I usually get Julia's opinion on this, especially before adding new vocabularies)
colnames_curation <- read.csv(here::here(auxiliary_files, "colnames_summary_aux.csv")) %>% # If a column name still needs to be checked/confirmed, leave the controlled_vocab # column blank in the "colnames_summary_aux.csv" file. # Before the renaming step, just set the unknowns to the # original name. mutate(controlled_vocab = ifelse(controlled_vocab == "", colname, controlled_vocab)) knitr::kable(colnames_curation) data_raw_rename1 <- rename_col(data_raw3, rename_df = colnames_curation, rename_col = controlled_vocab, old_col = colname, rm_col = TRUE) knitr::kable(head(data_raw_rename1[[1]])) knitr::kable(head(data_raw_rename1[[2]])) save(data_raw_rename1, file = here::here(knitroutput_folder, "data_raw_rename1.RData"))
There are often instances where column names are not used consistently between files or there are similar column names within one dataset (test vs. testwt). The following functions may help to summarize and explore the data.
summarize_variables()
used above.summarize_ncol()
and summarize_colnames
summarizes the information
collected from summarize_variables()
List the number of columns that match a given regex per dataset.
cols_summary <- summarize_ncol(data_raw_rename1, c("entry", "length")) knitr::kable(cols_summary)
List the proportion of files that contain a given
cols_prop <- summarize_colnames(data_raw_rename1, c("entry", "length")) knitr::kable(cols_prop)
Pull all the columns that match a given regex. This is useful for checking to see whether column names have been used consistently within and across datasets.
cols <- select_colsfromlist(data_raw_rename1, c("entry", "length")) knitr::kable(cols)
data_all1 <- bind_rows(data_raw_rename1) save(data_all1, file = here::here(knitroutput_folder,"data_all1.RData"))
See the factcuratoR package for validating columns and column contents
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.