In Spring 2016, we had our data-entry team re-enter test scores gathered in our studies, so that we could find data-entry discrepancies. This script compares the original to the re-entered scores.

Studies under consideration

Data from the following studies are checked:


Participant pool comparison

Do the same participants contribute scores in each set?

df_both <- bind_rows(df_dirt, df_info_long)
df_sources_per_id <- df_both %>% 
  select(-Variable, -Value) %>% 
  distinct() %>% 
  mutate(Found = TRUE) %>% 
  spread(Source, Found) %>% 
  arrange(Study, ParticipantID)
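To illustrate what the presence table looks like, here is a minimal sketch on toy data. The rows in `df_toy` and the name `toy_sources_per_id` are hypothetical and not taken from the studies; only the column names mirror the pipeline above. A participant missing from one source ends up with `NA` in that source's column.

```r
library(dplyr)
library(tidyr)

# Toy long-format scores (hypothetical values, for illustration only)
df_toy <- tribble(
  ~Source,           ~Study, ~ParticipantID, ~Variable, ~Value,
  "DIRT",            "S1",   "001",          "Score",   "10",
  "ParticipantInfo", "S1",   "001",          "Score",   "10",
  "DIRT",            "S1",   "002",          "Score",   "12",
  "ParticipantInfo", "S1",   "003",          "Score",   "9"
)

# Same steps as above: one row per participant, one column per source
toy_sources_per_id <- df_toy %>%
  select(-Variable, -Value) %>%
  distinct() %>%
  mutate(Found = TRUE) %>%
  spread(Source, Found) %>%
  arrange(Study, ParticipantID)

toy_sources_per_id
```

In this toy table, participant 002 has `NA` under `ParticipantInfo` and participant 003 has `NA` under `DIRT`, so `is.na()` on either column picks out the participants missing from that source.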

Participants in original score-set ("ParticipantInfo") not in the re-entered score-set ("DIRT"):

df_sources_per_id %>% 
  filter(is.na(DIRT))

Participants in re-entered score-set ("DIRT") who visited the lab but are not in the original score-set ("ParticipantInfo"):

df_sources_per_id %>% 
  filter(is.na(ParticipantInfo))

Value comparison

We now compare the scores in the two score-sets. This check is performed only on participants who appear in both score-sets.

# Find which kids appear in both sources
df_children_to_compare <- df_sources_per_id %>% 
  filter(!is.na(DIRT), !is.na(ParticipantInfo)) %>% 
  select(Study, ParticipantID)

df_comparing_scores <- df_both %>% 
  semi_join(df_children_to_compare, by = c("Study", "ParticipantID"))

discrepancies <- df_comparing_scores %>%
  # Compare one variable at a time so each Value column gets its own type
  split(.$Variable) %>%
  lapply(readr::type_convert) %>%
  lapply(spread_, "Source", "Value") %>% 
  lapply(function(df) filter(df, DIRT %!==% ParticipantInfo)) %>%
  # Keep only the variables that have at least one mismatch
  Filter(function(df) nrow(df) != 0, .)
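The `%!==%` operator is not defined in this section. A plain `!=` returns `NA` when either side is missing, and `filter()` drops rows whose condition is `NA`, so a score present in one source but missing in the other would go unreported. The sketch below shows one way an NA-aware inequality could be written. The name `%differs%` is a hypothetical stand-in, and the script's actual `%!==%` may behave differently.

```r
# Hypothetical NA-aware "not equal" operator (an assumption, not the
# script's own %!==% definition).
`%differs%` <- function(x, y) {
  both_missing <- is.na(x) & is.na(y)
  one_missing <- xor(is.na(x), is.na(y))
  # Both present and unequal; FALSE & NA is FALSE, so no NA leaks through
  differs <- !is.na(x) & !is.na(y) & x != y
  # TRUE when exactly one side is missing, or both are present but differ
  (one_missing | differs) & !both_missing
}

c(1, NA, 3, NA) %differs% c(1, 2, 4, NA)
# → FALSE TRUE TRUE FALSE
```

Two missing values count as a match here, which is a design choice: a field left blank in both entries is not a data-entry discrepancy.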


This table lists all the fields that were checked and whether any discrepancies were found in that field.

columns_with_discrepancies <- discrepancies %>% names()

results <- data_frame(
  Check = "Discrepancies", 
  Date = format(Sys.Date()),
  Passing = nrow(bind_rows(discrepancies)) == 0)

readr::write_csv(results, "./inst/audit/results_integrity.csv")

columns_with_errors <- discrepancies %>% 
  lapply(. %>% select(Study, Variable)) %>% 
  bind_rows() %>% 
  distinct() %>% 
  arrange(Study, Variable) %>% 
  mutate(Passing = FALSE)

# Make a mock-error table if there are no errors
if (nrow(columns_with_errors) == 0) {
  columns_with_errors <- data_frame(
    Study = NA_character_, Variable = NA_character_, Passing = FALSE
  )
}

columns_in_info_sheets %>% 
  select(Study, Variable) %>% 
  distinct() %>% 
  left_join(columns_with_errors, by = c("Study", "Variable")) %>% 
  # Fields with no row in the error table had no discrepancies
  mutate(Passing = if_else(is.na(Passing), TRUE, Passing)) %>% 
  arrange(Study, Variable) %>% 
  mutate(Status = if_else(Passing, ":white_check_mark:", ":x:")) %>% 
  rename(` ` = Status)
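The pass/fail table relies on a left join: every checked field is kept, and fields with no matching row in the error table come back with `NA` in `Passing`, which is then read as a pass. A minimal sketch of that step, using hypothetical toy tables (`toy_checked`, `toy_errors`) rather than the real field lists:

```r
library(dplyr)

# Hypothetical inputs: every checked field, and the subset with errors
toy_checked <- tribble(
  ~Study, ~Variable,
  "S1",   "Age",
  "S1",   "Score",
  "S2",   "Score"
)
toy_errors <- tribble(
  ~Study, ~Variable, ~Passing,
  "S1",   "Score",   FALSE
)

toy_status <- toy_checked %>%
  left_join(toy_errors, by = c("Study", "Variable")) %>%
  # Fields absent from the error table come back as NA, meaning they passed
  mutate(Passing = if_else(is.na(Passing), TRUE, Passing)) %>%
  arrange(Study, Variable)

toy_status$Passing
# → TRUE FALSE TRUE
```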


These are all the mismatching values.


Unchecked fields

The following columns in DIRT were not checked because they have no matching column in the participant info spreadsheets.

columns_in_dirt_by_study <- df_dates %>% 
  select(Study, Variable) %>% 
  distinct()

columns_in_dirt_by_study %>% 
  anti_join(columns_in_info_sheets) %>% 
  filter(Variable != "DOB", Variable != "Visited") %>% 
  mutate(Status = ":grey_question:") %>% 
  rename(` ` = Status)
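An anti-join keeps the rows of the first table that have no match in the second, which is how the unchecked fields are found. A small sketch with hypothetical column lists (`toy_dirt_columns`, `toy_info_columns`), assuming the two tables are matched on `Study` and `Variable`:

```r
library(dplyr)

# Hypothetical column inventories for one study
toy_dirt_columns <- tribble(
  ~Study, ~Variable,
  "S1",   "Score",
  "S1",   "DOB",
  "S1",   "Notes"
)
toy_info_columns <- tribble(
  ~Study, ~Variable,
  "S1",   "Score"
)

toy_unchecked <- toy_dirt_columns %>%
  # Keep DIRT columns with no counterpart in the info sheets
  anti_join(toy_info_columns, by = c("Study", "Variable")) %>%
  # Drop the two bookkeeping fields, as in the pipeline above
  filter(Variable != "DOB", Variable != "Visited")

toy_unchecked
```

Here only the `Notes` row remains: `Score` has a counterpart in the info sheets, and `DOB` is excluded by the filter.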

