In this vignette we'll describe the process used to prepare and check data in the CHAI project where 207 participant-daus were annotated for 5 passes (cooking, indoors/outdoors, travel, occupation, presence of other combustion) amounting to 1035 coding results files. Moreover, 46 participant-days were coded a second time by a coder different from the first one for computing inter-rater agreement. For each participant-day, participants were instructed to wear the autographer at all times, except of course when they didn't feel at ease recording their environment, e.g. during personal care or if someone was breastfeeding (in that case they could use the cache of the camera) and except at night during which the device had to be turned off to preserve the battery.

Annotation

Annotation ontology in the CHAI project

path_dico <-  system.file("extdata", "dico_coding_2016_01.csv", package = "watchme")
dico <- readr::read_csv2(path_dico)
knitr::kable(dico)

Software used for annotation

We chose to use the XnView MP software after getting some issues with the Doherty browser. The guidelines for using the software are included in this package, you'll find where exactly by typing system.file("extdata", "annotating_pictures_with_xnviewMP.pdf", package = "watchme") in the R console after installing the package.

Coder training

All coders underwent training before being taken on for the coding. An useful function for assessing training success was to output differences from coding results of two coders, so that they might be able to discuss the problems.

library("watchme")
data('coding1')
data('coding2')
# With two coders
results_list <- list(coding1, coding2)
names_list <- c('Cain', 'Abel')
watchme_output_differences(results_list = results_list,
 names_list = names_list)

Data preparation

Parsing of filenames

In the CHAI project, filenames contain the village ID, the participant ID, the date, the number of the session for the participant, the pass (or group of codes). They were parsed using a function from an internal package that produced a data.frame with all the variables from the filenames including the original filenames. The same function also allowed checking that all filenames were correct, i.e. containing existing participant IDs.

Conversion of all files

We wrote this function

library("purrr")
library("dplyr")
library("readr")
library("tidyr")
prepare_data_from_df <- function(df, path){
  print(df$filename)
  path_results <- paste0(getwd(), path, df$filename)
  sep_results <- "\t"
  path_dico <-  paste0(getwd(), "/raw_data/dicos/dico_coding_2016_01_", df$pass, ".csv")
  sep_dico <- ';'

  results <- watchme_prepare_data(path_results = path_results,
                                  sep_results = sep_results,
                                  path_dico = path_dico,
                                  sep_dico = sep_dico,
                                  tz = "Asia/Kolkata",
                                  robust_reading = TRUE,
                                  participant_id = toString(c(df[,c("villageID",
                                                                    "participantID",
                                                                    "session",
                                                                    "repeated")],
                                                              as.character(df$"date"),
                                                              df[,c("deviceType",
                                                                    "deviceID")])))
  results$image_path <- gsub('\"', "", results$image_path)
  results <- filter_(results, lazyeval::interp(~image_path != ""))
  results
}

for transforming each file and we mapped it to the files data.frame prepared from filenames.

results_list <- files %>% purrrlyr::by_row(prepare_data_from_df,
                                        path = "/raw_data/AutographerCodes_NewProtocol/")

In the resulting results_list there is one line per pass with the coding results in one list column. We transformed it to have a data.frame with one line per participant-day.

tables_participantdays <- results_list %>%
  group_by(participantID, date) %>%
  select(participantID, date, .out) 
tables_participantdays <- tables_participantdays %>%
  summarize(table = list(.out)) 
tables_participantdays <- tables_participantdays %>% 
  group_by(participantID, date)

Binding of the 5 passes of each day

Now there is a list in each line that contains all 5 coding results data.frames from that participant-day. They are then combined after identifying problematic participant-days where e.g. one of the passes had less pictures than the others. Such problematic files were corrected until there was no formatting issues at all.

We admit it's a quite ugly code but it did its job.

watchme_combine_results2 <- function(df, common_codes){
  output <- watchme_combine_results(df, common_codes = common_codes) 
  dico <- attr(output, "dico")
  output <- separate_(output,
                      "participant_id", 
                      into = c("villageID",
                               "participant_id",
                               "session",
                               "repeated",
                               "date",
                               "deviceType",
                               "deviceID"),
                      sep = ",") %>%
    mutate_(date = lazyeval::interp(~ymd(date)))
  attr(output, "dico") <- dico
  output
}
problem <- tibble()
notgood <- NULL
participant_days <- unique(select(tables_participantdays, participantID, date))
for(i in 1:nrow(participant_days)){
  lala <- try(left_join(participant_days[i,], tables_participantdays, by = c("participantID", "date")) %>%
                mutate(table = list(watchme_combine_results2(table[[1]],
                                                             common_codes = c("non_codable")))),
              silent = TRUE)
  if(class(lala) == "try-error"){
    numbers <- toString(unique(unlist( tables_participantdays[i,]$table[[1]] %>% lapply(nrow))))

    now <- bind_cols(ungroup(tables_participantdays[i,1:2]), data.frame(numbers = numbers))
    problem <- bind_rows(problem, now)
    notgood <- c(notgood, i)
  }

}

if(nrow(problem) > 0){
  print("hey there are some formatting issues for a few participant-days!")
  probs <- NULL
  for(i in notgood){
    times <- do.call("cbind", lapply(tables_participantdays[i,]$table[[1]],
                                     "[[", "image_time"))
    times <- as.data.frame(times)

    times <- mutate_all(times, posixing)
    posixing <- function(x){
      as.POSIXct(x, origin = "1970-01-01")
    }

    prob <- which(vapply(apply(times,1, unique), length, FUN.VALUE = 1) > 1)
    prob <- times[c(min(prob) - 1, prob),]
    names(prob) <- c("CK", "IO", "OP", "PM", "TP")
    prob <- mutate(prob, participantID = tables_participantdays[i,]$participantID)
    probs <- bind_rows(probs, prob)
  }

  write_csv(problem, path = "raw_data/problems.csv")
  write_csv(probs, path = "raw_data/problems_help.csv")

  tables_participantdays <- tables_participantdays[- notgood,] %>%
    mutate(table = list(watchme_combine_results2(table[[1]],
                                                 common_codes = c("non_codable"))))

The table was then unnested and saved. It was still intermediary because we needed to remove calibration pics.

tables_participantdays <- ungroup(tables_participantdays)
tables_participantdays <- select(tables_participantdays, - date)
tables_participantdays <- unnest(tables_participantdays, table)
tables_participantdays <- select(tables_participantdays, - participant_id)

tables_participantdays <- mutate(tables_participantdays, image_time = as.character(image_time))
write_csv(tables_participantdays, path = "intermediary_data/autographer_coding.csv")

Remove calibration pics

At the beginning of each measurements session the autographer was used to take pictures of a clock. This pictures were not included in the study. For each participant-day we knew the name of the first "real" picture of the day and removed the others.

autographer <- read_csv("intermediary_data/autographer_coding.csv")

############################################################
#                                                          #
#                     calibration pics                     #
#                                                          #
############################################################
calibration <- read_csv("raw_data/calibration_pics.csv")
names(calibration) <- c("filename", "startlabel")
calibration <- mutate(calibration, startlabel = gsub("\\\t", "", startlabel))
calibration <- mutate(calibration, filename = paste0(filename, ".csv"))
calibration <- mutate(calibration, filename = gsub("-", "_", filename))
calibration <- purrrlyr::by_row(calibration, function(df){
  output <- parse_filename(df$filename)
  select_(output, quote(- filename))
}) %>%
  unnest(.out) 
calibration <- mutate(calibration, calibpic = TRUE)
calibration <- mutate(calibration, participantday = paste(participantID, date))
calibration <- arrange(calibration, participantday)


############################################################
#                                                          #
#                     find first pic                       #
#                                                          #
############################################################

autographerbis <- mutate(autographer, participantday = paste(participantID, date))
autographerbis <- arrange(autographerbis, participantday)
calibration <- filter(calibration, participantday %in%autographerbis$participantday)

autographerbis <- split(autographerbis, autographerbis$participantday)


calibrationbis <- split(calibration, calibration$participantday)

findpics <- function(coding_df, calibration_df){
  firstpic <- min(which(grepl(calibration_df$startlabel[1], coding_df$image_path)))
  coding_df[firstpic:nrow(coding_df),]
}

autographerbis <- map2(autographerbis, calibrationbis, findpics)
autographerbis <- bind_rows(autographerbis)

############################################################
#                                                          #
#                           save                           #
#                                                          #
############################################################

write_csv(autographerbis, path = "final_data/autographer_coding.csv")

Data checks

Number of images

After each session, the autographer camera produces a file which is a table of all images times and paths. We checked that the number of annotated pictures was equal to the number of pictures according to this file. In the cases where it was not, it was checked whether the pictures had been deleted for a good reason, e.g. privacy issue or their belonging to another participant-day whose pictures had not been removed from the camera yet.

This is how we extracted the number of images from the autographer file.

# get paths
files <- dir(paste0(getwd(), "/raw_data/ImageDat/"))
files <- files[grepl("dat", files)]
files <- chaimagic::batch_parse_filename(files)
files <- files$dataOk

find_count <- function(df){
  # open imageDAT file
  imagedat <- xml2::read_xml(paste0(getwd(), "/raw_data/ImageDat/", df$filename[1]))

  # count lines
  recs <- xml2::xml_find_all(imagedat, "//Image")
  return(length(xml2::xml_text(recs)))

}

# now add columns with count
files <- files %>% 
  purrrlyr::by_row(find_count, .to = "no_of_pics", .collate = "col")

write_csv(files, path = "raw_data/imageDAT.csv")

Checks based on codes

Compulsory codes

For each pic, it was compulsory to have exactly one of the following codes: non_codable, indoors, outdoors, in_vehicle, mixed (neither indoors nor outdoors, e.g. on a doorstep). This is how we identified problematic pics, whose annotations were corrected by the same coder who had coded the pass (except one coder who had by then left the study) until there was no issue.

autographer <- purrrlyr::by_row(autographer,
              function(df){
                sum(df$non_codable + df$indoors + df$outdoors + df$mixed + df$in_vehicle)
              }, .to = "loc_info", .collate = "cols")

filter(autographer, loc_info %in% c(0, 2)) %>%
       select(participantID, date, image_time, image_path, non_codable, indoors, outdoors,
              mixed, in_vehicle, loc_info) %>%
  write_csv("raw_data/problems/not_one_location.csv")

Codes incompatibility

We also had checks based on codes incompatibility only: it was impossible to have more than 1 code for the group occupation or more than 1 code for the group travel. For both groups, it was possible to have 0 or 1 code. This is how we find problematic pics.

############################################################
#                                                          #
#                          travel                          #
#                                                          #
############################################################
autographer <- filter(autographer, !non_codable)
# Travel by bus;Travel by bus;Travel
# Travel by bicycle;Travel by bicycle;Travel
# Travel by auto;Travel by auto;Travel
# Travel by motorcycle;Travel by motorcycle;Travel
# Participant presence on road;Participant presence on road;Travel
autographer <- purrrlyr::by_row(autographer,
                             function(df){
                               sum(df$travel_by_bus + df$travel_by_bicycle + df$travel_by_auto + df$travel_by_motorcycle +
                                     df$participant_presence_on_road)
                             }, .to = "travel_info", .collate = "cols")
filter(autographer, travel_info == 2) %>%
  select(participantID, date, image_time, image_path, 
         dplyr::contains("travel"), participant_presence_on_road, travel_info) %>%
  write_csv("raw_data/problems/not_one_travel.csv")

############################################################
#                                                          #
#                        occupation                        #
#                                                          #
############################################################

# Presence at Office or Shop;Presence at Office or Shop;Occupation
# Presence at Work Field;Presence at Work Field;Occupation
# Presence in Industry;Presence in Industry;Occupation
# Presence in Informal Work;Presence in Informal Work;Occupation
autographer <- purrrlyr::by_row(autographer,
                             function(df){
                               sum(df$presence_at_office_or_shop + df$presence_in_industry + 
                                     df$presence_in_informal_work  +
                                     df$presence_at_work_field)
                             }, .to = "occup_info", .collate = "cols")
filter(autographer, occup_info == 2) %>%
  select(participantID, date, image_time, image_path, 
         presence_at_office_or_shop, presence_in_industry,
         presence_in_informal_work, presence_at_work_field,
         occup_info) %>%
  write_csv("raw_data/problems/not_one_occup.csv")

Comparisons between coders



masalmon/watchme documentation built on May 21, 2019, 12:41 p.m.