In this vignette we'll describe the process used to prepare and check data in the CHAI project where 207 participant-daus were annotated for 5 passes (cooking, indoors/outdoors, travel, occupation, presence of other combustion) amounting to 1035 coding results files. Moreover, 46 participant-days were coded a second time by a coder different from the first one for computing inter-rater agreement. For each participant-day, participants were instructed to wear the autographer at all times, except of course when they didn't feel at ease recording their environment, e.g. during personal care or if someone was breastfeeding (in that case they could use the cache of the camera) and except at night during which the device had to be turned off to preserve the battery.
path_dico <- system.file("extdata", "dico_coding_2016_01.csv", package = "watchme") dico <- readr::read_csv2(path_dico) knitr::kable(dico)
We chose to use the XnView MP software after getting some issues with the Doherty browser. The guidelines for using the software are included in this package, you'll find where exactly by typing system.file("extdata", "annotating_pictures_with_xnviewMP.pdf", package = "watchme")
in the R console after installing the package.
All coders underwent training before being taken on for the coding. An useful function for assessing training success was to output differences from coding results of two coders, so that they might be able to discuss the problems.
library("watchme") data('coding1') data('coding2') # With two coders results_list <- list(coding1, coding2) names_list <- c('Cain', 'Abel') watchme_output_differences(results_list = results_list, names_list = names_list)
In the CHAI project, filenames contain the village ID, the participant ID, the date, the number of the session for the participant, the pass (or group of codes). They were parsed using a function from an internal package that produced a data.frame with all the variables from the filenames including the original filenames. The same function also allowed checking that all filenames were correct, i.e. containing existing participant IDs.
We wrote this function
library("purrr") library("dplyr") library("readr") library("tidyr") prepare_data_from_df <- function(df, path){ print(df$filename) path_results <- paste0(getwd(), path, df$filename) sep_results <- "\t" path_dico <- paste0(getwd(), "/raw_data/dicos/dico_coding_2016_01_", df$pass, ".csv") sep_dico <- ';' results <- watchme_prepare_data(path_results = path_results, sep_results = sep_results, path_dico = path_dico, sep_dico = sep_dico, tz = "Asia/Kolkata", robust_reading = TRUE, participant_id = toString(c(df[,c("villageID", "participantID", "session", "repeated")], as.character(df$"date"), df[,c("deviceType", "deviceID")]))) results$image_path <- gsub('\"', "", results$image_path) results <- filter_(results, lazyeval::interp(~image_path != "")) results }
for transforming each file and we mapped it to the files
data.frame prepared from filenames.
results_list <- files %>% purrrlyr::by_row(prepare_data_from_df, path = "/raw_data/AutographerCodes_NewProtocol/")
In the resulting results_list
there is one line per pass with the coding results in one list column. We transformed it to have a data.frame with one line per participant-day.
tables_participantdays <- results_list %>% group_by(participantID, date) %>% select(participantID, date, .out) tables_participantdays <- tables_participantdays %>% summarize(table = list(.out)) tables_participantdays <- tables_participantdays %>% group_by(participantID, date)
Now there is a list in each line that contains all 5 coding results data.frames from that participant-day. They are then combined after identifying problematic participant-days where e.g. one of the passes had less pictures than the others. Such problematic files were corrected until there was no formatting issues at all.
We admit it's a quite ugly code but it did its job.
watchme_combine_results2 <- function(df, common_codes){ output <- watchme_combine_results(df, common_codes = common_codes) dico <- attr(output, "dico") output <- separate_(output, "participant_id", into = c("villageID", "participant_id", "session", "repeated", "date", "deviceType", "deviceID"), sep = ",") %>% mutate_(date = lazyeval::interp(~ymd(date))) attr(output, "dico") <- dico output } problem <- tibble() notgood <- NULL participant_days <- unique(select(tables_participantdays, participantID, date)) for(i in 1:nrow(participant_days)){ lala <- try(left_join(participant_days[i,], tables_participantdays, by = c("participantID", "date")) %>% mutate(table = list(watchme_combine_results2(table[[1]], common_codes = c("non_codable")))), silent = TRUE) if(class(lala) == "try-error"){ numbers <- toString(unique(unlist( tables_participantdays[i,]$table[[1]] %>% lapply(nrow)))) now <- bind_cols(ungroup(tables_participantdays[i,1:2]), data.frame(numbers = numbers)) problem <- bind_rows(problem, now) notgood <- c(notgood, i) } } if(nrow(problem) > 0){ print("hey there are some formatting issues for a few participant-days!") probs <- NULL for(i in notgood){ times <- do.call("cbind", lapply(tables_participantdays[i,]$table[[1]], "[[", "image_time")) times <- as.data.frame(times) times <- mutate_all(times, posixing) posixing <- function(x){ as.POSIXct(x, origin = "1970-01-01") } prob <- which(vapply(apply(times,1, unique), length, FUN.VALUE = 1) > 1) prob <- times[c(min(prob) - 1, prob),] names(prob) <- c("CK", "IO", "OP", "PM", "TP") prob <- mutate(prob, participantID = tables_participantdays[i,]$participantID) probs <- bind_rows(probs, prob) } write_csv(problem, path = "raw_data/problems.csv") write_csv(probs, path = "raw_data/problems_help.csv") tables_participantdays <- tables_participantdays[- notgood,] %>% mutate(table = list(watchme_combine_results2(table[[1]], common_codes = c("non_codable"))))
The table was then unnested and saved. It was still intermediary because we needed to remove calibration pics.
tables_participantdays <- ungroup(tables_participantdays) tables_participantdays <- select(tables_participantdays, - date) tables_participantdays <- unnest(tables_participantdays, table) tables_participantdays <- select(tables_participantdays, - participant_id) tables_participantdays <- mutate(tables_participantdays, image_time = as.character(image_time)) write_csv(tables_participantdays, path = "intermediary_data/autographer_coding.csv")
At the beginning of each measurements session the autographer was used to take pictures of a clock. This pictures were not included in the study. For each participant-day we knew the name of the first "real" picture of the day and removed the others.
autographer <- read_csv("intermediary_data/autographer_coding.csv") ############################################################ # # # calibration pics # # # ############################################################ calibration <- read_csv("raw_data/calibration_pics.csv") names(calibration) <- c("filename", "startlabel") calibration <- mutate(calibration, startlabel = gsub("\\\t", "", startlabel)) calibration <- mutate(calibration, filename = paste0(filename, ".csv")) calibration <- mutate(calibration, filename = gsub("-", "_", filename)) calibration <- purrrlyr::by_row(calibration, function(df){ output <- parse_filename(df$filename) select_(output, quote(- filename)) }) %>% unnest(.out) calibration <- mutate(calibration, calibpic = TRUE) calibration <- mutate(calibration, participantday = paste(participantID, date)) calibration <- arrange(calibration, participantday) ############################################################ # # # find first pic # # # ############################################################ autographerbis <- mutate(autographer, participantday = paste(participantID, date)) autographerbis <- arrange(autographerbis, participantday) calibration <- filter(calibration, participantday %in%autographerbis$participantday) autographerbis <- split(autographerbis, autographerbis$participantday) calibrationbis <- split(calibration, calibration$participantday) findpics <- function(coding_df, calibration_df){ firstpic <- min(which(grepl(calibration_df$startlabel[1], coding_df$image_path))) coding_df[firstpic:nrow(coding_df),] } autographerbis <- map2(autographerbis, calibrationbis, findpics) autographerbis <- bind_rows(autographerbis) ############################################################ # # # save # # # ############################################################ write_csv(autographerbis, path = "final_data/autographer_coding.csv")
After each session, the autographer camera produces a file which is a table of all images times and paths. We checked that the number of annotated pictures was equal to the number of pictures according to this file. In the cases where it was not, it was checked whether the pictures had been deleted for a good reason, e.g. privacy issue or their belonging to another participant-day whose pictures had not been removed from the camera yet.
This is how we extracted the number of images from the autographer file.
# get paths files <- dir(paste0(getwd(), "/raw_data/ImageDat/")) files <- files[grepl("dat", files)] files <- chaimagic::batch_parse_filename(files) files <- files$dataOk find_count <- function(df){ # open imageDAT file imagedat <- xml2::read_xml(paste0(getwd(), "/raw_data/ImageDat/", df$filename[1])) # count lines recs <- xml2::xml_find_all(imagedat, "//Image") return(length(xml2::xml_text(recs))) } # now add columns with count files <- files %>% purrrlyr::by_row(find_count, .to = "no_of_pics", .collate = "col") write_csv(files, path = "raw_data/imageDAT.csv")
For each pic, it was compulsory to have exactly one of the following codes: non_codable, indoors, outdoors, in_vehicle, mixed (neither indoors nor outdoors, e.g. on a doorstep). This is how we identified problematic pics, whose annotations were corrected by the same coder who had coded the pass (except one coder who had by then left the study) until there was no issue.
autographer <- purrrlyr::by_row(autographer, function(df){ sum(df$non_codable + df$indoors + df$outdoors + df$mixed + df$in_vehicle) }, .to = "loc_info", .collate = "cols") filter(autographer, loc_info %in% c(0, 2)) %>% select(participantID, date, image_time, image_path, non_codable, indoors, outdoors, mixed, in_vehicle, loc_info) %>% write_csv("raw_data/problems/not_one_location.csv")
We also had checks based on codes incompatibility only: it was impossible to have more than 1 code for the group occupation or more than 1 code for the group travel. For both groups, it was possible to have 0 or 1 code. This is how we find problematic pics.
############################################################ # # # travel # # # ############################################################ autographer <- filter(autographer, !non_codable) # Travel by bus;Travel by bus;Travel # Travel by bicycle;Travel by bicycle;Travel # Travel by auto;Travel by auto;Travel # Travel by motorcycle;Travel by motorcycle;Travel # Participant presence on road;Participant presence on road;Travel autographer <- purrrlyr::by_row(autographer, function(df){ sum(df$travel_by_bus + df$travel_by_bicycle + df$travel_by_auto + df$travel_by_motorcycle + df$participant_presence_on_road) }, .to = "travel_info", .collate = "cols") filter(autographer, travel_info == 2) %>% select(participantID, date, image_time, image_path, dplyr::contains("travel"), participant_presence_on_road, travel_info) %>% write_csv("raw_data/problems/not_one_travel.csv") ############################################################ # # # occupation # # # ############################################################ # Presence at Office or Shop;Presence at Office or Shop;Occupation # Presence at Work Field;Presence at Work Field;Occupation # Presence in Industry;Presence in Industry;Occupation # Presence in Informal Work;Presence in Informal Work;Occupation autographer <- purrrlyr::by_row(autographer, function(df){ sum(df$presence_at_office_or_shop + df$presence_in_industry + df$presence_in_informal_work + df$presence_at_work_field) }, .to = "occup_info", .collate = "cols") filter(autographer, occup_info == 2) %>% select(participantID, date, image_time, image_path, presence_at_office_or_shop, presence_in_industry, presence_in_informal_work, presence_at_work_field, occup_info) %>% write_csv("raw_data/problems/not_one_occup.csv")
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.