knitr::opts_chunk$set(echo = TRUE, comment = "#>")

Summary

Description

This project is part of the Agroimpacts working group and seeks to use R programming to give accuracy statistics on polygons of farm plots in Zambia which have been created by Amazon Mechanical Turk crowdsourced Workers. Specifically, this project will compare Mechanical Turk polygons to a sampling of ground-truthed polygons to determine the true positive, false positive, and false negative areas of each polygon. The package will consist primarily of a reproducible program which will iterate through all of the ground truth polygons in a file, match them with Mechanical Turk polygons by a common grid value, then for each grid, give the accuracy statistics of each polygon in that grid which intersects with 5% or more of a ground truth polygon.

These statistics will be useful in understanding the quality of work done by the Mechanical Turk workers and to draw conclusions about how reliable other (unsampled) polygons are, as well as which workers' products are most reliable.

Primary Objectives

Existing Code / Packages:

A major component of this project is the accuracy statistics logic, which was derived from the rmapaccuracy package (developed by members of the Agroimpacts project).

The map accuracy function creates each of our categories in the following way, which will be key to gathering the statistics:

#"truth" is a given truth polygon
#"maps" is a given Mechanical Turks polygon

#True Positive
tp <- st_intersection(truth, maps)  

#False Positive
fp <- st_difference(maps, truth)  

#False Negative
fn <- st_difference(truth, maps)  

#Overall Accuracy

The following packages were utilized either directly or indirectly in our final output

Data

This project relies on three data sets:

The image below shows what the looping structure we have created will compile for a given truth polygon after the first few lines of the script. This is not actually plotted in the script, but is shown here for visualization purposes.

knitr::include_graphics("figures/Overlay2.png")

Below is an example of how a truth polygon and a worker polygon's geometries are evaluated for accuracy:

knitr::include_graphics("figures/Accuracy_images_all.png")

The truth polygon and worker polygon are shonw in yellow and pink, respectively. False positive areas are shown in red (areas present in worker polygons, but not in their truth polygon counterpart). True postive area are shown in blue (shared positive geometry between a worker and a truth polygon). False negative areas are shown in grey (areas present in truth polygon, but not in their crowdsourced counterparts).

Methodology and Workflow

Methodology

Our original code relied on a series of for loops that iterated over truth polygons and then attempted to match them with their corresponding mechanical turk assignments with the goal of comparing each valid worker assignment to its corresponding truth polygon. We worked through about 65% of our workflow before consulting with Professor Estes, where we developed a new workflow approach. However, the work done with this original workflow allowed us to familiarize ourselves with the data and the iteration and indexing techniques needed for the completion of this project using the second workflow.

For reference, this incomplete code is included at the end of the document, Appendix 1.

Some disadvantages of our original approach were:

These disadvantages pushed us to adopt a newer approach, which substituted for loops with the use of lapply and structured the filtering. This resulted in:

Final Workflow

finaldataset <- the aggregated final output tibble with the following data columns:

| Column | Description | |----------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------| | truthid | Truth polygon identification number | | gid | Unique grid identification number | | workerid | Unique worker identification number | | polyid | Polygon identification number of a worker polygon that satisfies our threshold of 5% intersected area or above, listed as "NA" for polygons with no intersection. | | tp | True positive, intersected area between truth and worker polygon. | | fp | False positive, area present in worker polygon that is not present in truth polygon. | | fn | False negative, area omitted by worker polygon that is present in truth. | | accuracy | Percentage of true positive area shared between the worker and the truth polygon |

Final Code

Below is the final script, which must be read in with the datasets below. If this does not build directly with the package, the files can be downloaded from our package in "data/CSV Data/" and read locally into the script.

library(sf)
library(dplyr)

## DATA MANAGEMENT
# maps made by Mechanical Turk workers
zamworkers_st  <- st_read("data/CSV Data/worker_fmaps.sqlite")

# ground-truth fields
zamtruth_st  <- st_read("data/CSV Data/truth_fmatch.sqlite")

# Worker ids and assignments assignments
zamassn <- read.csv("data/CSV Data/fassignments.csv")

# joining zamassn to worker polys to associate worker ids with polygons
zamworkers_st <- inner_join(x = zamworkers_st, y = zamassn, by = "assignment_id")

# Accuracy Function for use in the final loop
acc_stats_sum <- function(tp, fp, fn) {
  agree <- tp / sum(tp, fn)  # Simple agreement class 1
  if(is.na(agree)) agree <- 0  # Set to 0 if NA
  accuracy <- sum(tp) / sum(tp, fp, fn)
  #TSS <- agree + (tn / (fp + tn)) - 1  # Sens + specificity - 1
  r1 <- round(accuracy, 2)
  out <- c(r1)
  names(out) <- c("accuracy")
  return(out)
}

# Core looping structure

finaldataset <- do.call(rbind,lapply(1:nrow(zamtruth_st), function(x) {
  # x <- 1 (for testing)
  # Initializing
  p <- zamtruth_st[x, ]
  wdat <- zamassn %>% filter(zamassn$name == as.character(p$name))
  wmaps <- zamworkers_st %>% filter(name.x == as.character(p$name)) %>%
    filter(.$worker_id %in% wdat$worker_id)
  wnomaps <- wdat %>% filter(!worker_id %in% wmaps$worker_id)

# accuracy statistics info for all assignments corresponding with each truth poly
  # first tibble for case of worker assigned to grid, but no polygons mapped
  wnomap_stat <- tibble(truthid = p$id, gridid = p$name,
                        worker_id = wnomaps$worker_id, polyid = NA, tp = 0,
                        fp = 0, fn = p %>% st_area(), accuracy = 0)
  # New Lapply to iterate through polygons mapped by workers in a given grid
  temp_wmap_stat <- do.call(rbind, lapply(unique(wdat$worker_id), function(y) {
    # y <- unique(wdat$worker_id)[1] (for testing)
    ipoly <- st_intersects(p, wmaps %>% filter(worker_id == y)) %>% unlist
    wmaps_sub <- wmaps %>% filter(worker_id == y) %>% st_buffer(., 0)
    # first condition for cases of no intersecting polygons with truth polygon
    if(length(ipoly) == 0){
      tp <- 0
      fp <- 0
      fn <- p %>% st_area()
      acc <- 0
      # first tibble summarizing results of first condition
      wmap_stat <- tibble(truthid = p$id, gridid = p$name, worker_id = y,
                          polyid = NA, tp = tp, fp = round(fp, 4),
                          fn = round(fn, 4), accuracy = round(acc, 5))
    } else if(length(ipoly) == 1) {
      # second condition for cases of a single intersection
      tempwmaps <- st_buffer(wmaps %>% filter(worker_id == y), 0)
      iarea <- st_intersection(p, st_buffer(tempwmaps, 0)) %>% st_area()
      # excludes worker polygons that share less than 5% of the truth polygon's area
      if((iarea %>% as.numeric())/(p %>% st_area() %>% as.numeric()) >= 0.05){
        tp <- st_intersection(p, wmaps_sub) %>% st_area() %>% sum(.)
        fp <- st_difference(wmaps_sub, p) %>% st_area() %>% sum(.)
        fn <- st_difference(p, wmaps_sub) %>% st_area() %>% sum(.)
        acc <- sum(tp) / sum(tp, fp, fn)
        # second tibble summarizing the results of this condition
        wmap_stat <- tibble(truthid = p$id, gridid = p$name, worker_id = y,
                            polyid = tempwmaps[ipoly[1],]$gid, tp = tp,
                            fp = round(fp, 4), fn = round(fn, 4),
                            accuracy = round(acc, 5))
      } else{
        # else statement assigns "NA" to polygons which do not meet 5% threshold 
        # to identify them.
        wmap_stat <- tibble(truthid = p$id, gridid = p$name, worker_id = y,
                            polyid = tempwmaps[ipoly[1],]$gid, tp = NA,
                            fp = NA, fn = NA,
                            accuracy = NA)
      }
    } else { # (for testing) if(length(ipoly) > 1){
      # third condition for cases where there are two or more polygons in the same grid
      iterlength <- length(ipoly)
      maxarea <- 0
      tempwmaps <- st_buffer(wmaps %>% filter(worker_id == y), 0)
      wmap_stat <- tibble()
      for (i in 1:iterlength) {
        iarea <- st_intersection(p, st_buffer(tempwmaps[ipoly[i],], 0)) %>%
          st_area()
        if((iarea %>% as.numeric())/(p %>% st_area() %>% as.numeric()) >= 0.05){
          # excludes worker polygons that share less than 5% of the truth polygon's area
          tp <- iarea %>% sum(.)
          fp <- st_difference(st_buffer(tempwmaps[ipoly[i],], 0), p) %>%
            st_area() %>% sum(.)
          fn <- st_difference(p, st_buffer(tempwmaps[ipoly[i],], 0)) %>%
            st_area() %>% sum(.)
          acc <- sum(tp) / sum(tp, fp, fn)
          # third tibble summarizing the results of this condition and appending
          # to any previous iterations
          wmap_stat <- rbind(wmap_stat, tibble(truthid = p$id, gridid = p$name,
                                               worker_id = y,
                                               polyid = tempwmaps[ipoly[i],]$gid,
                                               tp = tp, fp = round(fp, 4),
                                               fn = round(fn, 4),
                                               accuracy = round(acc, 5)))

        } else {
          # else statement assigns "NA" to polygons which do not meet 5% threshold 
          # to identify them.
          wmap_stat <- rbind(wmap_stat, tibble(truthid = p$id, gridid = p$name,
                                               worker_id = y,
                                               polyid = tempwmaps[ipoly[i],]$gid,
                                               tp = NA, fp = NA,
                                               fn = NA,
                                               accuracy = NA))
        }
      }
      wmap_stat <- wmap_stat
    }
  }))
  wmap_stat <- rbind(temp_wmap_stat, wnomap_stat)
}))

View(finaldataset)

Here is the first few lines of the output tibble generated by this code:

knitr::include_graphics("figures/table.png")

Conclusion / Outcomes

Outcomes

This script achieved project goals by generating accuracy assessments for Amazon Mechanical Turk worker maps in Zambia using ground truth polygons as a reference. This script could be adapted and used in other crowdsourced digitizing accuracy assessment initiatives and the data captured could be further analyzed to determine the accuracy of each worker, accuracy within each grid or accuracy by polygons of certain size thresholds, to give a few examples.

Our objectives were reached as follows:

Important Note/Consideration

As the looping code is written, we have integrated a thresholding mechanism to exclude polygons which do not meet a threshold of greater than or equal to 5% area intersection with a given truth polygon. Worker polygons intersecting under this threshold were deemed to be erronously intersecting the truth polygon. This means that no accuracy statistics are reported in the tibble for these polygons, even though other "misses" (including wnomaps and cases where a worker mapped a polygon but it did not intersect with the truth polygon were recorded as false negatives with an area that equalled the area of the truth polygon (a full false negative or "miss")). Because these excluded polygons did not truly have the false negative area of the full truth polygon, we did not feel comfortable reporting that number as the statistic, but also did not want to report full accuracy statistics as these polygons are likely error and this would defeat the purpose of thresholding at all.

We ultimately chose to report the excluded polygons giving only the polygon ID number and reporting "NA"" for all accuracy statistics so that they can be easily identified in the final tibble if full accuracy statistics, a false negative reporting, or full exclusion of these polygons is desired, this will only require minor changes to the core looping structure to report these changes.

Please feel free to contact any of the authors for more information or assisstance.

Appendix

Appendix 1: Original looping structure for illustration of early workflow

for (i in 1:nrow(zamtruth_sub)){ 
  if(zamtruth_sub$name[i] %in% zamworkers_sub$name){
    # print(i)
    templength <-
      length(zamworkers_sub[zamworkers_sub$name == zamtruth_sub$name[i], "gid"])
    print(templength)
    # 
    temp <- zamworkers_sub[zamworkers_sub$name == zamtruth_sub$name[i], "gid"]
    print("11111111111111111111111111111111111111")
    # truth polygon that temp intersected with
    temptruth <-
      zamtruth_sub[zamtruth_sub$name==zamtruth_sub$name[i],]
    tempgidlist <- list()
    for (poly in 1:templength) {
      if(gIntersects(spgeom1 = temp[poly,],
                     spgeom2 = temptruth,
                     byid = T)==TRUE){
        # temp[poly,] %>% plot()
        # temptruth %>% plot(col="red",add=TRUE)
        # print("haha")
        tempgidlist <- tempgidlist %>%
          append(polygons(temp[poly,]))

      }
      else{
        message("no intersection")
      }

      # print("2222222222222222222222222222222222222")
      print(tempgidlist)
    }
  }
}


agroimpacts/farmapz documentation built on May 7, 2019, 12:51 a.m.