From r system("git config --get remote.origin.url", intern = TRUE) on r date(), generated by r Sys.info()[["effective_user"]].

library(ggplot2)
library(scales)
library(popbioko)

Overview

This notebook concentrates on how we calculate the goodness-of-fit ratio (GOFR). It combines information from the "Accuracy Metrics" vignette and the "map comparison" vignette. If you want to run this notebook, check out the Github project, create a subdirectory of the project called inst/extdata, and unzip the zipfile in that directory. It will create subdirectories for the maps, the density maps, and the necessary shapefiles. You also need to install the popbioko project, which Rstudio does with a button under the Build menu.

We had a paper with a GOFR table and a GOFR explanation paragraph, and we were happy. Everything made sense. Then Drew changed everything, and now neither the explanations nor the numbers make sense. What happened, and how do we return to our happy place?

For one, the data changed. There are two kinds of maps, population maps and density maps. The density maps estimate population density from the population maps. We have estimated density maps in two different ways. One is to average over all pixels within a radius. The other is to more precisely draw a circle that crosses those same pixels and average area that covers parts of pixels.

I suspect that the size of the radius for blurring, for the two density maps, is different. I say this because plots showed more blurring for the integrated ones, and I did testing on them to see that they have $1\:\mbox{km}^2$ under the disc.

For another, I've been trying modifications to the GOFR because they seemed easier to explain and they didn't seem like they would make a difference. I was wrong. Let's take a look.

Possible GOFRs

The Actual GOFR

Define $H(x)$ is the gold standard at pixel $x$. Make $\hat{H}(x)$ the observed map. (We switched which variable gets the hat after Sean said the observed values generally get the hat for statistics.) We can use either population maps or density maps here. Use $\langle H\rangle$ for the mean of the gold standard and $\langle \hat{H}\rangle$ for the mean of the population map. $$ GOFR = \frac{\sum_x(\hat{H}(x) - H(x))^2}{\sum_x(\hat{H}(x) - \langle H\rangle)^2} $$

Normalize by Gold Standard Variance

I thought it was easier to say that this is sum of squared errors per pixel divided by gold standard variance. $$ GOFR = \frac{\sum_x(\hat{H}(x) - H(x))^2}{\sum_x(H(x) - \langle H\rangle)^2} $$

Normalize by Observed Variance

Or normalize by the map variance, if that works. The numbers look less huge. $$ GOFR = \frac{\sum_x(\hat{H}(x) - H(x))^2}{\sum_x(\hat{H}(x) - \langle \hat{H}\rangle)^2} $$

Look at Numbers

I've put the maps into a zip directory. These come from the "Get Data" vignette and include both population maps and raster maps. You'll see that they have a regular naming scheme.

Load Datasets

Load population maps. They will be in lat-long. This chunk will print the available maps. The original set are "HRSL on HRSL fine," "LandScan on LandScan coarse," and "WorldPop on WorldPop fine."

data_dir <- rprojroot::find_rstudio_root_file("inst/extdata")
map_root <- fs::path(data_dir, "aligned")
files <- list.files(map_root, pattern = "*.tif$")
files_df <- popbioko::filenames_to_description(files)
maps <- lapply(files, function(x) raster::raster(fs::path(map_root, x)))
names(maps) <- rownames(files_df)
files_df

Load density maps. They will be in UTM projection.

density_dir <- fs::path(data_dir, "density_raster")
density <- lapply(files_df$filename, function(x) raster::raster(fs::path(density_dir, x)))
names(density) <- names(maps)

In order to calculate the GOFR on smaller regions, we'll need shapefiles, too. The incoming shapefile is in UTM, which is approprate for overlay on the density maps but not the population maps, which are lat-long, so convert it.

# The input map is in UTM, so we transform it to lat-long in order to agree
# with the raster coordinates of all rasters.
admin2_projected_sf <- sf::st_read(rprojroot::find_package_root_file(
  "inst/extdata/admin2/bioko_admin2_fulldistricts.shp"))
bioko_sf <- sf::st_read(rprojroot::find_package_root_file(
  "inst/extdata/source/bioko.shp"))
admin2_projected_st <- as(sf::st_geometry(admin2_projected_sf), "Spatial")
admin2_sf <- sf::st_transform(admin2_projected_sf, raster::crs(maps[1][[1]]))
admin2_st <- as(sf::st_geometry(admin2_sf), "Spatial")
# This associates an index with an admin2 region name.
admin2_st_df <- as.data.frame(as(admin2_sf[, c("OBJECTID", "admin2")], "Spatial"))
tmap::tm_shape(admin2_sf) + tmap::tm_polygons("admin2")

GOFR Runner

These functions run the calculations. We put this here to get it out of the way. It's all about intersecting maps and organizing the data.

# Makes a data frame that lists each GOFR row we need to calculate.
blank_gofr_list <- function() {
  gofr <- files_df[files_df$source != "BIMEP",]
  rownames(gofr) <- paste(gofr$source, gofr$resolution)
  gofr$Bioko <- numeric(nrow(gofr))
  # We will calculate GOFR for each admin2, in addition to the whole island.
  for (admin_idx in 1:nrow(admin2_st_df)) {
    gofr[[admin2_st_df[admin_idx, "admin2"]]] <- numeric(nrow(gofr))
  }
  gofr
}


#' Checks that the relative sizes of the rasterized admin units matches
#' what we expect. That way we know they aren't ordered by some random 1-4.
check_rasterize <- function(values) {
  size_order <- c("Riaba", "Baney", "Malabo", "Luba")
  id_order <- admin2_st_df[match(size_order, admin2_st_df$admin2), "OBJECTID"]
  sizes <- unlist(lapply(seq(4), function(x) sum(values == x, na.rm = TRUE)))
  expected <- sizes[id_order]
  stopifnot(expected == sort(expected))
}


#' The bimep map should be nearly identical to the pops map in terms
#' of size and NA location because it was built to be that way.
check_maps_match <- function(pops_map, pops_bimep, gofr_pair_df) {
  stopifnot(length(pops_bimep) == length(pops_map))
  mapna <- is.na(pops_map)
  bmna <- is.na(pops_bimep)
  difference_cnt <- sum(mapna != bmna)
  total_finite <- length(pops_map) - sum(mapna | bmna)
  only_map <- sum(mapna & !bmna)
  only_bimep <- sum(bmna & !mapna)
  both_na <- sum(mapna & bmna)
  # There are some pixels by the seashore, and that's OK because it's
  # outside significant digits.
  if ((difference_cnt / total_finite) > 1) {
    print(gofr_pair_df)
    stop(paste(
      difference_cnt, "of the NA values don't match out of", total_finite,
      "with", both_na, "na values of which", only_map, "only in the map and",
      only_bimep, "in the gold"
      ))
  }
  list(
    bimep_finite = length(pops_bimep) - sum(bmna),
    bimep_na = sum(bmna),
    pops_finite = length(pops_map) - sum(mapna),
    pops_na = sum(mapna),
    only_bimep_na = only_bimep,
    only_pops_na = only_map,
    both_na = both_na
  )
}


#' If my map has a number where yours has an NA, then call yours
#' a zero. This goes both ways.
zero_where_disagree <- function(s1, s2) {
  s1na <- is.na(s1)
  s2na <- is.na(s2)
  s1[s1na & !s2na] <- 0
  s2[s2na & !s1na] <- 0
  list(bimep = s1, map = s2)
}


check_disagree <- function() {
  a <- c(1, 2, NA, NA)
  b <- c(3, NA, 4, NA)
  zwd <- zero_where_disagree(a, b)
  stopifnot(all(zwd[[1]][1:3] == c(1, 2, 0)))
  stopifnot(all(zwd[[2]][1:3] == c(3, 0, 4)))
  stopifnot(is.na(zwd[[1]][4]))
  stopifnot(is.na(zwd[[2]][4]))
}
check_disagree()


calculate_gofr <- function(normalize_gofr, gofr_map_type, gofr_algorithm) {
  # For each grid and resolution, there should be a map and a BIMEP version of that map.
  gofr <- blank_gofr_list()
  sse <- blank_gofr_list()
  den <- blank_gofr_list()

  for (gofr_make_idx in 1:nrow(gofr)) {
    gofr_source <- gofr[gofr_make_idx, "source"]
    gofr_grid <- gofr[gofr_make_idx, "grid"]
    gofr_resolution <- gofr[gofr_make_idx, "resolution"]
    bimep_name <- files_df[
      files_df$grid == gofr_grid & files_df$resolution == gofr_resolution & files_df$source == "BIMEP",
      "name"
      ]
    pops_bimep <- raster::getValues(gofr_map_type[bimep_name][[1]])
    compare_map <- gofr_map_type[gofr[gofr_make_idx, "name"]][[1]]
    pops_map <- raster::getValues(compare_map)

    # Bioko for the whole island.
    zwd <- zero_where_disagree(pops_bimep, pops_map)
    island <- gofr_algorithm(zwd$map, zwd$bimep, normalize_gofr)
    gofr$Bioko[gofr_make_idx] <- island$num / island$den
    sse$Bioko[gofr_make_idx] <- island$num
    den$Bioko[gofr_make_idx] <- island$den

    # Iterate through the four admin2 by ID.
    is_lonlat <- raster::isLonLat(raster::crs(gofr_map_type[[1]][[1]]))
    if (is_lonlat) {
      use_map <- admin2_st
    } else {
      use_map <- admin2_projected_st
    }
    # Identify pixels by the object that overlaps so we can pull them out below.
    segmented <- raster::getValues(raster::rasterize(
      use_map, compare_map, field = admin2_st_df$OBJECTID))
    check_rasterize(segmented)
    for (adm2_idx in 1:nrow(admin2_st_df)) {
      store_name <- admin2_st_df[adm2_idx, "admin2"]
      adm2_object_id <- admin2_st_df[adm2_idx, "OBJECTID"]
      sub_bimep <- pops_bimep[segmented == adm2_object_id]
      sub_pops <- pops_map[segmented == adm2_object_id]
      zwd <- zero_where_disagree(sub_bimep, sub_pops)
      gofr_admin <- gofr_algorithm(zwd$map, zwd$bimep, normalize_gofr)
      gofr[[store_name]][gofr_make_idx] <- gofr_admin$num / gofr_admin$den
      sse[[store_name]][gofr_make_idx] <- gofr_admin$num
      den[[store_name]][gofr_make_idx] <- gofr_admin$den
    }
  }
  list(gofr = gofr, sse = sse, den = den)
}

#' This makes a dataframe with both unnormalized and normalized gofr, ready to print.
#' @param gofr_map_type These are either the count maps or the density maps.
#' @param gofr_algorithm This is one of the functions below to calculate SSE and
#'    variance in different ways.
#' @param map_choice A list of which of the items in the gofr_map_type to run against.
gofr_table <- function(gofr_map_type, gofr_algorithm, map_names) {
  gofr_absolute <- calculate_gofr(FALSE, gofr_map_type, gofr_algorithm)
  gofr_normalized <- calculate_gofr(TRUE, gofr_map_type, gofr_algorithm)
  cols <- c("source", "resolution", "Bioko", "Baney", "Luba", "Malabo", "Riaba")
  if (is.null(map_names)) {
    map_choice <- 1:length(gofr_absolute$gofr)
  } else {
    map_choice <- vapply(1:length(map_names),
                         function(i) which(gofr_absolute$gofr$name == map_names[i]),
                         FUN.VALUE = integer(1))
  }
  cat(paste("gofr names", paste(row.names(gofr_absolute), collapse = ",")))
  gofr_table <- rbind(
    gofr_absolute$sse[map_choice, cols],
    gofr_absolute$den[map_choice, cols],
    gofr_absolute$gofr[map_choice, cols],
    gofr_normalized$gofr[map_choice, cols]
    )
  kinds <- rep(c("SSE", "Var", "absolute", "normed"), each = length(map_names))
  row.names(gofr_table) <- paste(kinds, gofr_table$source, gofr_table$resolution)
  gofr_table[, !names(gofr_table) %in% c("grid", "filename", "name")]
}

Just to make sure the world makes sense, let's take a look at the maps in order to check that the BIMEP and the given map agree nicely.

examine_map_matches <- function(gofr_map_type) {
  # For each grid and resolution, there should be a map and a BIMEP version of that map.
  gofr <- blank_gofr_list()

  results <- list()
  for (gofr_make_idx in 1:nrow(gofr)) {
    gofr_source <- gofr[gofr_make_idx, "source"]
    gofr_grid <- gofr[gofr_make_idx, "grid"]
    gofr_resolution <- gofr[gofr_make_idx, "resolution"]
    bimep_name <- files_df[
      files_df$grid == gofr_grid & files_df$resolution == gofr_resolution & files_df$source == "BIMEP",
      "name"
      ]
    pops_bimep <- raster::getValues(gofr_map_type[bimep_name][[1]])
    compare_map <- gofr_map_type[gofr[gofr_make_idx, "name"]][[1]]
    pops_map <- raster::getValues(compare_map)
    matchy <- check_maps_match(pops_map, pops_bimep, gofr[gofr_make_idx,])
    matchy["source"] <- gofr_source
    matchy["resolution"] <- gofr_resolution
    results[[gofr_make_idx]] <- matchy
  }
  data.frame(do.call(rbind, results))
}
rmm <- examine_map_matches(maps)
rmm["kind"] <- "pop"
rmd <- examine_map_matches(density)
rmd["kind"] <- "dens"

That isn't printing well, but what we see are low numbers of differences except for LandScan fine, which I created by disaggregating LandScan, so that will mean a large seashore pixel will become fine pixels, which are farther from the shore. The count is 5656 / 227793, so it's 2.5 percent. We aren't using that one for GOFR, so I'll allow it.

WorldPop has about 2500 / 227804 that disagree about NA values. That's one percent. It looks like there is disagreement about the island boundary in the Southwest. That's why it gets as high as one percent.

Run Several GOFR Algorithms

Here, you can run whatever you want.

#' This is the one we used before.
gofr_of_maps_original <- function(pops_map, pops_bimep, normalize = FALSE) {
  if (normalize) {
    pops_bimep <- pops_bimep / sum(pops_bimep, na.rm = TRUE)
    pops_map <- pops_map / sum(pops_map, na.rm = TRUE)
  }
  pixel_cnt <- sum(is.finite(pops_map))
  # Ignore NA because they tell us where there is water. Rasters are designed to agree on water location.
  pixel_error <- sum((pops_map - pops_bimep)^2, na.rm = TRUE)
  list(
    num = pixel_error,
    den = sum((pops_map - mean(pops_bimep, na.rm = TRUE))^2, na.rm = TRUE),
    pixel_cnt = pixel_cnt
  )
}


relerr <- function(observed, expected) ((observed - expected) / expected)


# This divides sum of squared error by variance of the gold standard
gofr_of_maps_gold_variance <- function(pops_map, pops_bimep, normalize = FALSE) {
  if (normalize) {
    pops_bimep <- pops_bimep / sum(pops_bimep, na.rm = TRUE)
    pops_map <- pops_map / sum(pops_map, na.rm = TRUE)
  }
  pixel_cnt <- sum(is.finite(pops_map))
  # Ignore NA because they tell us where there is water. Rasters are designed to agree on water location.
  pixel_error <- sum((pops_map - pops_bimep)^2, na.rm = TRUE)
  sample_variance <- var(pops_bimep, na.rm = TRUE) * (pixel_cnt - 1)
  similar_variance <- sum((pops_bimep - mean(pops_bimep, na.rm = TRUE))^2, na.rm = TRUE)
  stopifnot(abs(relerr(similar_variance, sample_variance)) < 1e-4)
  list(
    num = pixel_error,
    den = sample_variance,
    pixel_cnt = pixel_cnt
  )
}

# This divides sum of squared error by variance of the map under observation.
gofr_of_maps_observed_variance <- function(pops_map, pops_bimep, normalize = FALSE) {
  if (normalize) {
    pops_bimep <- pops_bimep / sum(pops_bimep, na.rm = TRUE)
    pops_map <- pops_map / sum(pops_map, na.rm = TRUE)
  }
  pixel_cnt <- sum(is.finite(pops_map))
  # Ignore NA because they tell us where there is water. Rasters are designed to agree on water location.
  pixel_error <- sum((pops_map - pops_bimep)^2, na.rm = TRUE)
  list(
    num = pixel_error,
    den = var(pops_map, na.rm = TRUE) * pixel_cnt,
    pixel_cnt = pixel_cnt
  )
}

# Each map at its original resolution. Not BIMEP because we compare
# against BIMEP.
original_resolutions <- c(
  "HRSL on HRSL fine", "GPW on LandScan coarse",
  "LandScan on LandScan coarse", "WorldPop-C on WorldPop fine",
  "WorldPop-U on WorldPop fine")
coarse_resolutions <- gsub("fine", "coarse", original_resolutions[grep("fine", original_resolutions)])

Maps with Original Resolutions, some fine, some coarse

Maps of counts

These tables:

  1. Three rows of un-normalized sum-of-squared errors.
  2. Three rows of GOFR using absolute values.
  3. Three rows of GOFR using normalized values.
# Set to maps or density for original map or density surface.
gt <- gofr_table(maps, gofr_of_maps_original, original_resolutions)
#(gt[1:6, ])
gt[7:nrow(gt), ]
gt[order(paste(gt$source, gt$resolution)),]

This is part of the paper graphic. original resolutions for all. maps of counts, the gold-standard variance.

original_paper_gofr <- gofr_table(maps, gofr_of_maps_gold_variance, original_resolutions)
original_paper_gofr
gofr_table(maps, gofr_of_maps_observed_variance, original_resolutions)

Maps of densities

Now do the same for densities.

gofr_table(density, gofr_of_maps_original, original_resolutions)
gofr_table(density, gofr_of_maps_gold_variance, original_resolutions)
gofr_table(density, gofr_of_maps_observed_variance, original_resolutions)

Maps with coarse resolutions

Maps of counts

gofr_table(maps, gofr_of_maps_original, coarse_resolutions)

This is the one that goes in the paper. maps of counts, the gold-standard variance.

gt <- gofr_table(maps, gofr_of_maps_gold_variance, coarse_resolutions)
gt

Use this to make the table:

paper_gofr_all <- rbind(original_paper_gofr, gt)
coarse_gofr <- paper_gofr_all[paper_gofr_all$resolution == "coarse",]
coarse_gofr
absolute_gofr <- coarse_gofr[startsWith(row.names(coarse_gofr), "absolute"),]
absolute_gofr
normed_gofr <- coarse_gofr[startsWith(row.names(coarse_gofr), "normed"),]
paper_gofr_all
source_order <- c("LandScan", "HRSL", "WorldPop-C", "WorldPop-U", "GPW")
source_idx <- vapply(1:length(source_order), function(i) which(absolute_gofr$source == source_order[i]), FUN.VALUE = integer(1))
correct_order <- rbind(absolute_gofr[source_idx,], normed_gofr[source_idx,])
short_names = c(LandScan = "LS", HRSL = "HRSL", "WorldPop-C" = "WP-C", "WorldPop-U" = "WP-U", GPW = "GPW")
correct_order$name <- short_names[correct_order$source]
for_show <- correct_order[, c("name", "Bioko", "Baney", "Luba", "Malabo", "Riaba")]
row.names(for_show) <- NULL
t(for_show)

knitr::kable(t(for_show), format = "latex") knitr::kable(pop_coarse_gold[, !names(pop_coarse_gold) %in% "resolution"], format = "latex").

gofr_table(maps, gofr_of_maps_observed_variance, coarse_resolutions)

Maps of densities

Now do the same for densities.

gofr_table(density, gofr_of_maps_original, coarse_resolutions)
gofr_table(density, gofr_of_maps_gold_variance, coarse_resolutions)
gofr_table(density, gofr_of_maps_observed_variance, coarse_resolutions)

Variances are Very Different

Let's plot histograms of the different BIMEP maps.

choices <- c("BIMEP on HRSL coarse", "BIMEP on LandScan coarse", "BIMEP on WorldPop coarse")
plots <- list()
for (choice in 1:length(choices)) {
  v1 <- raster::getValues(maps[[choices[choice]]])
  v1_finite <- v1[is.finite(v1)]
  v1_nonzero <- v1_finite[v1_finite > 0]
  df <- data.frame(value = v1_nonzero)
  plots[[choice]] <- ggplot(df, aes(x = value)) +
    stat_density(aes(y=..count..), color = "black", fill = NA) +
    scale_x_continuous(trans = "log10")
}
plots[[1]]
plots[[2]]
plots[[3]]
shift_cnt = 10
shifts <- (
  t(matrix(c(rep(1:shift_cnt, shift_cnt), rep(1:shift_cnt, each = shift_cnt)), ncol = 2)) - 1) /
  shift_cnt
grid <- maps[["LandScan on LandScan coarse"]]
# This is the geolocated data minus outliers to the admin unit.
geolocated <- sf::st_read(rprojroot::find_package_root_file("inst/extdata/bimep_gps.shp"))
variances <- numeric(ncol(shifts))
for (shift_idx in 1:ncol(shifts)) {
  dr <- shifts[, shift_idx] * c(raster::xres(grid), raster::yres(grid))
  shifted_grid <- raster::shift(grid, dx = dr[1], dy = dr[2])
  gridded <- geolocated_on_grid(shifted_grid, geolocated, bioko_sf)
  variances[shift_idx] <- var(raster::getValues(gridded), na.rm = TRUE)
}
var_df <- data.frame(variance = variances)
ggplot(var_df, aes(x = variance)) +
  stat_density(aes(y = ..count..))
ggplot(var_df, aes(x = variance)) +
  geom_histogram()
(c(min(variances), max(variances), mean(variances), sd(variances)))
(max(variances) - min(variances)) / mean(variances)
plot(gridded)

SSD Profiles (From Dave's Notes)

This section is included from the Accuracy Metrics vignette. I don't have the data here, but the notes might help somebody.

Population distributions describe how many people live in each place. For the following, let $P_i (x)$ denote the value of the i$^{th}$ distribution map at a point in space $x$ defined over a space, $X$, where $X$ is the area grid or the sectors grid. We consider comparisons of maps over the same grid space $X$. To do so, we based comparisons on two aspects. First, we measured accuracy by land area and by population density. The difference between land area and population density is that the former measures the distribution of population density of the units, while the second measures the same population density weighted by population. In other words, the accuracy of land area summaries of human population distributions look at the distribution of the population density over land (\textit{i.e.} How is the population distributed at a typical point in space?) whereas the accuracy of population density summaries of human population distributions look at the distribution of the properties of the people (\textit{i.e.} How is population distributed for a typical person?). We considered the cumulative distribution function (CDF), $F_i$, where $y$ is sorted for some particular map over the range of all the values in $P_i(x)$ for $x \in X$.

%We compare two distributions using the Cramer-von Mises (CvM) test: %$$\sum_{y} \left(F_i\left(y\right)-F_j\left(y\right)\right)^2 w(y) dx. $$ %The CvM distribution test is a useful first-order metric for measuring accuracy. In general, we will take $w(y)=1$, but we can also choose $w(y)$ weights to emphasize the accuracy of maps in rural or urban settings.

Another simple measure of the difference between two maps is %the unweighted CvM, which is the sum of the sum of squared differences of distributions (SSD): $$\sum_{x\in X} \left(P_i \left(x\right)-P_j\left(x\right)\right)^2$$ One reference point for comparison is against a map that takes the mean value everywhere (i.e. an information-less map, in which each pixel has the same population density, the total population divided by total land area). The overall variance of the population distribution is: $$V = \sum_{x\in X} \left(P_i \left(x\right)-\left\right)^2 $$ By comparison, we can examine the SSD of any map and the gold standard, $P"$: $$D = \sum_{x\in X} \left(P_i \left(x\right)-P"\left(x\right)\right)^2 $$ Maps that are close to the truth will have a lower SSD. A single number summary is the ratio of the SSD to the total variance, $D/V$. We call $D/V$ the “SSD ratio”, noting it is possible for a map to be so bad that $D/V > 1$. To set some standards for interpreting metrics, we compared the SSD ratio of population maps at 1x1 km resolution to population maps aggregated at 5x5 km, as well as at the community and district levels, and for the whole island. various levels administrative units. By comparing the SSD by granularity, there is a basis for assessing the accuracy of maps developed using different methodologies. We can use the deviance test for a set of nested population surfaces to compare how well some particular high resolution map compares to a coarser version of the map. For example, how much more accurate is a 1x1 km map than a map of population by administrative unit? For any set of nested maps, we can compare the accuracy gain by granularity by looking at the deviance ratio for successively mapped populations.

With a gold standard, the weighted SSD can be used to develop a tailored measure of accuracy: $$D_w = \sum_{x\in X} \left(P_i \left(x\right)-P"\left(x\right)\right)^2 w(P"(x)) $$ These two different metrics can be used diagnostically: when an accurate map is misaligned, it will have the proper CDF, but it could appear to be highly inaccurate by the SSD ratio test.

##### GET RELATIVE POPULATION FIGURES
mcpoptot <- sum(popk$pop1k)
wppoptot <- sum(popk$wppop)
lspoptot <- sum(popk$lspop)
fbpoptot <- sum(popk$popfb)
popk$pop1kp <- popk$pop1k / mcpoptot
popk$wppopp <- popk$wppop / wppoptot
popk$lspopp <- popk$lspop / lspoptot
popk$fbpopp <- popk$popfb / fbpoptot

##### BUILD THE 5KM GRID

km5 <- read.csv(fs::path(params$source_data, "5x5grid.csv"), sep = ",")
x <- c("areaId", "fiveId")
names(km5) <- x

popk <- merge(popk, km5, by = "areaId", all.x = T)

pop5k <- aggregate(list(mcpop = popk$pop1kp, wppop = popk$wppopp, lspop = popk$lspopp, popfb = popk$fbpopp),
  by = list(fiveId = popk$fiveId),
  FUN = mean
)

popk <- merge(popk, list(
  fiveId = pop5k$fiveId, mcpop5 = pop5k$mcpop, wppop5 = pop5k$wppop, lspop5 = pop5k$lspop,
  fbpop5 = pop5k$popfb
), by = "fiveId", all.x = T)

##### BUILD THE ADMIN4 LAYER

ad4 <- read.csv(fs::path(params$source_data, "AreaByComm.csv"), sep = ",")
x <- c("areaId", "admin4", "admin4Id")
names(ad4) <- x
ad4_2 <- read.csv("AreaByNPad4.csv", sep = ",") ### missed points due to National Parks, except for 3 (areaId 94, 1025 and 2136)
x <- c("areaId", "admin4Id")
names(ad4_2) <- x
ad4_2$admin4 <- "NP-corrected"
ad4_2$admin4Id <- ifelse(ad4_2$admin4Id == "", "NN", as.character(ad4_2$admin4Id))

ad4_3 <- rbind(ad4, ad4_2)
ad4_3$dup <- duplicated(ad4_3$areaId)
ad4_3 <- ad4_3[-which(ad4_3$dup == T), ]

popk <- merge(popk, ad4_3[, 1:3], by = "areaId", all.x = T)

popad4 <- aggregate(list(mcpop = popk$pop1kp, wppop = popk$wppopp, lspop = popk$lspopp, popfb = popk$fbpopp),
  by = list(admin4Id = popk$admin4Id),
  FUN = mean
)

popk <- merge(popk, list(
  admin4Id = popad4$admin4Id, mcpopad4 = popad4$mcpop, wppopad4 = popad4$wppop, lspopad4 = popad4$lspop,
  fbpopad4 = popad4$popfb
), by = "admin4Id", all.x = T)

##### BUILD THE ADMIN2 LAYER

ad2 <- read.csv(fs::path(params$source_data, "AreaByDistrict.csv"), sep = ",")
x <- c("areaId", "admin2", "admin2Id", "admin2type")
names(ad2) <- x

popk <- merge(popk, ad2, by = "areaId", all.x = T)

popad2 <- aggregate(list(mcpop = popk$pop1kp, wppop = popk$wppopp, lspop = popk$lspopp, popfb = popk$fbpopp),
  by = list(admin2 = popk$admin2),
  FUN = mean
)

popk <- merge(popk, list(
  admin2 = popad2$admin2,
  mcpopad2 = popad2$mcpop,
  wppopad2 = popad2$wppop,
  lspopad2 = popad2$lspop,
  fbpopad2 = popad2$popfb
), by = "admin2", all.x = T)
ssdRatioF <- function(map, gold, norm = TRUE) {
  # 0 is perfect
  # <1 implies the map is an improvement
  # >1 is awful
  # Inf assigns values to empty places
  # -1
  if (sum(gold) > 0) {
    if (norm == TRUE) {
      gold <- gold / sum(gold)
      map <- map / sum(map)
    }
    R <- sum((map - gold)^2) / sum((gold - mean(gold))^2)
  } else {
    if (sum(map) == 0) {
      R <- 0
    } else {
      R <- -log10(sum(map)) # Inf
    }
  }
  return(R)
}

ssdRatio <- function(map, gold, P = NULL, norm = TRUE) {
  if (is.null(P)) {
    RR <- ssdRatioF(map, gold, norm)
  } else {
    ids <- unique(P)
    L <- length(ids)
    RR <- rep(-1, L)
    for (i in 1:L) {
      ix <- which(ids[i] == P)
      if (length(ix) > 0) RR[i] <- ssdRatioF(map[ix], gold[ix], norm)
    }
    RR <- data.frame(RR, names = ids)
  }
  return(RR)
}

ssdRatioT <- function(map, gold, P = NULL) {
  gold <- gold / sum(gold)
  map <- map / sum(map)
  if (is.null(P)) {
    goldMeans <- mean(gold)
  } else {
    goldMeans <- 0 * gold
    ids <- unique(P)
    L <- length(ids)
    for (i in 1:L) {
      ix <- which(ids[i] == P)
      if (length(ix) > 0) {
        goldMeans[ix] <- mean(gold[ix])
      }
    }
  }
  sum((map - gold)^2) / sum((gold - goldMeans)^2)
}
normed <- c(fb = ssdRatio(popk$popfb, popk$pop1k), wp = ssdRatio(popk$wppop, popk$pop1k), ls = ssdRatio(popk$lspop, popk$pop1k), bimep = ssdRatio(popk$pop1k, popk$pop1k), std = ssdRatio(mean(popk$pop1k) + 0 * popk$pop1k, popk$pop1k))

asis <- c(fb = ssdRatio(popk$popfb, popk$pop1k, norm = FALSE), wp = ssdRatio(popk$wppop, popk$pop1k, norm = FALSE), ls = ssdRatio(popk$lspop, popk$pop1k, norm = FALSE), bimep = ssdRatio(popk$pop1k, popk$pop1k, norm = FALSE), std = ssdRatio(mean(popk$pop1k) + 0 * popk$pop1k, popk$pop1k, norm = FALSE))
cbind(normed, asis)
ix <- which(popk$lspop > 20000)
c(
  ssdRatio(popk$lspop[-ix], popk$pop1k[-ix]),
  ssdRatio(popk$lspop[-ix], popk$pop1k[-ix], popk$admin2)
)
length(ix)
c(
  ssdRatio(popk$lspop[-ix], popk$pop1k[-ix], norm = FALSE),
  ssdRatio(popk$lspop[-ix], popk$pop1k[-ix], popk$admin2, norm = FALSE)
)
length(ix)
A <- data.frame(cbind(
  "HRSL" = c(ssdRatio(popk$popfb, popk$pop1k), ssdRatio(popk$popfb, popk$pop1k, popk$admin2)[1:5, 1]),
  "Landscan" = c(ssdRatio(popk$lspop, popk$pop1k), ssdRatio(popk$lspop, popk$pop1k, popk$admin2)[1:5, 1]),
  "WorldPOP" = c(ssdRatio(popk$wppop, popk$pop1k), ssdRatio(popk$wppop, popk$pop1k, popk$admin2)[1:5, 1])
),
row.names = paste(c("Bioko Island", paste(ssdRatio(popk$wppop, popk$pop1k, popk$admin2)[1:5, 2])))
)
A[6, ] <- -A[6, ]
B <- data.frame(cbind(
  "HRSL" = c(ssdRatio(popk$popfb, popk$pop1k, norm = FALSE), ssdRatio(popk$popfb, popk$pop1k, popk$admin2, norm = FALSE)[1:5, 1]),
  "Landscan" = c(ssdRatio(popk$lspop, popk$pop1k, norm = FALSE), ssdRatio(popk$lspop, popk$pop1k, popk$admin2, norm = FALSE)[1:5, 1]),
  "WorldPOP" = c(ssdRatio(popk$wppop, popk$pop1k, norm = FALSE), ssdRatio(popk$wppop, popk$pop1k, popk$admin2, norm = FALSE)[1:5, 1])
),
row.names = paste(c("Bioko Island", paste(ssdRatio(popk$wppop, popk$pop1k, popk$admin2)[1:5, 2])))
)
B[6, ] <- -B[6, ]
B
signif(cbind(B, A), 3)
Bioko <- c(
  ssdRatioT(popk$popfb, popk$pop1k),
  ssdRatioT(popk$lspop, popk$pop1k),
  ssdRatioF(popk$wppop, popk$pop1k)
)

Admin2 <- c(
  ssdRatioT(popk$popfb, popk$pop1k, popk$admin2),
  ssdRatioT(popk$lspop, popk$pop1k, popk$admin2),
  ssdRatioT(popk$wppop, popk$pop1k, popk$admin2)
)



fiveID <- c(
  ssdRatioT(popk$popfb, popk$pop1k, popk$fiveId),
  ssdRatioT(popk$lspop, popk$pop1k, popk$fiveId),
  ssdRatioT(popk$wppop, popk$pop1k, popk$fiveId)
)


Admin4 <- c(
  ssdRatioT(popk$popfb, popk$pop1k, popk$admin4),
  ssdRatioT(popk$lspop, popk$pop1k, popk$admin4),
  ssdRatioT(popk$wppop, popk$pop1k, popk$admin4)
)


data.frame(cbind(Admin4, fiveID, Admin2, Bioko), row.names = c("Facebook", "Landscan", "WorldPOP"))
# data.frame(cbind(fiveID, Admin2, Bioko), row.names = c("Facebook", "Landscan", "WorldPOP"))


dd-harp/population_comparison_bioko documentation built on Feb. 28, 2021, 11:05 p.m.