In dd-harp/population_comparison_bioko: Compares population estimates for Bioko Island, Equatorial Guinea

From r system("git config --get remote.origin.url", intern = TRUE) on r date(), generated by r Sys.info()[["effective_user"]].

Overview

Both BIMEP gold standard population data and maps from LandScan, WorldPop, and HRSL, were aligned and formatted by the get-data vignette. This notebook reads those datasets and calculates metrics from a population density map by comparing them with the BIMEP map.

library(dplyr)
library(ggplot2)
library(maptools)
library(plotrix)
library(sf)
library(sp)
library(spatstat)
#library(viridis) # Crashing with R 4.0.0.

library(popbioko)

knitr::opts_chunk$set(
  dpi = 150
)

Maps

The maps were aligned and formatted by the "get-data" vignette. We load them all at once here for convenience. The naming scheme is dataset-on-grid, where grids are 100m or 1km, according to where they come from.

These maps are estimates of the number of people in each pixel.

data_dir <- params$data_directory
map_root <- fs::path(data_dir, "aligned")
files <- list.files(map_root, pattern = "*.tif$")
files_df <- filenames_to_description(files)
maps <- lapply(files, function(x) raster::raster(fs::path(map_root, x)))
names(maps) <- rownames(files_df)
files_df

There are 16 rows for (5 data sources) x (2 resolutions each) x (data source and BIMEP comparison).

We will need some canonical projections, too. Universal Transverse Mercator (UTM) is a projection that will give us the island measured in meters.

utm_projection <- "+proj=utm +zone=32N +ellps=WGS84  +no_defs +units=m +datum=WGS84"
bioko_sf <- sf::st_read(fs::path(data_dir, "source", "bioko.shp"))

Summary Statistics

The first figure is summary statistics, done at the pixel level.

Urban Fraction

How should we calculate urban fraction? There isn't a single, standard definition. Countries each choose their own ways to assess the important demographic movement from rural to urban. These assessments often combine population density, administrative boundaries, resource availability (like sewer and water), and functional use patterns. We have one kind of data here, population density, and we're interested in how this data contributes to more nuanced urban metrics, but let's stick to density.

Equatorial Guinea defines urban population density as 1500 people per square kilometer. That's 1500 people per pixel for a grid with 1 km pixels and 15 people per pixel for a grid with 100 m pixels. This is equivalent to r 1500 / (10^3)^2 people per meter squared.

We know we want to measure population density, but there are three ways to measure population density.

Count each pixel that reaches the threshold.
Count each pixel that has 1500 people within a circle whose area is a square kilometer.
Use a kernel density estimator to treat pixels as samples from a population density surface.

Raw Urban Fraction

We measure pixel threshold as a number of people per square meter. We need to know the size of each pixel in order to do that. It helps to know how many pixels we're talking about, so we return the numerator and denominator of each fraction.

We can reuse this function for the kernel density estimation by applying it to the estimated density instead of the raw density.

Kernel Density Estimation

Kernel density estimation is a way to determine density from a point set. We have a grid, not points, so we take a few steps.

Convert the grid into a rate per unit area.
Sample points from that grid using a poisson process for each cell.
Project the points into a plane measured in meters.
Estimate percent urban from the resulting points.

This work will use the raster and spatstat packages for statistics, and it will use the proj4 package to project the points from latitude-longitude to UTM.

pop_raster <- maps[["LandScan on LandScan coarse"]]
sum(raster::values(pop_raster) > 1500, na.rm = TRUE) / nrow(pop_raster) / ncol(pop_raster)

It will help to use the Bioko outlines for windowing in spatstat. You need to translate sf to sp to sp geometry to spatstat owin.

bioko_sp <- as(bioko_sf, Class = "Spatial")
bioko_sp_polygon <- as(bioko_sp, "SpatialPolygons")
# Cannot make an owin from an unprojected space. Apparently.
bioko_proj_sf <- sf::st_transform(bioko_sf, crs = utm_projection)
# The projected space will have points outside. Let's make a little
# buffer to catch those points.
meters <- 1
bioko_buffer_sf <- sf::st_buffer(bioko_proj_sf, 500 * meters)
bioko_proj_sp <- as(bioko_proj_sf, Class = "Spatial")
bioko_proj_polygon_sp <- as(bioko_proj_sp, "SpatialPolygons")
bioko_owin <- as.owin(bioko_proj_polygon_sp)

Landscan measures people in thousands, so urban is pixels above r 1500 / (10^3)^2 / 1000. The population density is on a map in meters, so let's count the pixel size in square meters.

Let's make population maps for every incoming map. This step approximates the map values by projecting them from lat-long to UTM coordinates. It uses bilinear interpolation. We avoid this kind of approximation through most of our work. This step is less sensitive to interpolation because it will construct point process models during the population_count_estimator.

kde_dir <- fs::path(params$data_directory, "kde_raster")
if (!dir.exists(kde_dir)) {
  dir.create(kde_dir)
}
for (map_idx in 1:nrow(files_df)) {
  filename <- fs::path(kde_dir, files_df[map_idx, "filename"])
  if (!file.exists(filename)) {
    # or method = "ngb" for nearest-neighbor
    proj_raster <- raster::projectRaster(maps[[map_idx]], crs = utm_projection, method = "bilinear")
    # Bilinear interpolation, which can cause negative values.
    proj_raster <- raster::clamp(proj_raster, lower = 0)
    pop_density_im <- population_count_estimator(proj_raster)
    pop_density_raster <- im_to_raster(pop_density_im, utm_projection)
    raster::writeRaster(pop_density_raster, filename = filename, format = "GTiff")
  }
}

This makes density maps in a way that's very clear. For each pixel, it averages density within a circular, 1 square km neighborhood.

density_dir <- fs::path(params$data_directory, "density_raster")
if (!dir.exists(density_dir)) {
  dir.create(density_dir)
}
for (map_idx in 1:nrow(files_df)) {
  filename <- fs::path(density_dir, files_df[map_idx, "filename"])
  if (!file.exists(filename)) {
    # or method = "ngb" for nearest-neighbor
    proj_raster <- raster::projectRaster(maps[[map_idx]], crs = utm_projection, method = "ngb")
    pop_density_raster <- density_from_disc(proj_raster)
    raster::writeRaster(pop_density_raster, filename = filename, format = "GTiff")
  }
}

The density maps are saved as GeoTIFFs in the density_raster subdirectory, so load them here before making the summary table. Loading from density_dir means we are using the disc.

density <- lapply(files_df$filename, function(x) raster::raster(fs::path(density_dir, x)))
names(density) <- names(maps)

Construct the Summary Statistics Table

This function makes the first table in the paper. It makes data for two urban densities, 1000 people per square km and 1500 people per square km. We use the former to compare with the current numbers, but the latter is what Equatorial Guinea uses to define urban.

summary_help <- function(urban_cutoff) {
  function(map_idx) {
    popbioko::summary_statistics(
      maps[[map_idx]], density[[map_idx]], urban_per_kilometer_sq = urban_cutoff
      )
  }
}


make_summary_df <- function(urban_cutoff) {
  summary_stats <- lapply(1:length(maps), summary_help(urban_cutoff))
  summary_list <- do.call(rbind, summary_stats)
  rownames(summary_list) <- names(maps)
  summary_df <- as.data.frame(summary_list)
  summary_df["name"] <- rownames(summary_list)
  summary_df <- merge(summary_df, files_df, by = "name")
}
summary1000_df <- make_summary_df(1000)
summary1500_df <- make_summary_df(1500)
names(summary1500_df)

Those dataframes are both strings and numeric. When we rotate them, they will be all strings, so let's do the conversion by hand.

for_show <- function(summary_df) {
  urban_raw_percent <- as.numeric(summary_df[, "urban_raw"]) * 100
  urban_percent <- as.numeric(summary_df[, "urban_fit"]) * 100
  data.frame(
    source = summary_df$source,
    grid = summary_df$grid,
    resolution = as.character(sprintf("%.1f", summary_df$side_meters)),
    total = sprintf("%.0f", summary_df$total),
    max_density = sprintf("%.0f", summary_df$max_density),
    empty_percent = sprintf("%.4g", summary_df$empty_percent),
    na_percent = sprintf("%.4g", summary_df$na_percent),
    urban_raw = sprintf("%.4g", urban_raw_percent),
    urban_fit = sprintf("%.4g", urban_percent),
    pareto_fraction = sprintf("%.4g", 100 * as.numeric(summary_df$pareto_fraction)),
    stringsAsFactors = FALSE
  )
}
show1000_df <- for_show(summary1000_df)
show1500_df <- for_show(summary1500_df)

Comparison Tables

Compare two urban fractions

We're looking for trends that show there are fewer urban areas when we raise the urban criteria from 1000 people to 1500 people per square kilometer.

as.data.frame(t(show1000_df[show1000_df$grid == "HRSL", ]))

as.data.frame(t(show1500_df[show1500_df$grid == "HRSL", ]))

Compare maps of raw and estimated urban fractions

We would like to see whether the estimated urban fraction has a reasonable-looking gaussian kernel size.

sq_m_per <- c(square_meters_per_pixel.raster(maps[[2]]), square_meters_per_pixel.raster(density[[2]]))
sq_m_per

cutoffs <- data.frame(urban_per_meter_sq = c(1000, 1500) / 10^6)
cutoffs["raw"] <- cutoffs[["urban_per_meter_sq"]] * square_meters_per_pixel.raster(maps[[2]])
cutoffs["density"] <- cutoffs[["urban_per_meter_sq"]] * square_meters_per_pixel.raster(density[[2]])
cutoffs

Plot these with cutoffs to see that the maps are similar.

plot(maps[[2]] / cutoffs[1, "raw"])
plot(maps[[2]] / cutoffs[2, "raw"])
plot(density[[2]] / cutoffs[1, "density"])
plot(density[[2]] / cutoffs[2, "density"])

Compare across source grids

This shows all HRSL grid together, then all WorldPop, then all LandScan.

as.data.frame(t(show1000_df[show1000_df$grid == "HRSL", ]))

as.data.frame(t(show1000_df[show1000_df$grid == "WorldPop", ]))

as.data.frame(t(show1000_df[show1000_df$grid == "LandScan", ]))

Compare across resolutions

This groups data with the same resolution.

as.data.frame(t(show1000_df[as.numeric(show1000_df$resolution) < 500, ]))

as.data.frame(t(show1000_df[as.numeric(show1000_df$resolution) > 500, ]))

display_order <- data.frame(
  source = c("BIMEP", "HRSL", "LandScan", "WorldPop-C", "WorldPop-U", "GPW", "BIMEP", "HRSL", "WorldPop-C", "WorldPop-U"),
  granularity = c(rep("coarse", 6), rep("fine", 4)),
  grid = c("HRSL", "HRSL", "LandScan", "WorldPop", "WorldPop", "LandScan", "HRSL", "HRSL", "WorldPop", "WorldPop")
)
short_source <- c(
  BIMEP = "BIMEP", HRSL = "HRSL", LandScan = "LS", "WorldPop-C" = "WP-C",
  "WorldPop-U" = "WP-U", GPW = "GPW")
show1000_df$granularity <- ifelse(as.numeric(show1000_df$resolution) < 500, rep("fine", nrow(show1000_df)), rep("coarse", nrow(show1000_df)))
show1000_df
display_order
do1000_df <- merge(display_order, show1000_df, by = c("source", "granularity", "grid"), sort = TRUE)
stopifnot(nrow(do1000_df) == nrow(display_order))
ord1000_df <- do.call(rbind, lapply(1:nrow(display_order), function(row_idx) {
  with(do1000_df, do1000_df[source == display_order[row_idx, "source"] & granularity == display_order[row_idx, "granularity"], ])
}))
col_ord_df <- ord1000_df[, c("source", "resolution", "total", "max_density", "empty_percent", "urban_raw", "pareto_fraction")]
col_ord_df$source <- short_source[col_ord_df$source]
col_ord_df

Use this on command line: knitr::kable(t(col_ord_df), format = "latex")

Error Bounds

Carlos asked how much the exact alignment of the grid matters. The gold data is point data. Grids that are shifted slightly will have slightly different values in the tables. So let's shift each major grid a bunch of times and see what the values are.

# With side_len = 10, this can take two hours to run.
error_dir <- fs::path(params$data_directory, "errors")
if (!dir.exists(error_dir)) {
  dir.create(error_dir)
}
side_len <- 10
for (grid in c("landscan", "hrsl", "worldpop")) {
  stats_file <- fs::path(error_dir, paste(grid, "csv", sep = "."))
  if (!file.exists(stats_file)) {
    popbioko::summary_statistics_multiple_grids(grid, side_len, stats_file)
  } else {
    warning(paste("file already exists", stats_file))
  }
}

error_dir <- fs::path(params$data_directory, "errors")
csvs <- do.call(
  rbind,
  lapply(
    list.files(error_dir, pattern = ".csv$", full.names = TRUE),
    function(csv) {read.csv(csv, header = TRUE, stringsAsFactors = TRUE)}
    ),
)

I'm curious how much urban fit varies. What I see is that the mean doesn't change much but the more coarse grids have a larger standard deviation.

ggplot(csvs, aes(x = urban_fit, group = grid)) +
  geom_histogram(bins = 50) +
    facet_grid(grid ~ .)

The percent of NA values is the size of the border around the island. This is kind of a control. We see it doesn't vary a lot.

ggplot(csvs, aes(x = na_percent, group = grid)) +
  geom_histogram(bins = 200) +
    facet_grid(grid ~ .)

I don't know how I expect pareto fraction to vary as the grid size changes. I feel kind of surprised. I think this is a result of having small statistics and using a non-statistical estimator.

ggplot(csvs, aes(x = pareto_fraction, group = grid)) +
  geom_histogram(bins = 200) +
    facet_grid(grid ~ .)

The maximum pixel value, of the population, not the population density. It looks like there's more noise in the coarse grid.

ggplot(csvs, aes(x = max_density, group = grid)) +
  geom_histogram(bins = 200) +
    facet_grid(grid ~ .)

Let's look more closely by dividing by the mean in each case. We see that they all have about the same uncertainty in the maximum when you divide by the mean.

mean_max <- aggregate(max_density ~ grid, FUN = mean, data = csvs)
names(mean_max)[names(mean_max) %in% "max_density"] <- "mean_max"
with_mean <- merge(csvs, mean_max)
with_mean["nmax"] <- with_mean$max_density / with_mean$mean_max
ggplot(with_mean, aes(x = nmax, group = grid)) +
  geom_histogram(bins = 50) +
    facet_grid(grid ~ .)

OK, so the plots aren't terribly skewed. We should use regular confidence intervals instead of bootstrapping. Maybe small values should use a Wald. Let's make values with confidence intervals for each of these datasets and take a look.

add_ci_to_column <- function(df, column) {
  names(df)[names(df) %in% column] <- "target"
  counted <- aggregate(target ~ grid, data = df, FUN = function(x) sum(!is.na(x)))
  names(counted)[names(counted) %in% "target"] <- "n"

  mean_col <- aggregate(target ~ grid, FUN = mean, data = df)
  names(mean_col)[names(mean_col) %in% "target"] <- "mean_val"

  sd_col <- aggregate(target ~ grid, FUN = sd, data = df)
  names(sd_col)[names(sd_col) %in% "target"] <- "sd_val"

  with_mean <- merge(merge(counted, mean_col), sd_col)
  lower <- paste(column, "low", sep = "_")
  upper <- paste(column, "up", sep = "_")
  with_mean[lower] <- with(with_mean, mean_val - 1.96 * sd_val / sqrt(n))
  with_mean[upper] <- with(with_mean, mean_val + 1.96 * sd_val / sqrt(n))
  names(with_mean)[names(with_mean) %in% "mean_val"] <- column
  with_mean[names(with_mean) %in% c(column, lower, upper, "grid")]
}

add_sd_to_column <- function(df, column) {
  names(df)[names(df) %in% column] <- "target"
  counted <- aggregate(target ~ grid, data = df, FUN = function(x) sum(!is.na(x)))
  names(counted)[names(counted) %in% "target"] <- "n"

  mean_col <- aggregate(target ~ grid, FUN = mean, data = df)
  names(mean_col)[names(mean_col) %in% "target"] <- column

  sd_col <- aggregate(target ~ grid, FUN = sd, data = df)
  sd_name <- paste(column, "sd", sep = "_")
  names(sd_col)[names(sd_col) %in% "target"] <- sd_name
  together <- merge(mean_col, sd_col, by = "grid")
  together[names(together) %in% c(column, sd_name, "grid")]
}

zdf <- data.frame(
  rate = c(rnorm(10, mean = 0.5, sd = 0.2), rnorm(10, mean = 0.7, sd = 0.1)),
  grid = c(rep("HRSL", 10), rep("LandScan", 10)),
  stringsAsFactors = FALSE
)

# A little test
zzdf <- add_ci_to_column(zdf, "rate")
stopifnot(nrow(zzdf) == 2)
stopifnot(all(zzdf$rate_low < zzdf$rate))
stopifnot(all(zzdf$rate < zzdf$rate_up))

zzdf <- add_sd_to_column(zdf, "rate")
zzdf

consider_error <- c("max_density", "empty_percent", "urban_fit", "pareto_fraction")
stopifnot(all(consider_error %in% names(csvs)))
single_ci <- function(col) add_sd_to_column(csvs, col)
ans_df <- single_ci(consider_error[1])
for (col in consider_error[2:length(consider_error)]) {
  ans_df <- merge(ans_df, single_ci(col))
}
t(ans_df)

Discussion

Using estimation to find the urban area matters for the 100m data but not the 1km data.
I'm surprised at how much the BIMEP 100m numbers depend on which 100m grid is used. For instance, the BIMEP on HRSL has a higher empty percent than the others. I could understand this by regridding BIMEP data to a set of evenly-spaced grids.
The implied year for HRSL is the last census, which is older. The total populations are different enough, that maybe I should use a growth rate for correction.
The empty percent was very sensitive to my choice of lower bound for empty. Should I make a graph of how different datasets get close to zero? It's a problem for both WorldPop and LandScan. Let's go with 1 person in an area. It gets rid of clearly spurious nonzeros.
WorldPop provides an unscaled and a scaled version. We're using the scaled version. The unscaled looks a little cleaner, but the population is much too large. This seems like the choice that will lead to better numbers for WP.
The 1km data behaves like a smooth grid. The 100m and 30m data behave like points. I see that b/c the point smoothing makes a big difference on the urban numbers. At what resolution do we treat this data as point data?
Using a kernel density estimator gives the largest estimate for urban. The 1 km^2 disc seems to give the smallest.

Population Scatter Plots

This section makes figure 4, scatter plots of population data versus ground truth population data. The density plots use maps of either resolution to plot estimated density per square km. The population size plots instead use the count of people in a pixel.

Set up consistent colors for the different data sets.

sources_ordered <- c("BIMEP", "HRSL", "LandScan", "WorldPop-C", "WorldPop-U", "GPW")
# Viridis is causing problems in one installation, so let's make our own.
color_component <- function(x) sprintf("%02x", floor(ifelse(x >= 1.0, 255, x * 256.0)))
as_color <- function(xxx) paste(
  c("#", color_component(xxx[1]), color_component(xxx[2]), color_component(xxx[3])), collapse = "")
viridis_mock <- function(n) {
  float_colors <- read.csv(rprojroot::find_rstudio_root_file("vignettes/plasma.csv"))
  key_colors <- sapply(1:nrow(float_colors), function(n) as_color(as.numeric(float_colors[n, ])))
  colorRampPalette(key_colors)(n)
}
#viridis_colors <- viridis::plasma(8, alpha = 1, begin = 0, end = 1)[c(1, 3, 5, 7)]
viridis_colors <- viridis_mock(length(sources_ordered))

colors <- c(viridis_colors, grDevices::grey(0.7))
names(colors) <- c(sources_ordered, "grey")
logarithmic_axis_labels <- c(
  "1", "10", "100", "1,000", expression(10^4), expression(10^5), expression(10^6))

log_offset <- function(offset) {
  function(x) {
    log(x + offset)
  }
}

logoff <- log_offset(1)
symbol_size <- c(0.5, 0.25)
names(symbol_size) <- c("coarse", "fine")
symbol_choice <- c(19, 15)
names(symbol_choice) <- c("coarse", "fine")

check_name <- function(value, which, res, source) {
  if (length(value) != 1) {
    cat(sprintf("Could not find %s-%s-%s\n", which, res, source))
  }
  value
}

map_from_source_resolution <- function(source, resolution, map_list) {
  grid_to_use <- unique(files_df[files_df$source == source, "grid"])
  xy_df <- files_df[files_df$resolution == resolution & files_df$grid == grid_to_use, ]
  x_map_name <- check_name(xy_df[xy_df$source == "BIMEP", "name"], "bimep", resolution, source)
  y_map_name <- check_name(xy_df[xy_df$source == source, "name"], "y", resolution, source)
  gv <- function(a_map_name) {
    if (!a_map_name %in% names(map_list)) {
      msg <- sprintf("%s not in map names", a_map_name)
      cat(paste(msg, "\n"))
      stop(msg)
    }
    raster::getValues(map_list[[a_map_name]])
  }
  x <- gv(x_map_name)
  y <- gv(y_map_name)
  list(x = logoff(x), y = logoff(y))
}


#' This gets population counts, not density.
xy_from_source_resolution <- function(source, resolution) {
  map_from_source_resolution(source, resolution, maps)
}


density_from_source_resolution <- function(source, resolution) {
  map_from_source_resolution(source, resolution, density)
}


pop_scatter <- function(
  y_name, resolution, labels, title, maximum, background = NULL, counts = NULL
  ) {
  plot(
    vector(mode = "numeric", length = 0),
    xlim = c(0, maximum), ylim = c(0, maximum),
    xlab = labels[[1]], ylab = labels[[2]], main = title,
    xaxt = "n", yaxt = "n"
    )
  axis(1, logoff(10^c(0:5)), 10^c(0:5))
  axis(2, logoff(10^c(0:5)), 10^c(0:5))
  segments(logoff(0), logoff(0), logoff(10^5), logoff(10^5))
  segments(logoff(1), logoff(1), logoff(1), logoff(10^5), col = colors["grey"])
  segments(logoff(1), logoff(1), logoff(10^5), logoff(1), col = colors["grey"])

  if (is.null(counts)) {
    data_function <- density_from_source_resolution
  } else {
    data_function <- xy_from_source_resolution
  }

  for (plot_idx in 1:length(y_name)) {
    y_source <- y_name[plot_idx]
    y_resolution <- resolution[plot_idx]
    xy <- data_function(y_source, y_resolution)
    cex <- symbol_size[as.character(y_resolution)]
    pch <- symbol_choice[as.character(y_resolution)]
    if (resolution[plot_idx] == "coarse") {
      color <- colors[y_source]
    } else {
      color <- scales::alpha(colors[y_source], 0.2)
    }
    if (!is.null(background) & plot_idx %in% background) {
      color <- colors["grey"]
    }
    points(xy, col = color, cex = cex, pch = pch)
  }
}

density_label <- c("BIMEP Population Density", "Mapped Population Density")
maximum1k <- 1.01 * max(density_from_source_resolution("LandScan", "coarse")$y, na.rm = TRUE)

scatter1k_together <- function() {
  maximum1k <- 1.01 * max(density_from_source_resolution("LandScan", "coarse")$y, na.rm = TRUE)
  subjects <- c("HRSL", "LandScan", "WorldPop-C", "WorldPop-U",
                "GPW", "HRSL", "WorldPop-C", "WorldPop-U")
  uniq_cnt <- length(unique(subjects))
  resolution <- c(rep("coarse", uniq_cnt), rep("fine", length(subjects) - uniq_cnt))
  pop_scatter(subjects, resolution, density_label, "A", maximum1k)
  uniq <- unique(subjects)
  legend("topleft", legend =  uniq, col = colors[uniq],
             lty = 1, lwd = 2, cex = 1, bg = "white", bty = "n")
}
scatter1k_together()

scatter1k_ls <- function() pop_scatter(
  "LandScan", "coarse",
  density_label, "A) LandScan", maximum1k)
scatter1k_ls()

scatter1k_hrsl <- function() pop_scatter(
  rep("HRSL", 2), c("fine", "coarse"),
  density_label, "B) HRSL", maximum1k, background = 1)
scatter1k_hrsl()

scatter1k_wpc <- function() pop_scatter(
  rep("WorldPop-C", 2), c("fine", "coarse"),
  density_label, "C) WorldPop-C", maximum1k, background = 1)
scatter1k_wpc()

scatter1k_wpu <- function() pop_scatter(
  rep("WorldPop-U", 2), c("fine", "coarse"),
  density_label, "D) WorldPop-U", maximum1k, background = 1)
scatter1k_wpu()

scatter1k_gpw <- function() pop_scatter(
  rep("GPW", 2), c("fine", "coarse"),
  density_label, "E) GPW", maximum1k, background = 1)
scatter1k_gpw()

maximum100 <- 6.2
pop_scatter_label <- c("BIMEP 100x100 m", "Population Size")
scatter1k_hrsl_count <- function() pop_scatter(
  rep("HRSL", 1), c("fine"),
  pop_scatter_label, "F) HRSL, 100x100 m", maximum100, counts = TRUE)
scatter1k_hrsl_count()

maximum100 <- 6.2
pop_scatter_label <- c("BIMEP 100x100 m", "Population Size")
scatter1k_wpc_count <- function() pop_scatter(
  rep("WorldPop-C", 1), c("fine"),
  pop_scatter_label, "G) WP-C, 100x100 m", maximum100, counts = TRUE)
scatter1k_wpc_count()

maximum100 <- 6.2
pop_scatter_label <- c("BIMEP 100x100 m", "Population Size")
scatter1k_wpu_count <- function() pop_scatter(
  rep("WorldPop-U", 1), c("fine"),
  pop_scatter_label, "H) WP-U, 100x100 m", maximum100, counts = TRUE)
scatter1k_wpu_count()

make_figure <- function() {
  #par(mfrow = c(5, 2), mar = c(5, 4.2, 3, 2))
  par(mfrow = c(4, 2), mar = c(4, 3, 3, 2))
  # scatter1k_together()
  scatter1k_ls()
  scatter1k_hrsl()
  scatter1k_wpc()
  scatter1k_wpu()
  scatter1k_gpw()
  scatter1k_hrsl_count()
  scatter1k_wpc_count()
  scatter1k_wpu_count()
}
popbioko::save_plot(make_figure, "scatter1k")
make_figure()

Cumulative Maps

This section builds the cumulative maps in figure 5. It separates the work into one part to generate x and y data and one part to decide axes and such for that data.

linear_from_source_resolution <- function(source, resolution, map_list) {
  grid <- unique(files_df[files_df$source == source, "grid"])
  xy_df <- files_df[files_df$resolution == resolution & files_df$grid == grid, ]
  x_map_name <- xy_df[xy_df$source == "BIMEP", "name"]
  y_map_name <- xy_df[xy_df$source == source, "name"]
  x <- raster::getValues(map_list[[x_map_name]])
  y <- raster::getValues(map_list[[y_map_name]])
  list(x = x, y = y)
}

sources <- c("BIMEP", "HRSL", "LandScan", "WorldPop-C", "WorldPop-U", "GPW")

# This draws each line for the plots, taking care of color and line type.
lines_for_datasets <- function(datasets, linewidth) {
  for (data_idx in 1:length(datasets)) {
    lines(
      datasets[[data_idx]][["x"]],
      datasets[[data_idx]][["y"]],
      type = "l",
      col = colors[datasets[[data_idx]][["source"]]],
      lwd = linewidth,
      lty = ifelse(datasets[[data_idx]][["resolution"]] == "coarse", 1, 2)
      )
  }
}

cdfAreaLines <- function(H, source, resolution, linewidth = 2) {
  H <- sort(H)
  area <-  c(1:length(H)) / length(H)
  lines(
    area,
    cumsum(H) / sum(H),
    col = colors[source],
    lwd = linewidth,
    lty = ifelse(resolution == "coarse", 1, 2)
    )
}

cdfAreaFigure <- function(llwd = 1, label = "") {
  plot(
    vector(mode = "numeric", length = 0),
    xlim = c(0, 1), ylim = c(0, 1),
    xlab =  "% Area", ylab =  "Proportion of Population",
    main = label
    )
  sources <- c("BIMEP", "HRSL", "LandScan", "WorldPop-C", "WorldPop-U", "GPW")
  for (source in sources) {
    for (resolution in c("coarse", "fine")) {
      if (source == "BIMEP") {
        data <- linear_from_source_resolution("HRSL", resolution, density)$x
      } else {
        data <- linear_from_source_resolution(source, resolution, density)$y
      }
      cdfAreaLines(data, source, resolution, linewidth = llwd)
    }
  }
  legend("topleft", legend =  sources, col = colors[sources],
         lty = 1, lwd = 2, cex = 1)
}

cdfAreaFigure(2)

cdfDens <- function(H, log_base) {
  H <- sort(H)
  mx <- ceiling(log(max(H), base = log_base))

  c2 <- 1:mx
  ixz <- which(H == 0)
  Pz <- length(ixz)
  for (i in c2) {
    ix <- which(H < log_base^i) 
    Pz <- c(Pz, sum(H[ix]))
  }
  pop <- c(0, log_base^c2)
  cbind(pop, Pz)
}


logPopAreaLines <- function(H, source, resolution, log_base, linewidth = 2) {
  dd <- cdfDens(H,log_base)[-1,]
  P <- dd[,1]
  H <- dd[,2]
  l <- log10(max(H))

  lines(
    P,
    H / max(H),
    col = colors[source],
    lwd = linewidth,
    lty = ifelse(resolution == "coarse", 1, 2)
    )
}


logPopAreaFigure <- function(llwd = 1, log_base = 2, label = "") {
  plot(
    vector(mode = "numeric", length = 0),
    log = "x",
    xlim = c(1.05, 0.5 * 10^5),
    ylim = c(0, 1),
    xlab =  "log(Pop Dens)",
    ylab =  "Proportion of Population",
    xaxt = "n",
    main = label
    )

  log_max <- 0
  for (source in sources_ordered) {
    for (resolution in c("coarse", "fine")) {
      if (source == "BIMEP") {
        data <- linear_from_source_resolution("HRSL", resolution, density)$x
      } else {
        data <- linear_from_source_resolution(source, resolution, density)$y
      }
      log_max <- max(log_max, log10(max(data, na.rm = TRUE)))
      logPopAreaLines(data, source, resolution, log_base, linewidth = llwd)
    }
  }
  axis(1, 10^(0:(log_max + 1)), logarithmic_axis_labels[1:(log_max + 2)])
  legend("topleft", legend =  sources, col = colors[sources],
         lty = 1, lwd = 2, cex = 1)
}

logPopAreaFigure(2)

Now the third part of Figure 5, a density plot that is just density, not cumulative.

log_density <- function(H, base_power = 2) {
  H <- H[!is.na(H)]
  c2 <- ceiling(max(log(H[H > 0], base = base_power)))
  ixz <- which(H <= 0)
  upper <- 0
  if (length(ixz) > 0) {
    HH <- H[-ixz]
  } else {
    HH <- H
  }
  Pz <- length(ixz)
  for (i in 1:c2) {
    upper <- c(upper, base_power^(i - 1))
    ix <- which(HH < base_power^(i - 1))
    if (length(ix) > 0) {
      Pz <- c(Pz, sum(HH[ix]))
      HH <- HH[-ix]
    } else {
      Pz <- c(Pz, 0)
    }
  }
  while (Pz[length(Pz)] == 0) {
    Pz <- Pz[-length(Pz)]
    upper <- upper[-length(Pz)]
  }
  cbind(round(Pz), upper)
}

single_test <- linear_from_source_resolution("LandScan", "coarse", density)$y
dens_test <- log_density(single_test, base_power = 1.2)


density_kernel_data <- function(base_power, offset = 0) {
  datasets <- vector(mode = "list", length = 2 * length(sources))
  running_idx <- 1
  for (source in sources_ordered) {
    for (resolution in c("coarse", "fine")) {
      if (source == "BIMEP") {
        data <- linear_from_source_resolution("HRSL", resolution, density)$x
      } else {
        data <- linear_from_source_resolution(source, resolution, density)$y
      }
      D <- log_density(data, base_power)[-1, 1]
      l <- length(D)
      x <- base_power^c(c(1:(l + 1))) * base_power^offset
      y <- c(D / sum(data, na.rm = TRUE), 0)
      datasets[[running_idx]] <- list(source = source, resolution = resolution, x = x, y = y, l = l)
      running_idx <- running_idx + 1
    }
  }
  datasets
}


density_kernel_figure <- function(base_power, linewidth = 1, label = "") {
  datasets <- density_kernel_data(base_power)
  l = max(sapply(datasets, function(ds) ds$l))
  plot(
    vector(mode = "numeric", length = 0),
    xaxt = "n",
    log = "x",
    xlim = c(5, base_power^(l + 2)),
    ylim = c(0, 0.4),
    xlab =  "log(Pop Dens)", ylab =  "Proportion of Population",
    main = label
    )
  lines_for_datasets(datasets, linewidth)
  L <- 4
  axis(1, 10^(0:(L + 1)) + 0.5, logarithmic_axis_labels[1:(L + 2)])
  legend("topleft", legend =  sources, col = colors[sources],
         lty = 1, lwd = 2, cex = 1)
}

density_kernel_figure(1.2, 2, "C")

cdf_figure <- function() {
  par(mfrow = c(3, 1), mar = c(5, 5, 3, 2))
  cdfAreaFigure(2, label = "A")
  logPopAreaFigure(2, label = "B")
  density_kernel_figure(1.2, 2, "C")
}
save_plot(cdf_figure, "CDF")
cdf_figure()

Proportion, Accuracy, Recall, and Precision

Figure 6 is proportion, accuracy, recall, and precision.

Define the values

The following functions in compute the specificity and sensitivity of a map compared with a gold standard:

breakpoints <- c(
  empty = 0,
  rural1 = 1,
  rural2 = 50,
  periurban = 250,
  urban = 1000
)
break_colors <- c(
  empty = grey(0.95),
  rural1 = "ivory",
  rural2 = "lavender",
  periurban = "cornsilk",
  urban = "aliceblue"
)
stopifnot(all(names(breakpoints) == names(break_colors)))

lessThanCat <- function(tau, map, gold, km2 = 1) {
  tau <- tau * km2
  accuracy_profile(truth_table(function(observed) {observed < tau}, gold, map))
}

greaterThanCat <- function(tau, map, gold, km2 = 1) {
  tau <- tau * km2
  accuracy_profile(truth_table(function(observed) {tau <= observed}, gold, map))
}

betweenCat <- function(L, U, map, gold, km2 = 1) {
  L <- L * km2
  U <- U * km2
  # Choose comparison so that it is cadlag.
  accuracy_profile(truth_table(function(obs) {L <= obs & obs < U }, gold, map))
}

fullCat <- function(map, gold, km2 = 1) {
  with(as.list(breakpoints), {
    data.frame(as.vector(rbind(
      empty = lessThanCat(rural1, map, gold, km2),
      rural1 = betweenCat(rural1, rural2, map, gold, km2),
      rural2 = betweenCat(rural2, periurban, map, gold, km2),
      periurban = betweenCat(periurban, urban, map, gold, km2),
      urban = greaterThanCat(urban, map, gold, km2)
    )))
  })
}

Proportion

landAA <- function(H, breakpoints, area = 1) {
  stopifnot(class(breakpoints) == "numeric")
  breakpoints <- c(breakpoints, Inf) / area

  Pz <- rep(0, length(breakpoints))
  for (cut_idx in 1:length(breakpoints)) {
    ixz <- which(H < breakpoints[cut_idx])
    Pz[cut_idx] <- length(ixz)
    if (Pz[cut_idx] > 0) {
      H <- H[-ixz]
    }  # Nothing to remove
  }
  Pz / sum(Pz)
}

over_arrays <- function(transform) {
  datasets <- vector(mode = "list", length = 2 * length(sources))
  running_idx <- 1
  for (source in sources) {
    for (resolution in c("coarse", "fine")) {
      if (source == "BIMEP") {
        data <- linear_from_source_resolution("HRSL", resolution, density)$x
      } else {
        data <- linear_from_source_resolution(source, resolution, density)$y
      }
      datasets[[running_idx]] <- list(source = source, resolution = resolution, x = transform(data))
      running_idx <- running_idx + 1
    }
  }
  datasets
}

proportion_data <- function(breakpoints) {
  over_arrays(function(data) {
    landAA(data, breakpoints)
  })
}
example_data <- proportion_data(c(1, 50, 250, 1000))

#' Make shaded backgrounds for all four figures.
#' @param splits The sides of the boxes, where colors change.
make_boxes <- function(splits) {
  stopifnot(length(splits) == length(break_colors) + 1)
  boxes <- vector(mode = "list", length = length(break_colors))
  for (color_idx in 1:length(break_colors)) {
    boxes[[color_idx]] <- list(splits[color_idx], splits[color_idx + 1], break_colors[color_idx])
  }
  makebox <- function(x1, x2, color) {
    xx <- c(x1, x1, x2, x2)
    yy <- c(0,1,1,0)
    polygon(cbind(xx, yy), border = NA, col = color)
  }
  for (box_idx in 1:length(boxes)) do.call(makebox, boxes[[box_idx]])
}


pdfAreaFigure <- function(lb = "") {
  # Start from 2 because this function uses the upper cutoff.
  prop_data <- proportion_data(breakpoints[2:length(breakpoints)])
  plot(
    vector(mode = "numeric", length = 0),
    xlim = c(1, 7),
    ylim = c(0, 1),
    xaxt = "n",
    xlab = "",
    ylab = "Proportion",
    main = lb
    )
  axis(1, c(2, 2:5 + 0.5), c(0,1,50,250,1000))

  make_boxes(1:6 + 0.5)

  offset_idx <- 0
  y_data <- NULL
  for (find_source in sources_ordered) {
    for (data_idx in 1:length(prop_data)) {
      y_data <- with(prop_data[[data_idx]], {
        if ((source == find_source) & (resolution == "coarse")) {
          return(x)
        }
      })
      if (!is.null(y_data)) break
    }
    offset <- -.21 + 0.14 * offset_idx
    stopifnot(!is.null(y_data))
    lines(2:6 + offset, y_data, type = "h", lwd = 5, col = colors[find_source], lty = 1)
    offset_idx <- offset_idx + 1
  }

  legend("topright", legend = sources_ordered, col = colors[sources_ordered],
         lty = 1, lwd = 2, cex = 1, bg = "white")
  #mtext("A", side = 3, line = 0.2, outer = FALSE, cex = 1.6)
}

pdfAreaFigure()

Precision

Precision is also known as positive predictive value, so it's the ppv variable.

get_measure <- function(x, map, gold, measure) {
  if (x == 0) {
    lessThanCat(x, map, gold)[[measure]]
  } else {
    greaterThanCat(x, map, gold)[[measure]]
  }
}
getPPV <- function(x, map, gold) get_measure(x, map, gold, "ppv")

compare_sources <- sources[!sources %in% "BIMEP"]
ppv_data <- sapply(compare_sources, function(x) {
  linear_from_source_resolution(x, "coarse", density)
})
dimnames(ppv_data) <- list(c("x", "y"), compare_sources)
mn <- 1
ppv_mx <- 0.7 * max(ppv_data["y", "LandScan"][[1]], na.rm = TRUE)
tk <- c(breakpoints, 4000, 10000)

xx <- c(0, seq(sqrt(mn), sqrt(ppv_mx), length.out = 100)^2)
xxp <- sqrt(xx)
tk0 <- sqrt(tk)

pw <- 1 / 3.5
xx <- c(0, seq(mn^pw, ppv_mx^pw, length.out = 100)^(1 / pw))
xxp <- xx^pw
tk0 <- tk^pw

ppv_splits <- c(breakpoints, ppv_mx)^pw

generate_accuracy_y <- function(xx, measure) {
  ppv_y <- matrix(0, nrow = length(xx), ncol = length(compare_sources))
  for (source_idx in 1:length(compare_sources)) {
    source <- compare_sources[source_idx]
    gold <- ppv_data["x", source][[1]]
    observed <- ppv_data["y", source][[1]]
    ppv_y[1:length(xx), source_idx] <- as.vector(unlist(sapply(
      xx,
      function(x, map, gold) get_measure(x, map, gold, measure),
      map = observed,
      gold = gold
      )))
  }
  colnames(ppv_y) <- compare_sources
  ppv_y
}
ppv_y <- generate_accuracy_y(xx, "ppv")

PPVProfile <- function(ppv_y, name, lb = "") {
  plot(
    vector(mode = "numeric", length = 0),
    xlim = c(0, max(xxp)),
    ylim = c(0, 1),
    xaxt = "n",
    xlab = "Population Density",
    ylab = name,
    main = lb
    )
  axis(1, tk0, tk)

  make_boxes(ppv_splits)

  for (col_idx in 1:ncol(ppv_y)) {
    source <- colnames(ppv_y)[col_idx]
    lines(xxp, as.numeric(ppv_y[, col_idx]), type = "s", col = colors[source], lwd = 2)
  }
}
PPVProfile(ppv_y, "Precision")

Accuracy

acc_y <- generate_accuracy_y(xx, "acc")
PPVProfile(acc_y, "Accuracy")

Recall

sens_y <- generate_accuracy_y(xx, "sens")
PPVProfile(sens_y, "Sensitivity")

area_figure <- function() {
  par(new, mfrow = c(2, 2), mar = c(4, 3, 2, 2))
  pdfAreaFigure(lb = "A")
  PPVProfile(acc_y, "Accuracy", lb = "B")
  PPVProfile(sens_y, "Sensitivity", lb = "C")
  PPVProfile(ppv_y, "Precision", lb = "D")
}
save_plot(area_figure, "AccuracyProfile")
area_figure()

Table of ranges

This is table 3, which has accuracy, recall, and precision for ranges of values.

HRSL <- fullCat(ppv_data["y", "HRSL"][[1]], ppv_data["x", "HRSL"][[1]])
LS <- fullCat(ppv_data["y", "LandScan"][[1]], ppv_data["x", "LandScan"][[1]])
WP <- fullCat(ppv_data["y", "WorldPop-C"][[1]], ppv_data["x", "WorldPop-C"][[1]])
WPU <- fullCat(ppv_data["y", "WorldPop-U"][[1]], ppv_data["x", "WorldPop-U"][[1]])
GPW <- fullCat(ppv_data["y", "GPW"][[1]], ppv_data["x", "GPW"][[1]])


# ppv <- cbind(HRSL = as.vector(HRSL$ppv), LS = as.vector(LS$ppv), WP = as.vector(WP$ppv))
# sens <- cbind(HRSL = as.vector(HRSL$sens), LS = as.vector(LS$sens), WP = as.vector(WP$sens))
# acc <- cbind(HRSL = as.vector(HRSL$acc), LS = as.vector(LS$acc), WP = as.vector(WP$acc))

ppv <- cbind(HRSL = as.numeric(HRSL$ppv), LS = as.numeric(LS$ppv), WPC = as.numeric(WP$ppv), WPU = as.numeric(WPU$ppv), GPW = as.numeric(GPW$ppv))
sens <- cbind(HRSL = as.numeric(HRSL$sens), LS = as.numeric(LS$sens), WPC = as.numeric(WP$sens), WPU = as.numeric(WPU$sens), GPW = as.numeric(GPW$sens))
acc <- cbind(HRSL = as.numeric(HRSL$acc), LS = as.numeric(LS$acc), WPC = as.numeric(WP$acc), WPU = as.numeric(WPU$acc), GPW = as.numeric(GPW$acc))
accuracy_as_table <- signif(cbind(acc, sens, ppv), 3)
rownames(accuracy_as_table) <- rownames(HRSL)
t(accuracy_as_table)

Remember that you can turn this into LaTex using the command knitr::kable(t(accuracy_as_table), format = "latex").

Goodness of Fit Ratio

The goodness of fit ratio (GOFR) is Dave's metric. It's both described in the paper and in ``Accuracy Metrics'' under the section SSD Profiles. Sean made an important comment about the GOFR, that it doesn't use an estimator where he expected one in a statistical model. He suggested we look at variance calculations and make the GOFR more similar to that.

We need the admin2 for Bioko in order to recreate a table with GOFR for both the island and the four admin2 regions on the island.

# The input map is in UTM, so we transform it to lat-long in order to agree
# with the raster coordinates of all rasters.
admin2_sf <- sf::st_transform(sf::st_read(rprojroot::find_package_root_file(
  "inst/extdata/admin2/bioko_admin2_fulldistricts.shp")),
  raster::crs(maps[1][[1]]))
admin2_st <- as(st_geometry(admin2_sf), "Spatial")
admin2_st_df <- as(admin2_sf[, "OBJECTID"], "Spatial")
tmap::tm_shape(admin2_sf) + tmap::tm_polygons("admin2")

The GOFR itself is Sum-of-squared-errors divided by variance. We want to calculate this for the whole island and for districts on the island. Each GOFR calculation takes two maps as input, a source map, such as HRSL-fine, and a comparison map, which would be BIMEP-on-HRSL-fine-grid. So we will iterate through source maps, taking the source and the comparison map.

calculate_gofr <- function(normalize_gofr, gofr_map_type) {
# For each grid and resolution, there should be a map and a BIMEP version of that map.
admin_names <- c("Baney", "Malabo", "Luba", "Riaba")
# If these are out of order, then the names in admin_names are in the wrong order.
stopifnot(admin2_sf$OBJECTID == 1:4)
gofr <- files_df[files_df$source != "BIMEP", ]
rownames(gofr) <- paste(gofr$source, gofr$resolution)
gofr$GOFR <- numeric(nrow(gofr))
gofr$Baney <- numeric(nrow(gofr))
gofr$Malabo <- numeric(nrow(gofr))
gofr$Luba <- numeric(nrow(gofr))
gofr$Riaba <- numeric(nrow(gofr))

# This is the GOFR equation. This defines it.
gofr_of_maps_gold_variance <- function(pops_map, pops_bimep, normalize = FALSE) {
  if (normalize) {
    pops_bimep <- pops_bimep / sum(pops_bimep, na.rm = TRUE)
    pops_map <- pops_map / sum(pops_map, na.rm = TRUE)
  }
  pixel_cnt <- sum(is.finite(pops_map))
  # Ignore NA because they tell us where there is water. Rasters are designed to agree on water location.
  pixel_error <- sum((pops_map - pops_bimep)^2, na.rm = TRUE) / pixel_cnt
  pixel_error / var(pops_bimep, na.rm = TRUE)
}

# This is the GOFR equation. This defines it.
gofr_of_maps <- function(pops_map, pops_bimep, normalize = FALSE) {
  if (normalize) {
    pops_bimep <- pops_bimep / sum(pops_bimep, na.rm = TRUE)
    pops_map <- pops_map / sum(pops_map, na.rm = TRUE)
  }
  pixel_cnt <- sum(is.finite(pops_map))
  # Ignore NA because they tell us where there is water. Rasters are designed to agree on water location.
  pixel_error <- sum((pops_map - pops_bimep)^2, na.rm = TRUE)
  pixel_error / sum((pops_map - mean(pops_bimep, na.rm = TRUE))^2, na.rm = TRUE)
}

for (gofr_make_idx in 1:nrow(gofr)) {
  gofr_source <- gofr[gofr_make_idx, "source"]
  gofr_grid <- gofr[gofr_make_idx, "grid"]
  gofr_resolution <- gofr[gofr_make_idx, "resolution"]
  bimep_name <- files_df[
    files_df$grid == gofr_grid & files_df$resolution == gofr_resolution & files_df$source == "BIMEP",
    "name"
    ]
  pops_bimep <- raster::getValues(gofr_map_type[bimep_name][[1]])
  compare_map <- gofr_map_type[gofr[gofr_make_idx, "name"]][[1]]
  pops_map <- raster::getValues(compare_map)
  stopifnot(length(pops_bimep) == length(pops_map))

  # GOFR for the whole island.
  gofr$GOFR[gofr_make_idx] <- gofr_of_maps(pops_bimep, pops_map, normalize_gofr)

  # Iterate through the four admin2 by ID.
  segmented <- raster::getValues(raster::rasterize(
    admin2_st, compare_map, field = admin2_sf$OBJECTID))
  for (adm2_idx in 1:length(admin_names)) {
    sub_bimep <- pops_bimep[segmented == adm2_idx]
    sub_pops <- pops_map[segmented == adm2_idx]
    gofr[[admin_names[adm2_idx]]][gofr_make_idx] <- gofr_of_maps(sub_bimep, sub_pops, normalize_gofr)
  }
}
gofr
}
# Set to maps or density for original map or density surface.
gofr_map_type <- maps
gofr_absolute <- calculate_gofr(FALSE, gofr_map_type)
gofr_normalized <- calculate_gofr(TRUE, gofr_map_type)
choose <- 1:nrow(gofr_normalized)
cols <- c("grid", "resolution", "GOFR", "Baney", "Luba", "Malabo", "Riaba")
gofr_table <- rbind(gofr_absolute[choose, cols], gofr_normalized[choose, cols])
gofr_table

The incoming maps are HRSL fine, WorldPop fine, and LandScan coarse.

# This is here to show how to get the table into LaTeX. Add format = "latex" to the line below.
# You'll want to run this in the terminal and copy to the document.
knitr::kable(t(gofr_table[, !names(gofr_table) %in% c("resolution")]), digits = 4)

Let's remind ourselves what the WorldPop looks like, because those numbers are huge.

plot(maps[["WorldPop-U on WorldPop fine"]])

Map Plots

We need a couple of pretty plots.

library(tmap)
# I made this outline from the admin2 borders using QGis b/c the file
# made with sf::st_combine wasn't valid.
bbioko <- sf::st_transform(sf::st_read(rprojroot::find_package_root_file(
  "inst/extdata/Bioko_outline.shp")),
  raster::crs(maps[1][[1]]))
qleg <- FALSE
qbreaks <- c(0, 1, 10, 100, 1000, 10000, 100000)
qnames <- c("<1", "1-10", "10-100", "100-1000", "1000-10000", ">10000")
qpal <- c("#BBBBBBFF", viridis_mock(length(qnames) - 1))
# raster::projection(density[["BIMEP on HRSL coarse"]]) <- "+proj=longlat +datum=WGS84 +no_defs"
# tmap::tm_shape(bbioko) + tmap::tm_borders(col = "black", lwd = 1) +
for (set_crs_idx in 1:length(density)) {
  raster::crs(density[[set_crs_idx]]) <- utm_projection
}
qbm <- tm_shape(density[["BIMEP on HRSL coarse"]]) +
  tm_raster(title = "Population", palette = qpal, breaks = qbreaks, labels = qnames) +
  tm_layout(title = "BIMEP", frame = FALSE) +
  tm_legend(show = TRUE, legend.position = c(0.02, .01))

qwpc <- tm_shape(density[["WorldPop-C on WorldPop coarse"]]) +
  tm_raster(title = "qwpc", palette = qpal, breaks = qbreaks, labels = qnames) +
  tm_layout(title = "WorldPop-C", frame = FALSE) + tm_legend(show = qleg)

qwpu <- tm_shape(density[["WorldPop-U on WorldPop coarse"]]) +
  tm_raster(title = "qwpu", palette = qpal, breaks = qbreaks, labels = qnames) +
  tm_layout(title = "WorldPop-U", frame = FALSE) + tm_legend(show = qleg)

gpw <- tm_shape(density[["GPW on LandScan coarse"]]) +
  tm_raster(title = "gpw", palette = qpal, breaks = qbreaks, labels = qnames) +
  tm_layout(title = "GPW", frame = FALSE) + tm_legend(show = qleg)

qls <- tm_shape(density[["LandScan on LandScan coarse"]]) +
  tm_raster(title = "qls", palette = qpal, breaks = qbreaks, labels = qnames) +
  tm_layout(title = "LandScan", frame = FALSE) + tm_legend(show = qleg)

qhrsl <- tm_shape(density[["HRSL on HRSL coarse"]]) +
  tm_raster(title = "qhrsl", palette = qpal, breaks = qbreaks, labels = qnames) +
  tm_layout(title = "HRSL", frame = FALSE) + tm_legend(show = qleg)

quadmap <- tmap_arrange(qbm, qls, qhrsl, qwpc, gpw, qwpu, ncol = 2, nrow = 3)
tmap_save(quadmap, "islandquad.png")
tmap_save(quadmap, "islandquad.pdf")
quadmap

library(tmap)
bcrop <- raster::extent(c(8.685, 8.85, 3.65, 3.80))  # for maps, not densities.
#bcrop <- raster::extent(c(440000, 490000, 350000, 424000))
original_map_names <- c(
 "BIMEP on HRSL fine",
 "HRSL on HRSL fine",
 "LandScan on LandScan coarse",
 "WorldPop-C on WorldPop fine",
 "GPW on LandScan coarse",
 "WorldPop-U on WorldPop fine"
)
short_names <- vapply(strsplit(original_map_names, " "), function(x) x[[1]], FUN.VALUE = character(1))
bmaps <- maps[original_map_names]
bbreaks <- c(0, 1, 5, 10, 50, 100)
bnames <- c("0-1", "1-5", "5-10", "10-50", "50-100")
#bnames[seq(2, length(bnames), 2)] <- ""
bpal <- c("#BBBBBBFF", viridis_mock(length(bbreaks) - 1))

The LandScan plot doesn't look right, so let's take a closer look before including it with the rest.

map_idx <- 3
landscan_closeup <- raster::crop(bmaps[map_idx][[1]], bcrop)
raster::res(landscan_closeup)[[1]]
prod(raster::res(landscan_closeup))
raster::plot(landscan_closeup, colNA="black")
raster::hist(landscan_closeup)
vv <- raster::getValues(landscan_closeup)
c(min(vv, na.rm = TRUE), max(vv, na.rm = TRUE))
vv
tm_shape(landscan_closeup + 0.1) +
  tmap_options(max.categories = 1936) +
  tm_raster(colorNA = "black") +
  tm_layout(frame = TRUE)

bcropped <- raster::crop(bmaps[1][[1]], bcrop)
fig2_order <- c("BIMEP", "LandScan", "HRSL", "WorldPop-C", "GPW", "WorldPop-U")
stopifnot(all(fig2_order %in% short_names))
stopifnot(all(short_names %in% fig2_order))
# Adding 0.1 is a way to put integers in the middle of the breakpoints.
small_maps <- lapply(1:length(fig2_order), function(map_idx) {
  chosen_bmap <- which(startsWith(names(bmaps), fig2_order[map_idx]))
  crop_map <- raster::crop(bmaps[chosen_bmap][[1]], bcrop) + 0.1
  # For maps with 10x the side length.
  density_adjust <- (.3 / 360)^2 / prod(raster::res(crop_map))
  tm_shape(crop_map * density_adjust) +
  tm_raster(title = short_names[map_idx], palette = bpal, breaks = bbreaks, labels = bnames) +
  tm_legend(show = FALSE) +
  tm_layout(title = LETTERS[map_idx], frame = FALSE)
})
bleg <- tm_shape(bcropped) +
  tm_raster(title = "Population", palette = bpal, breaks = bbreaks, labels = bnames) + 
  tm_legend(legend.only = TRUE, position = c("center", "top"), title.size = 2, text.size = 1.5)
small_maps[length(small_maps) + 1] <- bleg
small_maps[c("ncol", "nrow")] <- c(2, 3)
#closeup <- tmap_arrange(bp1, bp2, bp3, bleg, ncol = 2, nrow = 2)
# closeup <- do.call(tmap_arrange, small_maps)
sm <- small_maps
closeup <- tmap_arrange(sm[[1]], sm[[2]], sm[[3]], sm[[4]], sm[[5]], sm[[6]], nrow = 3, ncol = 2)
tmap_save(closeup, "closeup.png")
tmap_save(closeup, "closeup.pdf")
closeup

bp1 <- tm_shape(bcropped) +
  tm_raster(title = "BIMEP", palette = bpal, breaks = bbreaks, labels = bnames) +
  tm_legend(show = FALSE) +
  tm_layout(title = "A", frame = FALSE)
bp2 <- tm_shape(raster::crop(bmaps[2][[1]], bcrop)) +
  tm_raster(title = "WorldPop", palette = bpal, breaks = bbreaks, labels = bnames) +
  tm_legend(show = FALSE) +
  tm_layout(title = "B", frame = FALSE)
bp3 <- tm_shape(raster::crop(bmaps[3][[1]], bcrop)) +
  tm_raster(title = "HRSL", palette = bpal, breaks = bbreaks, labels = bnames) +
  tm_legend(show = FALSE) +
  tm_layout(title = "C", frame = FALSE)

Note, for comparison, how diffuse the density map is, because the density map averages over a disc that's a square kilometer in area.

bcrop <- raster::extent(c(465000, 490000, 400000, 424000))
bmaps <- density[c(3, 12, 4)]  # bimep, worldpop, hrsl
raster::plot(raster::crop(bmaps[1][[1]], bcrop))

Figure 7: PR by Population

This makes the plot for figure 7.

cdfAreaLines <- function(H, source, resolution, linewidth = 2) {
  H <- sort(H)
  area <-  c(1:length(H)) / length(H)
  lines(
    area,
    cumsum(H) / sum(H),
    col = colors[source],
    lwd = linewidth,
    lty = ifelse(resolution == "coarse", 1, 2)
    )
}

cdfAreaFigure <- function(llwd = 1, label = "") {
  plot(
    vector(mode = "numeric", length = 0),
    xlim = c(0, 1), ylim = c(0, 1),
    xlab =  "% Area", ylab =  "Proportion of Population",
    main = label
    )
  sources <- c("BIMEP", "HRSL", "LandScan", "WorldPop-C", "WorldPop-U", "GPW")
  for (source in sources) {
    for (resolution in c("coarse", "fine")) {
      if (source == "BIMEP") {
        data <- linear_from_source_resolution("HRSL", resolution, density)$x
      } else {
        data <- linear_from_source_resolution(source, resolution, density)$y
      }
      cdfAreaLines(data, source, resolution, linewidth = llwd)
    }
  }
  legend("topleft", legend =  sources, col = colors[sources],
         lty = 1, lwd = 2, cex = 1)
}

#cdfAreaFigure(2)

# Get the pfpr data.
source_data <- "/home/adolgert/dev/popbioko/inst/extdata/source"
load(file.path(source_data, "popData_image.RData"))
load(file.path(source_data, "pfpr_pops.RData"))
popk2 <- distinct(popk, popk$areaId, .keep_all = T)
poppr <- merge(mcprk, popk2, by = "areaId", all.x = T)
poppr <- poppr[!is.na(poppr$pop1k), ]

PfPR <- poppr$pfpr

H0 <- poppr$pop1k / sum(poppr$pop1k, na.rm = T)
H1 <- poppr$wppop / sum(poppr$wppop, na.rm = T)
H2 <- poppr$lspop / sum(poppr$lspop, na.rm = T)
H3 <- poppr$popfb / sum(poppr$popfb, na.rm = T)

library(st)
library(sf)
areas_sf <- sf::st_as_sf(areasgr)
areas_df <- sf::st_set_geometry(areas_sf, NULL)
stopifnot(all(sort(areas_df[[1]]) == areas_df[[1]]))
library(raster)
sources <- c(
  "BIMEP on HRSL fine", "HRSL on HRSL fine", "LandScan on LandScan fine",
  "WorldPop-C on WorldPop fine", "WorldPop-U on WorldPop fine", "GPW on LandScan fine")
aligned <- list()
for (align_source in sources) {
  base_name <- strsplit(align_source, " ")[[1]][1]
  save_name <- paste0(base_name, "_gr.Rda")
  if (file.exists(save_name)) {
    aligned[[align_source]] <- local({
      load(save_name)
      bimep_gr
    })
  } else {
    density_map <- density[[align_source]]
    bimep_gr <- raster::extract(
      density_map, areasgr, cellnumbers = TRUE, method = "bilinear",
      na.rm = TRUE, fun = mean)
    save(bimep_gr, file = save_name)
    aligned[[align_source]] <- bimep_gr
  }
}
on_bgrid <- lapply(aligned, function(pop_matrix) {
  pm <- as.array(pop_matrix)
  pm[is.nan(pm)] <- 0  # Also an artifact.
  pm[pm < 0] <- 0  # largest negative is 10^-11. From bilinear interpolation.
  pm
  })
on_bgrid[["areaId"]] <- areas_df[["FID"]]
on_bgrid_df <- do.call(data.frame, on_bgrid)
poppr <- merge(mcprk, on_bgrid_df, by = "areaId", all.x = TRUE)
poppr <- poppr[complete.cases(poppr),]
row.is.finite <- apply(poppr, 1, function(x) any(is.finite(x)))
poppr <- poppr[row.is.finite, ]

prBYh_cdf <- function(Pf, H, B = 100) {
  xx <- seq(0, 1, length.out = B + 1)
  cdf <- 0 * xx
  for (i in 0:B) {
    ix <- which(Pf <= xx[i + 1])
    cdf[i + 1] <- ifelse(length(ix) == 0, 0, sum(H[ix]))
  }
  cbind(xx, cdf)
}

prBYh_pdf <- function(Pf, H, B = 100) {
  xx <- prBYh_cdf(Pf, H, B)
  xmid <- (xx[-1, 1] + xx[-B - 1, 1]) / 2
  pdf <- diff(xx[, 2])
  cbind(xmid, pdf)
}

cdfPfPlot <- function(Pf, H, B = 100, clr = "black", llwd = 2) {
  xx <- prBYh_cdf(Pf, H, B)
  plot(xx[, 1], xx[, 2],
    type = "l", xlab = "PfPR", ylab = "Population fraction",
    main = "A", lwd = llwd, col = clr, xlim = c(0, .45),
    ylim = c(0, 1)
  )
}

cdfPfLines <- function(Pf, H, B = 100, clr = "red", llwd = 2) {
  xx <- prBYh_cdf(Pf, H, B)
  lines(xx[, 1], xx[, 2], lwd = llwd, col = clr)
}

pdfPfPlot <- function(Pf, H, B = 100, llwd = 2, clr = "black") {
  xx <- prBYh_pdf(Pf, H, B)
  plot(xx[, 1], xx[, 2], col = clr, type = "l", lwd = llwd, xlim = c(0, .45), ylim = c(0, 0.30), xlab = "PfPR", ylab = "Population fraction", main = "B")
}

pdfPfLines <- function(Pf, H, B = 100, clr = "red", llwd = 2) {
  xx <- prBYh_pdf(Pf, H, B)
  lines(xx[, 1], xx[, 2], col = clr, lwd = llwd)
}

PfPR <- poppr$pfpr
sources <- c("BIMEP", "HRSL", "LandScan", "WorldPop.C", "WorldPop.U", "GPW")
pop_cols <- vapply(sources, function(x) names(poppr)[startsWith(names(poppr), x)], character(1))

pfprFigure <- function() {
  par(mfrow = c(1, 2))
  for (cdf_idx in seq(pop_cols)) {
    BB <- 80
    pops <- poppr[[pop_cols[cdf_idx]]]
    normed_data <- pops / sum(pops)
    short_name <- gsub("\\.", "-", names(pop_cols)[cdf_idx])
    if (cdf_idx == 1) {
      cdfPfPlot(PfPR, normed_data, BB, clr = colors[short_name])
    } else {
      cdfPfLines(PfPR, normed_data, BB, clr = colors[short_name])
    }
  }

  short_names <- character(0)
  colors_used <- character(0)
  for (pdf_idx in seq(pop_cols)) {
    pops <- poppr[[pop_cols[pdf_idx]]]
    normed_data <- pops / sum(pops)
    short_name <- gsub("\\.", "-", names(pop_cols)[pdf_idx])
    if (pdf_idx == 1) {
      pdfPfPlot(PfPR, normed_data, BB, clr = colors[short_name])
    } else {
      pdfPfLines(PfPR, normed_data, BB, clr = colors[short_name])
    }
    short_names <- c(short_names, short_name)
    colors_used <- c(colors_used, colors[short_name])
  }
  legend("topright",
    legend = short_names, col = colors_used,
    lty = 1, lwd = 2, cex = 1
  )
}

pfprFigure()
pdf("PRbypop_.pdf", width = 10, height = 5)
pfprFigure()
a <- dev.off()

dd-harp/population_comparison_bioko documentation built on Feb. 28, 2021, 11:05 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

dd-harp/population_comparison_bioko
Compares population estimates for Bioko Island, Equatorial Guinea

In dd-harp/population_comparison_bioko: Compares population estimates for Bioko Island, Equatorial Guinea

Overview

Maps

Summary Statistics

Urban Fraction

Raw Urban Fraction

Kernel Density Estimation

Construct the Summary Statistics Table

Comparison Tables

Compare two urban fractions

Compare maps of raw and estimated urban fractions

Compare across source grids

Compare across resolutions

Error Bounds

Discussion

Population Scatter Plots

Cumulative Maps

Proportion, Accuracy, Recall, and Precision

Define the values

Proportion

Precision

Accuracy

Recall

Table of ranges

Goodness of Fit Ratio

Map Plots

Figure 7: PR by Population

R Package Documentation

Browse R Packages

We want your feedback!

dd-harp/population_comparison_bioko Compares population estimates for Bioko Island, Equatorial Guinea

In dd-harp/population_comparison_bioko: Compares population estimates for Bioko Island, Equatorial Guinea

Overview

Maps

Summary Statistics

Urban Fraction

Raw Urban Fraction

Kernel Density Estimation

Construct the Summary Statistics Table

Comparison Tables

Compare two urban fractions

Compare maps of raw and estimated urban fractions

Compare across source grids

Compare across resolutions

Error Bounds

Discussion

Population Scatter Plots

Cumulative Maps

Proportion, Accuracy, Recall, and Precision

Define the values

Proportion

Precision

Accuracy

Recall

Table of ranges

Goodness of Fit Ratio

Map Plots

Figure 7: PR by Population

R Package Documentation

Browse R Packages

We want your feedback!

dd-harp/population_comparison_bioko
Compares population estimates for Bioko Island, Equatorial Guinea