Population Downscaling Using Areal Interpolation - A Comparative Analysis"
In populR: Population Downscaling Using Areal Interpolation

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Introduction

Areal Interpolation may be defined as the process of transforming data reported over a set of spatial units (source) to another (target). Its application to population data has attracted considerable attention during the last few decades. A massive amount of methods have been reported in the scientific literature. Most of them focus on the improvement of the accuracy by using more sophisticated techniques rather than developing standardized methods. As a result, only a few implementation tools exists within the R community.

One of the most common, easy and straightforward methods of Areal Interpolation is Areal Weighting Interpolation (AWI). AWI proportionately interpolates the population values of the source features based on areal (or spatial) weights calculated by the area of intersection between the source and the target zones.

sf and areal packages provide Areal Interpolation functionality within the R ecosystem. Both packages implement (AWI). sf functionality comes up with extensive and intensive interpolation options and calculates the areal weights based on the total area of the source features (total weights). sf functionality is suitable for completely overlapping data. areal extends the existing functionality of the sf package by introducing an additional formula for data without complete overlap. In this case weights are calculated using the sum of the remaining source areas after the intersection (sum weights).

When the case involves Areal Interpolation of urban population data (small scale applications) where the source features (such as city blocks or census tracts) are somehow larger than target features (such as buildings) in terms of footprint area the sf functionality (total weights) is unable to calculate areal weights properly and therefore, is not ideal for such applications. areal functionality may be confusing for novice R (or GIS) users as it is not obvious that the weight option should be set to sum to calculate areal weights correctly.

To overcome these limitations populR is introduced. populR is suitable for Areal Interpolation of urban population and provides an AWI approach that matches the existing functionality of areal using sum weights and additionally, proposes a VWI approach which, to our knowledge, extends the existing Areal Interpolation functionality within the R ecosystem. VWI uses the area of intersection between source and target features multiplied by the building height or number of floors (volume) to guide the interpolation process.

In this vignette a comparative analysis of Areal Interpolation alternatives within the programming environment of R is carried out. sf, areal and populR results are obtained and further compared to a more realistic population distribution.

Case Study

A small part of the city of Mytilini, Lesvos, Greece was chosen as the case study (figure below).The study area consists of 9 city blocks (source) counting 911 residents and 179 buildings units (target) including floor number information. These data are included in populR package for further experimentation.

# attach library
library(populR)

# load data
data('src')
data('trg')

source <- src
target <- trg

# plot data
plot(source['geometry'], col = "#634B56", border = NA)
plot(target['geometry'], col = "#FD8D3C", add = T)

Implementation

In this section a demonstration of the sf, areal and populR packages takes place. First, the packages are attached to the script and next populR built-in data are loaded. Then, Areal Interpolation functions are executed for each one of the aforementioned packages.

The st_interpolate_aw() function of the sf package takes:

x: an object of class sf with data to be interpolated
to: the target geometries (sf object)
extensive: whether to use extensive (TRUE) or intensive interpolation (FALSE)

areal provides the aw_interpolate() function which requires:

data: an sf object to be used as target
tid: target identification numbers
source: an sf object with data to be interpolated
sid: source identification numbers
weight: may be either sum or total for extensive interpolation and sum intensive interpolation
output: whether sf object or tibble
extensive: a vector of quoted (extensive) variable names - optional if intensive is specified
intensive: a vector of quoted (intensive) variable names - optional if extensive is specified

Finally, populR offers pp_estimate() function which takes:

target: an sf object to be used as target
source: an sf object with data to be interpolated
sid: source identification number
spop: source population values to be interpolated
volume: target volume information (number of floors or height) - required for the vwi approach
point: whether to return point geometries (TRUE) or not (FALSE) - optional
method: whether to use awi or vwi

Evidently, sf package's st_interpolate_aw function requires only 3 arguments which make it very easy to implement while populR requires at least 5 and areal at least 7 arguments which potentially increases the implementation complexity.

On the other hand, only areal may be used for multiple interpolations at once as the extensive or intensive argument takes a vector of quoted values (not included in this vignette).

For the reader's convenience names were shortened as follows:

awi: populR awi approach
vwi: populR vwi approach
aws: areal using extensive interpolation and sum weights
awt: areal using extensive interpolation and total weights
sf: sf using extensive interpolation

# attach libraries 
library(populR)
library(areal)
library(sf)

# load data
data('src')
data('trg')

source <- src
target <- trg

# populR - awi
awi <- pp_estimate(target = target, source = source, spop = pop, sid = sid, 
                   method = awi)
# populR - vwi
vwi <- pp_estimate(target = target, source = source, spop = pop, sid = sid, 
                   volume = floors, method = vwi)

# areal - sum weights
aws <- aw_interpolate(target, tid = tid, source = source, sid = 'sid', 
                      weight = 'sum', output = 'sf', extensive = 'pop')
# areal - total weights
awt <- aw_interpolate(target, tid = tid, source = source, sid = 'sid', 
                      weight = 'total', output = 'sf', extensive = 'pop')

# sf - total weights
sf <- st_interpolate_aw(source['pop'], target, extensive = TRUE)

Results

The study area counts 911 residents as already mentioned in previous section. From the code chunk below it is clear that awi, vwi and aws correctly estimated population values as they sum to 911 while awt and sf results underestimated values. This is expected as both methods use the total area of the source features during the interpolation process and are useful when source and target features completely overlap.

# sum initial values
sum(source$pop)

# populR - awi
sum(awi$pp_est)

# populR - vwi
sum(vwi$pp_est)

# areal - awt
sum(awt$pop)

# areal - aws
sum(aws$pop)

# sf
sum(sf$pop)

Moreover, identical results were obtained by the awi and aws approaches and somehow different results by the vwi as shown in the code block below.

# order values using tid
awi <- awi[order(awi$tid),]
vwi <- vwi[order(vwi$tid),]

# get values and create a df
awi_values <- awi$pp_est
vwi_values <- vwi$pp_est

awt_values <- awt$pop
aws_values <- aws$pop

sf_values <- sf$pop

df <- data.frame(vwi = vwi_values, awi = awi_values, aws = aws_values,
                 awt = awt_values, sf = sf_values)

df[1:20,]

Comparison to Reference Data

Due to confidentiality concerns, population data at building level are not available in Greece. Therefore, an alternate population distribution previously published in Batsaris et al. 2019 was used as reference data set to compare the results.

This reference population values are included in the built-in data set as shown below in the field rf.

target

In the code chunk below the first 20 features are presented for comparison.

rf <- awi$rf

df <- cbind(rf, df)

df[1:20,]

populR provides a function (pp_compare()) to compare the results with alternate population data. pp_compare() produces scatter diagram, linear regression model, correlation coeficient ($R^2$), MAE (Mean Absolute Error) and RMSE (Root Mean Squared Error) to investigate the relationship of the results with the reference (or other) data.

Generally, the diagrams suggest strong and positive relationships in all cases. However, vwi provides the strongest relationship and $R^2$ coefficient. vwi provides the smallest MAE value in comparison with the other methods as shown below.

awi_error <- pp_compare(df, estimated = awi, actual = rf, title = "awi vs actual")
awi_error

vwi_error <- pp_compare(df, estimated = vwi, actual = rf, title = "vwi vs actual")
vwi_error

sf_error <- pp_compare(df, estimated = sf, actual = rf, title = "sf vs actual")
sf_error

awt_error <- pp_compare(df, estimated = awt, actual = rf, title = "awt vs actual")
awt_error

aws_error <- pp_compare(df, estimated = aws, actual = rf, title = "aws vs actual")
aws_error

RMSE (Root Mean Squared Error) is also calculated. Again, vwi provides the smallest error value as shown in the code block below.

Comparison on Performance

Finally, a performance comparison (execution times) is carried out in this section using microbenchmark package. Execution time measurements suggest that populR functionality executed much faster than areal and sf as shown below. Both awi and vwi achieved the best mean execution time (about 76.74 milliseconds). aws follows with 136.67 milliseconds and finally, awt with 180.53 milliseconds.

library(microbenchmark)

# performance comparison
microbenchmark(
  suppressWarnings(pp_estimate(target = target, source = source, spop = pop, sid = sid, 
                   method = awi)),
  suppressWarnings(pp_estimate(target = target, source = source, spop = pop, sid = sid, 
                   volume = floors, method = vwi)),
  aw_interpolate(target, tid = tid, source = source, sid = 'sid', 
                      weight = 'sum', output = 'sf', extensive = 'pop'),
  aw_interpolate(target, tid = tid, source = source, sid = 'sid', 
                      weight = 'total', output = 'sf', extensive = 'pop'),
  suppressWarnings(st_interpolate_aw(source['pop'], target, extensive = TRUE))
)

Summary

In this vignette a demonstration and a comparative analysis of areal interpolation packages implemented in urban population data is undertaken. Both sf and areal packages provide general purpose AWI functionality while populR package focuses on areal interpolation of population data. Additionally, populR provides VWI which extends R's existing functionality.

The city of Mytilini, Greece was used as the case study to investigate three main pillars: a) implementation, b) results, c) performance. Notes on implementation indicate that sf package requires only 3 arguments to use while populR at least 5 and areal 7. The results provide insight that sf and awt may not be ideal for data that are not completely overlapping. Moreover, aws and awi obtained the same results while vwi outperformed the others in comparison to the reference data set. Finally, populR performs much faster than sf and areal packages.