SPEW: A brief tour

library(knitr) # We need the knitr package to set chunk options
# Set default knitr options for knitting code into the report:
# opts_chunk$set(cache=TRUE, autodep=TRUE, cache.comments=FALSE)


Introduction

Welcome to SPEW (Synthetic Population and Ecosystems of the World). This guide is for use in for basic usage of the R package spew. This vignette will walk through the set-up of the fictitious human ecosystem of Tartanville. We show how to quickly 'spew' a synthetic ecosystem; sample agents using different methods for location assignments, sample agents using different methods for population characteristic assignments, discuss the essential input data for SPEW, expand upon the supplementary data format required for SPEW, and briefly describe some of the more advanced functions of SPEW.

SPEW uses random sampling to sample both locations of residence for the synthetic individuals and individual records from the microdata. SPEW is modular in that it can use different methods of sampling to emphasize different aspects of the synthetic ecosystem. A few methods are demonstrated here. The methods in bold are the default options.

A true quickstart

Downloading SPEW

Open up an R session and run the following commands.

  install.packages("spew")

Next, load the spew library into the R-session with:

library(spew)

Tartanville: SPEW basics

To plot with functions in SPEW, make sure ggplot2 is installed.

data(tartanville)

tartanville_syneco <- spew(tartanville$pop_table, tartanville$shapefile,
                            tartanville$pums_h, tartanville$pums_p)

plot_syneco(tartanville, tartanville_syneco, region_name = "Tartanville")

out <-  summarize_spew_out(tartanville_syneco, vars_to_sum_h = c("puma_id"),
                            vars_to_sum_p = c("SEX"),
                           vars_to_sum_env = NULL, top_region_id = "Tartanville")
print(out)

g <- plot_characteristic_proportions(feature_name = "Sex", legend_name = "Sex",
                                feature_df = out$SEX,
                                category_names = c("Male", "Female"),
                                text_size = 10,
                                region_colors = c("lightslateblue", "maroon1"))

g <- plot_pop_totals(feature_df = out$pop_totals, type = "n_people")

Sampling of locations

In SPEW, we assign locations to our synthetic agents. We currently support two methods of location assignment:

  1. Uniform sampling
  2. Road-based sampling

Uniform sampling

Uniform sampling of locations means sampling uniformly within the boundaries of a region from the inputted shapefile. Population density is maintained at a macro-level but may not accurately reflect the actual population density of a region. For instance, synthetic individuals may be placed in lakes by mistake under this method.

Ex. As shown in the 'A true quickstart' tab.

Road-based sampling

Roads-based sampling requires the shapefile argument to be a list with two entries, the first as the boundaries and the second as the roads. The first entry should be a shapefile of polygons, and the second, a shapefile of lines. We see that in the resulting synthetic ecosystem, the agents are assigned locations closer to the roads than in uniform sampling. We can adjust the variability of how close the agents are to the roads by adjusting roads_noise. Larger values result in larger variance.

Ex.

data(tartanville)
shapefile <- list(boundaries = tartanville$shapefile, roads = tartanville$roads)

tartanville_syneco <- spew(tartanville$pop_table, shapefile,
                            tartanville$pums_h, tartanville$pums_p,
                            locations_method = "roads", road_noise = .05)

plot_syneco(tartanville, tartanville_syneco, region_name = "Tartanville")

Sampling of characteristics

In SPEW, we assign population characteristics to our synthetic agents. We currently support three methods of population characteristics assignment:

  1. Uniform sampling
  2. Moment Matching (MM)
  3. Iterative Proportional Fitting (IPF)

Uniform sampling

Uniform sampling of population characteristics means sampling from the microdata where each record is given equal weight. Thus, if the microdata is thought to be representative of the regions, then the resulting synthetic individuals should have close to the correct distribution of population characteristics. If the microdata is not representative, then we must implement other sampling methods.

Ex. As shown in the 'A true quickstart' tab.

Moment Matching

Assume we have the first moment(s) of a continuous or ordered variable(s)'s distribution for every sub-region in the region of interest (e.g. average household size for every block in Tartanville). Moment matching is a method to calculate weights for the microdata records so that after sampling, the sample average of the variable(s) will be matched to the given averages. We demonstrate with the average household size (NP) of Tartanville. The average household sizes have been exaggerated here as we are generating a small synthetic population.

Ex.

We first make the moments object for use in SPEW. Here, the moments object is the average household size for each block. The function make_mm_obj puts the data necessary for MM in the right format. The format is discussed more in another section. The supplementary_data is thus a list including the moments outputted from make_mm_obj.

NP_avg <- c(3.2, 0, 6.0,
            2.0, 3.2, 3.1,
            4.0, 4.8, 3.9)
supplementary_data <- list(moments = make_mm_obj(moments_list =
                           list(mom1 = data.frame(place_id = paste0("T", 1:9),
                                                  puma_id = "T",
                                                  NP = NP_avg)),
                           assumption = "independence",
                                nMom = 1, type = "cont"))

And then we use SPEW, with the additional library quadprog, which is required for MM. Note we have supplemented spew with both a supplementary_data argument and the sampling method of 'mm'. The outputted synthetic ecosystem is in the same format. We see that MM matches the inputted average household sizes well, especially compared to uniform sampling.

data(tartanville)
tartanville_syneco_mm <- spew(pop_table = tartanville$pop_table, 
                              shapefile = tartanville$shapefile,
                              pums_h = tartanville$pums_h, 
                              pums_p = tartanville$pums_p,
                              marginals = supplementary_data$moments, 
                              sampling_method = "mm")

non_empty_regions <- sapply(tartanville_syneco_mm, class) == "list"
synthetic_NP_avg <- sapply(tartanville_syneco_mm[non_empty_regions], function(ll){
    mean(ll$households$NP, na.rm = TRUE)
    })
sum(abs(synthetic_NP_avg - NP_avg[-2]))


# Comparing to uniform sampling
synthetic_NP_avg_unif<- sapply(tartanville_syneco[non_empty_regions], function(ll){
    mean(ll$households$NP, na.rm = TRUE)
    })
sum(abs(synthetic_NP_avg_unif- NP_avg[-2]))

Iterative Proportional Fitting

Iterative Proportional Fitting is a method discovered by Deming and Stephan (1941) and implemented as a sampling scheme by Beckman et al. (1996). The idea behind IPF is filling in a contingency table where the marginal totals are known. As such, an important step is "cutting" our variables into discrete categories if this has not already been done.

In SPEW, we do this by creating a marginals object which contains information both how to cut our variables into categories and how many of each category we expect in each region for each marginal variable. We demonstrate this below.

Ex. We must make the marginals object for use of the IPF sampling method. We create two marginal objects, one for household income and the other for head of household race and combine them together for use in SPEW.

## Income
var_name <- "HHINC"
## How to cut the variable
type <- "ord"
bounds <- data.frame(lower = c(0, 50), upper = c(49, Inf))
category_name <- c("HHINC_0-49", "HHINC_50-Inf")
## How often we expect to see each category for each region.
df <- data.frame(place_id = paste0("T", 1:9),  v1 = c(30, 0, 5, 10, 13, 9, 2, 1, 5))
df$v2 <- tartanville$pop_table$n_house - df$v1
#
ipf_obj_hhinc<- make_ipf_obj(var_name, type, bounds, category_name, df = df)

ipf_obj_hhinc

## Head of Household Race
var_name <- c("RAC1P")
type <- "cat"
bounds <- data.frame(lower = c(1, 2), upper = c(1, 2))
category_name <- c("Tartan", "Argyle")
df2 <- data.frame(place_id = paste0("T", 1:9),  v1 = c(28, 0, 4, 1, 5, 8, 2, 1, 3))
df2$v2 <- tartanville$pop_table$n_house - df2$v1
ipf_obj_rac1p <- make_ipf_obj(var_name, type, bounds, category_name, df = df2)

# Combine both together
ipf_obj <- list(HHINC = ipf_obj_hhinc[[1]], RAC1P = ipf_obj_rac1p[[1]])
supplementary_data <- list(moments = ipf_obj)

As demonstration of how we cut our variables, we use a function called align_pums.

pums_h <- align_pums(tartanville$pums_h, ipf_obj, suffix = "_ipf") # split into categories
knitr::kable(head(pums_h))

We are now ready to perform IPF-sampling and do so below. SPEW relies on the mipfp package. We find that IPF matches the table we gave as expected number of households for each category in each marginal variable quite well. Note: IPF does not always converge, generally due to having a cell in the contingency with value 0.

tartanville_syneco_ipf <- spew(tartanville$pop_table, tartanville$shapefile,
                               pums_h, tartanville$pums_p,
                               marginals = supplementary_data$moments, 
                               sampling_method = "ipf")


out <-  summarize_spew_out(tartanville_syneco_ipf, vars_to_sum_h = c("HHINC_ipf", "RAC1P_ipf"),
                            vars_to_sum_p = c("SEX"),
                    vars_to_sum_env = NULL, top_region_id = "Tartanville",
                    marginals = supplementary_data$moments)

sum(abs(as.matrix(out$HHINC_ipf[order(out$HHINC_ipf$region), -3]) - as.matrix(df[-2, -1])))
sum(abs(as.matrix(out$RAC1P_ipf[order(out$HHINC_ipf$region), -3]) - as.matrix(df2[-2, -1])))

We see we almost match almost perfectly on household income and very closely on race.

Assigning environments

General assignment function

We can also assign environments to the population of Tartanville. Environments are locations where the agents interact, both with one another, and possibly the environment itself.

Ex. We assign the children between ages 5 and 18 in T1 to one of the two schools in Tartanville.

knitr::kable(tartanville$environments)
tartanville_syneco <- spew(tartanville$pop_table, tartanville$shapefile,
                              tartanville$pums_h, tartanville$pums_p)

plot_syneco(tartanville, tartanville_syneco,
            region_name = "Tartanville")

We first subset the data frame of students in T1.

t1_people <- tartanville_syneco[[1]]$people
t1_students <- subset(t1_people, subset = (t1_people$AGEP >= 5 & t1_people$AGEP <= 18))
places <- subset(tartanville$environments, tartanville$environments$Type == "School")

We then assign the schools to the children, first without regard to capacity.

t1_assignments <- assign_place_coords(t1_students, places = places, place_name = "school")
table(t1_assignments$school)

These values are consistent as Andrew High (T7) is slightly closer to T1 than Maggie Mo High (T9), but only by a small distance.

However, we note that Maggie Mo High has about 4 times the capacity than Andrew High. If we take capacity into account, then we have more students going to Maggie Mo High. The user is encouraged to provide their own distance and capacity functions. The details may be found in the documentation of assign_place_cords.

t1_assignments <- assign_place_coords(t1_students, places = places, place_name = "school", method = "capacity")
table(t1_assignments$school)

Essential input data format

SPEW requires three essential data inputs connected through the variables place_id and puma_id.

| Input Data | Description | Required Variables | Format | |-------------------|----------------------------------------------------------------|-----------------------|--------| | Population Counts | Table of number of individuals per region. | place_id, puma_id | .csv | | Shapefile | Spatial boundaries of the regions in Population Counts. | place_id | .shp | | Microdata | Table of individual records with individual-level characteristics. Also known as Public Use Microdata Samples (PUMS). | We need both household-level microdata (pums_h) and individual-level microdata (pums_p). | puma_id | .csv |

Given a row in the table of population counts with the variables place_id and puma_id, one should be able to find the unique region place_id in the shapefile that matches that of the row along with the microdata records with the same puma_id.

Additionally, the above may be combined into a list along with a roads variable, environments variable and a supplementary_data variable to form an object used to spew the final synthetic ecosystem, e.g. the format of tartanville

data(tartanville)
names(tartanville)
class(tartanville)
lapply(tartanville, class)

Supplementary data format

Supplementary data formats

Due to the increased complexity of the data required for more sophisticated sampling methods, we describe here the form supplementary data required for SPEW to function.

Road-based sampling (roads)

The roads are in the form of spatial lines.

data(tartanville)
class(tartanville$roads)

Moment Matching (MM) (moments)

The moments object is created through the make_mm_obj function. This function requires a moments_list argument which has the following format.

moments_list <- list(mom1 = data.frame(place_id = {my.place.ids}, puma_id = {my.pumas},
                                       var1 = {var1.avg}, var2 = {var2.avg}, ..., var{p} ={varp.avg}),
                       assumption = {my.assumption},
                       nMom = 1, type = "cont")

Finally, the moments object is added to the list of supplementary data,

supplementary_data <- list(moments = moments)

Iterative Proportional Fitting (IPF) (marginals)

The marginals requires the most data. The variable marginals is itself a list such as

list(var1 = ipf_obj_var1[[1]], var2 = ipf_obj_var2[[1]], var{p} = ipf_obj_varp[[1]])

The names of the list must be names that are either found in resulting synthetic households or synthetic persons.

The individual IPF objects are created using the make_ipf_obj function requiring the following arguments:

Finally, the marginals object is added to the list of supplementary data,

supplementary_data <- list(marginals = marginals)


Try the spew package in your browser

Any scripts or data that you put into this service are public.

spew documentation built on Nov. 17, 2017, 7:36 a.m.