knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
Iterative proportional fitting (IPF) is used in many disciplines to adjust an initial set of weights to match various marginal distributions. This package implements the iterative proportional updating algorithm based on the paper from Arizona State University (IPU). In survey raking or population synthesis, the IPU algorithm has the added advantage of being able to match household- and person-level marginal targets.
The primary functions of the ipfr
package are discussed below.
ipu_matrix()
ipu()
synthesize()
suppressPackageStartupMessages({ library(ipfr) library(dplyr) })
The first example shows how to use ipfr
to solve one of the most common IPF
problems: Given a matrix of starting weights and a set of row and column
targets, balance the matrix in order to satisfy both row and column targets.
set.seed(42) mtx <- matrix(data = runif(9), nrow = 3, ncol = 3) row_targets <- c(3, 4, 5) column_targets <- c(5, 4, 3) mtx
The resulting matrix satisfies both row and column targets while retaining a similar relative distribution of weights in each cell.
result <- ipu_matrix(mtx, row_targets, column_targets) result rowSums(result) colSums(result)
The matrix example is common and conceptually simple, but makes thinking about higher-dimension problems difficult. A much better way to think about the problem (whether in two or ten dimensions) is as a table.
This example creates a household survey that will act as the seed (like the
matrix from the last example). The number of persons and autos for each
household is listed in the size
and autos
columns. A starting weight of 1 is
also included.
survey <- tibble( size = c(1, 2, 1, 1), autos = c(0, 2, 2, 1), weight = 1 ) survey
Assume that we know the total number of households by size and households by
auto-ownership from the Census. These target marginal distributions are just
like our row and column totals from the previous example. Targets are passed
into the ipu
function as shown below. Take the size targets, for example:
size
corresponds to the size
column in the survey.1
and 2
correspond to the unique values in the survey's
size
column.targets <- list() targets$size <- tibble( `1` = 75, `2` = 25 ) targets$autos <- tibble( `0` = 25, `1` = 50, `2` = 25 ) targets
A call to ipu()
returns a named list of results.
result <- ipu(survey, targets) names(result)
The first element is the resulting weight table. The primary seed has three columns appended:
result$weight_tbl
The second element is a histogram of the weight_factor
. This provides a quick
overview of the distribution of weights. Many large or small weights can
indicate underlying problems even if the ipu converged and matches targets. For
instance, a record with a weight factor of 50 means that it is greatly
over-represented in the re-weighted survey.
result$weight_dist
The next element is a comparison back to the targets provided. With complex seed and target tables, this makes investigating results quick and easy.
result$primary_comp
If secondary targets are provided to ipu()
, a fourth item in the list will
contain a secondary_comp
table.
In household survey expansion, it is common to want to control for certain
features that describe households (like size) while controlling for other
attributes that describe people (like age). This is possible with the ipu()
function.
This example is taken directly from the Arizona paper on page 20: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.537.723&rep=rep1&type=pdf
In this example, household type could represent size (e.g. 1-person and 2-person households). Person type could represent age groups (e.g. under 18, between 18 and 50, and over 50).
setup_arizona()
creates the seed and target tables used in the example.
result <- setup_arizona() hh_seed <- result$hh_seed hh_targets <- result$hh_targets per_seed <- result$per_seed per_targets <- result$per_targets
primary_seed
primary_target
secondary_seed
secondary_target
In the interest of keeping vignette build time short, the ipu()
algorithm is
only run for 30 iterations. After running for 400 or more iterations, the
results match closely to those shown in the paper.
result <- ipu(hh_seed, hh_targets, per_seed, per_targets, max_iterations = 30)
The comparison table for the secondary (person) marginals is now included in
result
. Feel free to run the code chunk above for 400 or more iterations and
then look again.
result$weight_tbl result$primary_comp result$secondary_comp
ipu()
allows different geographies to be specified for different marginal
tables. There are a few rules that make this possible, but in short, a geography
field in each target table tells the algorithm which scale to constrain to. If
no geography field is present in a target table, it applies to the entire seed
table (i.e. region).
All of the following rules are checked by the function and a warning message will show which (if any) is violated.
To demonstrate, the Arizona example is modified to add two different tracts for household controls but to still control the person targets at the regional level.
# Repeat the hh_seed to create tract 1 and 2 households new_hh_seed <- hh_seed %>% mutate(geo_tract = 1) new_hh_seed <- bind_rows( new_hh_seed, new_hh_seed %>% mutate(geo_tract = 2, id = id + 8) ) new_hh_seed$geo_region <- 1 new_hh_seed # Repeat the household targets for two tracts new_hh_targets <- hh_targets new_hh_targets$hhtype <- bind_rows(hh_targets$hhtype, hh_targets$hhtype) new_hh_targets$hhtype <- new_hh_targets$hhtype %>% mutate(geo_tract = c(1, 2)) new_hh_targets$hhtype # Repeat the per_seed to create tract 1 and 2 persons new_per_seed <- bind_rows( per_seed, per_seed %>% mutate(id = id + 8) ) new_per_seed # Double the regional person targets new_per_targets <- per_targets new_per_targets$pertype <- per_targets$pertype %>% mutate_all(list(~. * 2)) %>% mutate(geo_region = 1) new_per_targets$pertype
Run the IPU algorithm. Again, for vignette build time, only 30 iterations are
performed. Run the code yourself with max_iterations
set to 600 to see the
converged result.
result <- ipu( new_hh_seed, new_hh_targets, new_per_seed, new_per_targets, max_iterations = 30 )
The tables below show the results compared back to targets.
result$primary_comp result$secondary_comp
ipu
offers some advanced options for controlling the behavior. Weights can
be capped with min_ratio
and max_ratio
options. Secondary target importance
can be adjusted using the secondary_importance
option. For info on using
these options appropriately, see the common_ipf_problems
vignette.
The weight_tbl
from ipu()
can be fed into the synthesize()
function to
generate a synthetic population. The code block below takes the output from the
previous example to demonstrate. The group_by
option tells the function to
group by the geo_tract
field before random sampling takes places. This will
ensure that the number of sampled records in each tract equals the total weight
in each tract.
A new_id
is created for each synthetic member (usually a household), but the
old ID is preserved to facilitate joins or other operations where it could be
useful.
set.seed(42) synthesize(result$weight_tbl, group_by = "geo_tract") %>% head()
A tidyverse-style call also works, as shown below.
set.seed(42) result$weight_tbl %>% group_by(geo_tract) %>% synthesize() %>% head()
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.