knitr::opts_chunk$set(cache=FALSE,echo=TRUE, message=TRUE,warning=TRUE,error=TRUE) options(scipen=999) # removes sci notation
This package provides a generic implimentation of the iterative proportional
fitting algorithm or
IPF in the
ipf()
function. It also provides an iterative proportional updating algorithm
based on on the paper from Arizona State University
(IPU) for
balancing household- and person-level marginals in the ipu()
function.
library(dplyr) library(tidyr) library(ipfr)
Iterative proportional updating is a method developed by Arizona State University that allows the IPF procedure to match household- and person-level marginals. In the basic IPF procedure, all marginal distributions must describe the same thing (e.g. households). IPU allows you to say, for example, that a zone needs a total household count of 500, but also needs 800 people.
This example creates a random seed table and target values to illustrate how the
package is used. The targets are specified for two separate geographies
(geo_clusters
). Any field name can be used as long as it:
This simple example only has one target marginal distribution, and could be
solved directly without ipu
. However, it is designed to show the basics needed
to run the function.
The seed table is the starting point for the IPF procedure. In this example, we make up some survey data.
pid
("primary ID")geo_taz
)hh_seed <- tribble( ~pid, ~siz, ~inc, ~weight, ~geo_taz, 1, 1, 1, 12, 1, 2, 1, 2, 3, 1, 3, 2, 1, 6, 1, 4, 2, 2, 5, 1 )
The number of households by size (e.g., 1-person, 2-person, etc.) is referred to as a marginal distribution. Often, from the Census, we know the total number of households by each individual marginal. This information becomes the target that the IPU process tries to match.
Marginal targets are specified below for each taz:
geo_taz
matches the seed table.siz
and inc
columns of the seed.siz
and inc
columns.hh_targets <- list() hh_targets$siz <- tribble( ~geo_taz, ~`1`, ~`2`, 1, 18, 12 ) hh_targets$inc <- tribble( ~geo_taz, ~`1`, ~`2`, 1, 20, 10 ) hh_targets
result <- ipu(hh_seed, hh_targets)
ipu()
returns a named list.
names(result)
The first element is the resulting weight table. It is the primary seed table with three columns added:
result$weight_tbl
The second element is a histogram of the weight_factor
. This provides a quick
overview of the distribution of weights.
result$weight_dist
The next element is a comparison back to the targets provided. With complex seed and target tables, this makes investigating results quick and easy.
result$primary_comp
If secondary targets are provided to ipu()
, a fourth item in the list will
contain a secondary_comp
table.
In addition to making sure the marginal targets are matched, it is important to ensure that the underlying distribution of households still resembles the seed data. As an example, if your seed data says that most low-income households are also one-person households, that information should be preserved.
hh_seed %>% mutate(inc = paste0("inc", inc)) %>% filter(geo_taz == 1) %>% select(siz, inc, weight) %>% spread(inc, weight) result$weight_tbl %>% mutate(inc = paste0("inc", inc)) %>% filter(geo_taz == 1) %>% select(siz, inc, weight) %>% spread(inc, weight)
In household survey expansion, it is common to want to control for certain
features that describe households, (like size), while controlling for other
attributes that describe people (like age). This is possible with the ipu()
function.
This example is taken directly from the Arizona paper on page 20: http://www.scag.ca.gov/Documents/PopulationSynthesizerPaper_TRB.pdf
In this example, household type could represent size (e.g. 1-person and 2-person households). Person type could represent age groups (e.g. under 18, between 18 and 50, and over 50).
The code block below re-creates the seed and target tables for both persons and households.
geo_region
)pid
fieldpid
field in the persons seed table links to the household seedhh_seed <- data_frame( geo_region = 1, pid = c(1:8), hhtype = c(1, 1, 1, 2, 2, 2, 2, 2) ) per_seed <- data_frame( pid = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 7, 7, 7, 7, 7, 8, 8), pertype = c(1, 2, 3, 1, 3, 1, 1, 2, 1, 3, 3, 2, 2, 3, 1, 2, 1, 1, 2, 3, 3, 1, 2) ) hh_targets <- list() hh_targets$hhtype <- data_frame( geo_region = 1, `1` = 35, `2` = 65 ) per_targets <- list() per_targets$pertype <- data_frame( geo_region = 1, `1` = 91, `2` = 65, `3` = 104 )
In the interst of keeping vignette build time short, the ipu()
algorithm is
only run for 30 iterations. After running for 400 or more iterations, the
results match closely to those shown in the paper.
primary_seed
primary_target
secondary_seed
secondary_target
result <- ipu(hh_seed, hh_targets, per_seed, per_targets, max_iterations = 30)
The first table shows the result. The second table shows the primary comparison table. Since we added secondary seeds and targets, the output now contains a secondary comparison table. Feel free to run the code chunk above for 400 or more iterations and then look again.
result$weight_tbl %>% mutate(weight = round(weight, 2)) result$primary_comp %>% mutate(result = round(result, 2)) result$secondary_comp %>% mutate(result = round(result, 2))
ipu()
allows different geographies to be specified for different marginal
tables. There are a few rules that make this possible, but in short, the geo
field on each target table tells the algorithm which scale to constrain to.
All of the following rules are checked by the algorithm a warning message will show if one is violated.
pid
field.To demonstrate, the Arizona example from example 1 is modified to add two different clusters for household controls but to still control the person targets at the regional level.
# Modifying example 1 for example 2 # Repeat the hh_seed to create cluster 1 and 2 households hh_seed <- hh_seed %>% rename(geo_cluster = geo_region) hh_seed <- bind_rows( hh_seed, hh_seed %>% mutate(geo_cluster = 2, pid = pid + 8) ) hh_seed$geo_region = 1 hh_seed # Repeat the household targets for two clusters hh_targets$hhtype <- bind_rows(hh_targets$hhtype, hh_targets$hhtype) hh_targets$hhtype <- hh_targets$hhtype %>% rename(geo_cluster = geo_region) %>% mutate(geo_cluster = c(1, 2)) hh_targets$hhtype # Repeat the per_seed to create cluster 1 and 2 persons per_seed <- bind_rows( per_seed, per_seed %>% mutate(pid = pid + 8) ) per_seed %>% head() # Double the regional person targets per_targets$pertype <- per_targets$pertype %>% mutate_at( .vars = vars("1", "2", "3"), .funs = funs(. * 2) ) per_targets$pertype
Run the IPU algorithm. Again, for vignette build time, only 30 iterations are
performed. Run the code yourself with max_iterations
set to 600 to see the
converged result.
result <- ipu(hh_seed, hh_targets, per_seed, per_targets, max_iterations = 30)
The tables below show the results compared back to targets. More iterations would make a better match.
result$primary_comp %>% mutate(result = round(result, 2)) result$secondary_comp %>% mutate(result = round(result, 2))
This section will show how ipu()
addresses some common problems found in basic
ipf procedures. It uses the example data from the first example.
IPF works by successively multiplying the table weights by factors. Cells with
a zero weight cannot be modified by this process. As the number of zero weights
increase, the flexibility of the process is reduced, and convergence becomes
more difficult. ipfr
solves this problem by setting a minimum weight for all
cells to .0001
. This minimum weight can be adjusted using the min_weight
parameter and should be arbitrarily small compared to your seed table weights.
Not every combination of marginal categories is required to be included in the seed table; however, at least one observation of each category must exist. For example, the combination:
may not have been observed in the survey, and thus may be missing from the seed
table. As long as other combinations of size-1 households exist (e.g. with 0 workers and 1 vehicle), ipfr
will work fine. On the other hand, if there are no observations of any size-1 households, ipfr
will stop with an error message.
See the first IPU example to see how it works.
ipfr
handles two separate issues concerning marginal agreement:
A basic implementation of iterative proportional fitting requires that all targets agree on the total. For example, if the households by size target table has a total of 100 households, but the households by income table has a total of 120, both cannot be satisfied.
ipfr
handles this by scaling all tables in the same target list (either
primary or secondary) to match the total of the first table.
In the example below, the size marginal sums to a total of 100 households.
The vehicle marginal sums to 300. With the verbose
option set to TRUE
,
a message will be displayed telling which, if any, target tables are scaled.
hh_seed <- data_frame( geo_region = 1, pid = c(1:8), hhsiz = c(1, 1, 1, 2, 2, 2, 2, 2), hhveh = c(0, 2, 1, 1, 1, 2, 1, 0) ) hh_targets <- list() hh_targets$hhsiz <- data_frame( geo_region = 1, `1` = 35, `2` = 65 ) hh_targets$hhveh <- data_frame( geo_region = 1, `0` = 100, `1` = 100, `2` = 100 ) result <- ipu(hh_seed, hh_targets, max_iterations = 30, verbose = TRUE)
Importantly, the performance measures below compare the result to the scaled target not the original. Note that the vehicle targets have been scaled down.
result$primary_comp %>% mutate_at( .vars = vars(target, result), .funs = funs(round(., 2)) )
In population synthesis or survey expansion, adding a secondary set of person- level targets can lead to a different issue: target balance. Naturally, the total number of households and the total number of persons will be very different. A balance issue arises when the average weight for household records and person records are very different.
In the Arizona example, note that the average weights for household and person records are similar.
hh_seed <- data_frame( geo_region = 1, pid = c(1:8), hhtype = c(1, 1, 1, 2, 2, 2, 2, 2) ) per_seed <- data_frame( pid = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 7, 7, 7, 7, 7, 8, 8), pertype = c(1, 2, 3, 1, 3, 1, 1, 2, 1, 3, 3, 2, 2, 3, 1, 2, 1, 1, 2, 3, 3, 1, 2) ) hh_targets <- list() hh_targets$hhtype <- data_frame( geo_region = 1, `1` = 35, `2` = 65 ) per_targets <- list() per_targets$pertype <- data_frame( geo_region = 1, `1` = 91, `2` = 65, `3` = 104 )
avg_hh_weight <- (rowSums(hh_targets$hhtype) - 1) / nrow(hh_seed) avg_per_weight <- (rowSums(per_targets$pertype) - 1) / nrow(per_seed)
r avg_hh_weight
r round(avg_per_weight, 2)
In real applications, this is often not true. The example below demonstrates the consequences by modifying the Arizona to double the person targets.
per_targets$pertype <- per_targets$pertype %>% mutate_at( .vars = vars(`1`, `2`, `3`), .funs = funs(. * 2) ) result <- ipu(hh_seed, hh_targets, per_seed, per_targets, max_iterations = 30)
The resulting weights tend towards the extreme as the algorithm attempts to match unbalanced primary and secondary targets. In effect, the algorithm is making a large shift to the basic persons-per-household metric found in the seed table. Households with mutiple people get large weights, while households with a single person get small weights.
result$weight_dist
ipu
can fix the underlying problem using the secondary_importance
argument.
It is 1
by default, which means the algorithm will attempt to match the
absolute values of the secondary targets (as above). As this value is decreased
to 0, the secondary targets are scaled to match the average weight of the
primary targets.
The examples below set secondary_importance
to 0.80
, 0.20
, and 0.00
to
show the effect on results. With each decrease in importance, the match to
person targets gets worse, but weight extremes are reduced.
result <- ipu(hh_seed, hh_targets, per_seed, per_targets, max_iterations = 30, secondary_importance = .80) result
result <- ipu(hh_seed, hh_targets, per_seed, per_targets, max_iterations = 30, secondary_importance = .20) result
result <- ipu(hh_seed, hh_targets, per_seed, per_targets, max_iterations = 30, secondary_importance = 0) result
Often, it is preferable to constrain weights so that certain, under-sampled
observations to do not end up with extreme weights. ipu()
supports this by
using the min_ratio
and max_ratio
variables.
First, the average weight is calculated per geography based on the total of the target tables divided by the number of records in the seed table. Then, the max and min factors set a cap and floor based on a multiple of that average.
Common values to use are:
However, care should be taken when moving these variables from their default values. These variables impose another constraint on the algorithm and increase the chance of failure. In the example below, very strict values are used with the same seed and target data from IPU Example 1.
Values of 1.2 and .8 mean that all weights must be within 20% of the average weight.
hh_seed <- data_frame( pid = c(1, 2, 3, 4), siz = c(1, 2, 2, 1), weight = c(1, 1, 1, 1), geo_cluster = c(1, 1, 2, 2) ) hh_targets <- list() hh_targets$siz <- data_frame( geo_cluster = c(1, 2), `1` = c(75, 100), `2` = c(25, 150) ) result <- ipu(hh_seed, hh_targets, max_iterations = 10, max_ratio = 1.2, min_ratio = .8)
Consider the effect on geo_cluster 1. With a total target of 100 households and two records in the seed table, the average weight is 50. This means that the weights must be between 40 and 60. The algorithm does not have enough flexibility to meet the controls.
result$primary_comp
A second problem can arrise from capping weights based on the average weight. In the example below, I change the targets so that, for geo_cluster 1, they are very unbalanced. Cluster 1 now has 100,000 1-person households but only 5 2-person households.
hh_targets <- list() hh_targets$siz <- data_frame( geo_cluster = c(1, 2), `1` = c(100000, 100), `2` = c(5, 150) ) result <- ipu(hh_seed, hh_targets, max_iterations = 10, max_ratio = 5, min_ratio = .2) result$primary_comp
Even with reasonable values for the weight caps, the minimum allowable weight is much higher than 5. This is an extreme example, and is unlikely to be an issue in applications related to housing and population - the targets are generally on the same scale. However, when expanding a through-trip table, it is common to have some external stations with large targets and others with small. In these cases, it is advisable to leave the scale arguments at their default values.
The function ipu_nr
only differs from ipu
in one significant way: the method
used to balance primary and secondary targets.
hh_seed <- data_frame( geo_region = 1, pid = c(1:8), hhtype = c(1, 1, 1, 2, 2, 2, 2, 2) ) per_seed <- data_frame( pid = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 7, 7, 7, 7, 7, 8, 8), pertype = c(1, 2, 3, 1, 3, 1, 1, 2, 1, 3, 3, 2, 2, 3, 1, 2, 1, 1, 2, 3, 3, 1, 2) ) hh_targets <- list() hh_targets$hhtype <- data_frame( geo_region = 1, `1` = 35, `2` = 65 ) per_targets <- list() per_targets$pertype <- data_frame( geo_region = 1, `1` = 91, `2` = 65, `3` = 104 )
As in the more detailed ipu
example above, we modify the Arizona example
(which is balanced) to double the person targets. This creates a significant
imbalance that standard approahces struggle with.
per_targets$pertype <- per_targets$pertype %>% mutate_at( .vars = vars(`1`, `2`, `3`), .funs = funs(. * 2) )
While ipu
balances the secondary targets directly using secondary_importance
,
ipu_nr
uses an iterative approach and the target_priority
argument.
By default, all target tables have an equally high priority, which means that the
algorithm will attempt to match all targets exactly. However, target_priority
can be modified in several ways. In the code below, a data frame is used to
assign the hhtype
target a higher priority. (If using a data frame, the column
names must be target
and priority
.) A simple named list can also be used
(both options shown below).
# Option 1: a data frame target_priority <- data_frame( target = c("hhtype", "pertype"), priority = c(10000, 10) ) # Options 2: use a named list target_priority <- list() target_priority$hhtype <- 10000 target_priority$pertype <- 10 result <- ipu_nr(hh_seed, hh_targets, per_seed, per_targets, max_iterations = 30, target_priority = target_priority)
As ipu_nr
runs, it relaxes the target constraints on pertype
much faster than
on hhtype
. As a result, the final weights will match the household type much
closer. The two methods generally match targets to the same degree, but often
lead to very different distributions of weight ratios. In addition, ipu
tends
to reach convergence levels around .1 %RMSE faster than ipu_nr
, but for levels
below that, ipu_nr
tends to be faster.
result
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.