knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
suppressPackageStartupMessages({ library(ipfr) library(dplyr) })
This vignette demonstrates how ipu()
addresses common problems found in basic
ipf approaches.
IPF works by successively multiplying table/matrix weights by factors. Cells
with a zero weight cannot be modified by this process and always remain at 0.
As the number of zero weights increase, the flexibility of the process is
reduced, and convergence becomes more difficult. ipfr
solves this problem by
setting a minimum weight for all cells to .0001
. This minimum weight can be
adjusted using the min_weight
parameter and should be arbitrarily small
compared to your seed table weights.
Not every combination of marginal categories is required to be included in the seed table; however, at least one observation of each category must exist. For example, the combination:
may not have been observed in the survey, and thus may be missing from the seed
table. As long as other combinations of size-1 households exist (e.g. with 0
workers and 1 vehicle), ipfr
will work fine. On the other hand, if there are
no observations of any size-1 households, ipfr
will stop with an error
message.
ipfr
handles two separate issues concerning marginal agreement:
A basic implementation of iterative proportional fitting requires that all targets agree on the total. For example, if the households by size target table has a total of 100 households, but the households by income table has a total of 120, both cannot be satisfied. The process will not converge.
ipfr
handles this by scaling all tables in the same target list (e.g.
primary_targets
) to match the total of the first table.
In the example below, the size marginal sums to a total of 100 households.
The vehicle marginal sums to 300. With the verbose
option set to TRUE
,
a message will be displayed telling which, if any, target tables are scaled.
hh_seed <- tibble( geo_region = 1, id = c(1:8), hhsiz = c(1, 1, 1, 2, 2, 2, 2, 2), hhveh = c(0, 2, 1, 1, 1, 2, 1, 0) ) hh_targets <- list() hh_targets$hhsiz <- tibble( geo_region = 1, `1` = 35, `2` = 65 ) hh_targets$hhveh <- tibble( geo_region = 1, `0` = 100, `1` = 100, `2` = 100 ) result <- ipu(hh_seed, hh_targets, max_iterations = 30, verbose = TRUE)
Importantly, the performance measures below compare the result to the scaled target not the original. Note that the vehicle targets have been scaled down.
result$primary_comp
In population synthesis or survey expansion, adding a secondary set of person-level targets can lead to a different issue: target balance. Naturally, the total number of households and the total number of persons will be very different. A balance issue only arises when the average weight for household records and person records are very different. That is, when the total of household targets divided by household records is very different from the total of person targets divided by person records.
This example is taken directly from the Arizona paper on page 20: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.537.723&rep=rep1&type=pdf
In this example, household type could represent size (e.g. 1-person and 2-person households). Person type could represent age groups (e.g. under 18, between 18 and 50, and over 50).
setup_arizona()
creates the seed and target tables used in the example.
result <- setup_arizona() hh_seed <- result$hh_seed hh_targets <- result$hh_targets per_seed <- result$per_seed per_targets <- result$per_targets avg_hh_weight <- (rowSums(hh_targets$hhtype) - 1) / nrow(hh_seed) avg_per_weight <- (rowSums(per_targets$pertype) - 1) / nrow(per_seed)
Note that the average weights are similar.
r avg_hh_weight
r round(avg_per_weight, 2)
In real applications, this is often not true. The example below demonstrates the consequences by modifying the Arizona to double the person targets.
new_per_targets <- per_targets new_per_targets$pertype <- per_targets$pertype %>% mutate_at( .vars = vars(`1`, `2`, `3`), .funs = list(~. * 2) ) result <- ipu(hh_seed, hh_targets, per_seed, new_per_targets, max_iterations = 30)
The resulting weights tend towards the extreme as the algorithm attempts to match unbalanced primary and secondary targets. In effect, the algorithm is making a large shift to the basic persons-per-household metric found in the household seed. Households with multiple people get large weights, while households with a one person get small weights.
result$weight_dist
ipu
can fix the underlying problem using the secondary_importance
argument.
It is 1
by default, which means the algorithm will attempt to match the
absolute values of the secondary targets (as above). As this value is decreased
to 0, the secondary targets are scaled to match the average weight of the
primary targets.
The examples below set secondary_importance
to 0.80
and 0.20
to
show the effect on results. As secondary importance decreases, the match to
person targets gets worse; however, the relative distribution of persons
still match closely. The distribution of weights also improves.
Note: for package build time, max iterations is capped at 30. While the impact of the factor can still be seen, consider running for 100 iterations and comparing the results.
result_80 <- ipu( hh_seed, hh_targets, per_seed, new_per_targets, max_iterations = 30, secondary_importance = .80 ) result_80$weight_dist result_80$primary_comp result_80$secondary_comp
result_20 <- ipu( hh_seed, hh_targets, per_seed, new_per_targets, max_iterations = 30, secondary_importance = .20 ) result_20$weight_dist result_20$primary_comp result_20$secondary_comp
Often, it is preferable to constrain weights so that certain, under-sampled
observations to do not end up with extreme weights. ipu()
supports this by
using the min_ratio
and max_ratio
variables. The easiest way to see the
effect of these variables is in the weight_dist
histogram. No columns will
appear outside the min/max ratios.
Common values to use are:
Note: weight ratios are calculated by geography.
Care should be taken when moving these variables from their default values. These variables impose another constraint on the algorithm and increase run time and the chance of failure. In the example below, strict ratio values of 1.2 and .8 mean that all weights must be within 20% of the average weight.
hh_seed <- tibble( id = c(1, 2, 3, 4), siz = c(1, 2, 2, 1), weight = c(1, 1, 1, 1), geo_cluster = c(1, 1, 2, 2) ) hh_targets <- list() hh_targets$siz <- tibble( geo_cluster = c(1, 2), `1` = c(75, 100), `2` = c(25, 150) ) result <- ipu(hh_seed, hh_targets, max_iterations = 10, max_ratio = 1.2, min_ratio = .8)
Consider the effect on geo_cluster 1. With a total target of 100 households and
two records in the seed table, the average weight is 50. This means that the
final weights are constrained between 40 and 60 by the min_ratio
and
max_ratio
. The weight distribution histogram confirms that the caps were
respected.
result$weight_dist
However, note that all weights are set to either the maximum or minimum possible. The algorithm does not have enough flexibility to meet the controls, which is shown by looking at the comparison table.
result$primary_comp
A second problem can arise when using these ratios. In the example below, I
change the targets so that, for geo_cluster 1, they are very unbalanced. Cluster
1 now has 100,000 1-person households but only 10 2-person households. This
means the average weight for that cluster will be r format((100000 + 10) / 2,
digits = 2, big.mark = ",")
and the minimum weight will be r format((100000 +
10) / 2 * .2, digits = 2, big.mark = ",")
. The minimum weight is larger
2-person target of 10.
hh_targets <- list() hh_targets$siz <- tibble( geo_cluster = c(1, 2), `1` = c(100000, 100), `2` = c(10, 150) ) result <- ipu(hh_seed, hh_targets, max_iterations = 10, max_ratio = 5, min_ratio = .2) result$weight_tbl result$primary_comp
This is an extreme example, and is unlikely to be an issue in applications related to housing and population. In these applications, the targets are generally on the same order of magnitude. In other applications, like expanding a through-trip table to traffic counts, it is more common to have some external stations with large targets (freeways) and others with small (arterials). In these cases, it is advisable to leave the scale arguments at their default values.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.