knitr::opts_chunk$set(cache=FALSE,echo=TRUE,
                      message=TRUE,warning=TRUE,error=TRUE)
options(scipen=999) # removes sci notation

Introduction

This package provides a generic implimentation of the iterative proportional fitting algorithm or IPF in the ipf() function. It also provides an iterative proportional updating algorithm based on on the paper from Arizona State University (IPU) for balancing household- and person-level marginals in the ipu() function.

library(dplyr)
library(tidyr)
library(ipfr)

IPU

Iterative proportional updating is a method developed by Arizona State University that allows the IPF procedure to match household- and person-level marginals. In the basic IPF procedure, all marginal distributions must describe the same thing (e.g. households). IPU allows you to say, for example, that a zone needs a total household count of 500, but also needs 800 people.

Example 1: Simple Example

This example creates a random seed table and target values to illustrate how the package is used. The targets are specified for two separate geographies (geo_clusters). Any field name can be used as long as it:

This simple example only has one target marginal distribution, and could be solved directly without ipu. However, it is designed to show the basics needed to run the function.

Seed table creation

The seed table is the starting point for the IPF procedure. In this example, we make up some survey data.

hh_seed <- tribble(
  ~pid, ~siz, ~inc, ~weight, ~geo_taz,
  1,    1,    1,    12,       1,
  2,    1,    2,     3,       1,
  3,    2,    1,    6,       1,
  4,    2,    2,    5,       1
)

Target creation

The number of households by size (e.g., 1-person, 2-person, etc.) is referred to as a marginal distribution. Often, from the Census, we know the total number of households by each individual marginal. This information becomes the target that the IPU process tries to match.

Marginal targets are specified below for each taz:

hh_targets <- list()
hh_targets$siz <- tribble(
  ~geo_taz, ~`1`, ~`2`,
  1,        18,   12
)
hh_targets$inc <- tribble(
  ~geo_taz, ~`1`, ~`2`,
  1,        20,  10
)

hh_targets

Run IPU

result <- ipu(hh_seed, hh_targets)

ipu() returns a named list.

names(result)

The first element is the resulting weight table. It is the primary seed table with three columns added:

result$weight_tbl

The second element is a histogram of the weight_factor. This provides a quick overview of the distribution of weights.

result$weight_dist

The next element is a comparison back to the targets provided. With complex seed and target tables, this makes investigating results quick and easy.

result$primary_comp

If secondary targets are provided to ipu(), a fourth item in the list will contain a secondary_comp table.

In addition to making sure the marginal targets are matched, it is important to ensure that the underlying distribution of households still resembles the seed data. As an example, if your seed data says that most low-income households are also one-person households, that information should be preserved.

hh_seed %>%
  mutate(inc = paste0("inc", inc)) %>%
  filter(geo_taz == 1) %>%
  select(siz, inc, weight) %>%
  spread(inc, weight)

result$weight_tbl %>%
  mutate(inc = paste0("inc", inc)) %>%
  filter(geo_taz == 1) %>%
  select(siz, inc, weight) %>%
  spread(inc, weight)

Example 2: The Arizona Paper Example

In household survey expansion, it is common to want to control for certain features that describe households, (like size), while controlling for other attributes that describe people (like age). This is possible with the ipu() function.

This example is taken directly from the Arizona paper on page 20: http://www.scag.ca.gov/Documents/PopulationSynthesizerPaper_TRB.pdf

In this example, household type could represent size (e.g. 1-person and 2-person households). Person type could represent age groups (e.g. under 18, between 18 and 50, and over 50).

The code block below re-creates the seed and target tables for both persons and households.

hh_seed <- data_frame(
  geo_region = 1,
  pid = c(1:8),
  hhtype = c(1, 1, 1, 2, 2, 2, 2, 2)
)

per_seed <- data_frame(
  pid = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 7, 7, 7, 7, 7, 8, 8),
  pertype = c(1, 2, 3, 1, 3, 1, 1, 2, 1, 3, 3, 2, 2, 3, 1, 2, 1, 1, 2, 3, 3, 1, 2)
)

hh_targets <- list()
hh_targets$hhtype <- data_frame(
  geo_region = 1,
  `1` = 35,
  `2` = 65
)

per_targets <- list()
per_targets$pertype <- data_frame(
  geo_region = 1,
  `1` = 91,
  `2` = 65,
  `3` = 104
)

In the interst of keeping vignette build time short, the ipu() algorithm is only run for 30 iterations. After running for 400 or more iterations, the results match closely to those shown in the paper.

result <- ipu(hh_seed, hh_targets, per_seed, per_targets, max_iterations = 30)

The first table shows the result. The second table shows the primary comparison table. Since we added secondary seeds and targets, the output now contains a secondary comparison table. Feel free to run the code chunk above for 400 or more iterations and then look again.

result$weight_tbl %>%
  mutate(weight = round(weight, 2))

result$primary_comp %>% 
  mutate(result = round(result, 2))

result$secondary_comp %>% 
  mutate(result = round(result, 2))

Example 3: Using Multiple Geographies

ipu() allows different geographies to be specified for different marginal tables. There are a few rules that make this possible, but in short, the geo field on each target table tells the algorithm which scale to constrain to.

All of the following rules are checked by the algorithm a warning message will show if one is violated.

To demonstrate, the Arizona example from example 1 is modified to add two different clusters for household controls but to still control the person targets at the regional level.

# Modifying example 1 for example 2

# Repeat the hh_seed to create cluster 1 and 2 households
hh_seed <- hh_seed %>%
  rename(geo_cluster = geo_region)
hh_seed <- bind_rows(
  hh_seed,
  hh_seed %>% 
    mutate(geo_cluster = 2, pid = pid + 8)
)
hh_seed$geo_region = 1

hh_seed

# Repeat the household targets for two clusters
hh_targets$hhtype <- bind_rows(hh_targets$hhtype, hh_targets$hhtype)
hh_targets$hhtype <- hh_targets$hhtype %>%
  rename(geo_cluster = geo_region) %>%
  mutate(geo_cluster = c(1, 2))

hh_targets$hhtype

# Repeat the per_seed to create cluster 1 and 2 persons
per_seed <- bind_rows(
  per_seed,
  per_seed %>% 
    mutate(pid = pid + 8)
)

per_seed %>%
  head()

# Double the regional person targets
per_targets$pertype <- per_targets$pertype %>%
  mutate_at(
    .vars = vars("1", "2", "3"),
    .funs = funs(. * 2)
  )

per_targets$pertype

Run the IPU algorithm. Again, for vignette build time, only 30 iterations are performed. Run the code yourself with max_iterations set to 600 to see the converged result.

result <- ipu(hh_seed, hh_targets, per_seed, per_targets, max_iterations = 30)

The tables below show the results compared back to targets. More iterations would make a better match.

result$primary_comp %>%
  mutate(result = round(result, 2))

result$secondary_comp %>%
  mutate(result = round(result, 2))

How IPU addresses common IPF problems

This section will show how ipu() addresses some common problems found in basic ipf procedures. It uses the example data from the first example.

Zero weights

IPF works by successively multiplying the table weights by factors. Cells with a zero weight cannot be modified by this process. As the number of zero weights increase, the flexibility of the process is reduced, and convergence becomes more difficult. ipfr solves this problem by setting a minimum weight for all cells to .0001. This minimum weight can be adjusted using the min_weight parameter and should be arbitrarily small compared to your seed table weights.

Missing seed information

Not every combination of marginal categories is required to be included in the seed table; however, at least one observation of each category must exist. For example, the combination:

may not have been observed in the survey, and thus may be missing from the seed table. As long as other combinations of size-1 households exist (e.g. with 0 workers and 1 vehicle), ipfr will work fine. On the other hand, if there are no observations of any size-1 households, ipfr will stop with an error message.

See the first IPU example to see how it works.

Target agreement

ipfr handles two separate issues concerning marginal agreement:

Agreement within Primary or Secondary Targets

A basic implementation of iterative proportional fitting requires that all targets agree on the total. For example, if the households by size target table has a total of 100 households, but the households by income table has a total of 120, both cannot be satisfied.

ipfr handles this by scaling all tables in the same target list (either primary or secondary) to match the total of the first table.

In the example below, the size marginal sums to a total of 100 households. The vehicle marginal sums to 300. With the verbose option set to TRUE, a message will be displayed telling which, if any, target tables are scaled.

hh_seed <- data_frame(
  geo_region = 1,
  pid = c(1:8),
  hhsiz = c(1, 1, 1, 2, 2, 2, 2, 2),
  hhveh = c(0, 2, 1, 1, 1, 2, 1, 0)
)

hh_targets <- list()
hh_targets$hhsiz <- data_frame(
  geo_region = 1,
  `1` = 35,
  `2` = 65
)
hh_targets$hhveh <- data_frame(
  geo_region = 1,
  `0` = 100,
  `1` = 100,
  `2` = 100
)

result <- ipu(hh_seed, hh_targets, max_iterations = 30, verbose = TRUE)

Importantly, the performance measures below compare the result to the scaled target not the original. Note that the vehicle targets have been scaled down.

result$primary_comp %>%
  mutate_at(
    .vars = vars(target, result),
    .funs = funs(round(., 2))
  )

Balance Between Primary and Secondary Targets

In population synthesis or survey expansion, adding a secondary set of person- level targets can lead to a different issue: target balance. Naturally, the total number of households and the total number of persons will be very different. A balance issue arises when the average weight for household records and person records are very different.

In the Arizona example, note that the average weights for household and person records are similar.

hh_seed <- data_frame(
  geo_region = 1,
  pid = c(1:8),
  hhtype = c(1, 1, 1, 2, 2, 2, 2, 2)
)

per_seed <- data_frame(
  pid = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 7, 7, 7, 7, 7, 8, 8),
  pertype = c(1, 2, 3, 1, 3, 1, 1, 2, 1, 3, 3, 2, 2, 3, 1, 2, 1, 1, 2, 3, 3, 1, 2)
)

hh_targets <- list()
hh_targets$hhtype <- data_frame(
  geo_region = 1,
  `1` = 35,
  `2` = 65
)

per_targets <- list()
per_targets$pertype <- data_frame(
  geo_region = 1,
  `1` = 91,
  `2` = 65,
  `3` = 104
)
avg_hh_weight <- (rowSums(hh_targets$hhtype) - 1) / nrow(hh_seed)
avg_per_weight <- (rowSums(per_targets$pertype) - 1) / nrow(per_seed)

In real applications, this is often not true. The example below demonstrates the consequences by modifying the Arizona to double the person targets.

per_targets$pertype <- per_targets$pertype %>%
  mutate_at(
    .vars = vars(`1`, `2`, `3`),
    .funs = funs(. * 2)
  )

result <- ipu(hh_seed, hh_targets, per_seed, per_targets, max_iterations = 30)

The resulting weights tend towards the extreme as the algorithm attempts to match unbalanced primary and secondary targets. In effect, the algorithm is making a large shift to the basic persons-per-household metric found in the seed table. Households with mutiple people get large weights, while households with a single person get small weights.

result$weight_dist

ipu can fix the underlying problem using the secondary_importance argument. It is 1 by default, which means the algorithm will attempt to match the absolute values of the secondary targets (as above). As this value is decreased to 0, the secondary targets are scaled to match the average weight of the primary targets.

The examples below set secondary_importance to 0.80, 0.20, and 0.00 to show the effect on results. With each decrease in importance, the match to person targets gets worse, but weight extremes are reduced.

result <- ipu(hh_seed, hh_targets, per_seed, per_targets, max_iterations = 30,
              secondary_importance = .80)

result
result <- ipu(hh_seed, hh_targets, per_seed, per_targets, max_iterations = 30,
              secondary_importance = .20)

result
result <- ipu(hh_seed, hh_targets, per_seed, per_targets, max_iterations = 30,
              secondary_importance = 0)

result

Extreme Weights

Often, it is preferable to constrain weights so that certain, under-sampled observations to do not end up with extreme weights. ipu() supports this by using the min_ratio and max_ratio variables.

First, the average weight is calculated per geography based on the total of the target tables divided by the number of records in the seed table. Then, the max and min factors set a cap and floor based on a multiple of that average.

Common values to use are:

However, care should be taken when moving these variables from their default values. These variables impose another constraint on the algorithm and increase the chance of failure. In the example below, very strict values are used with the same seed and target data from IPU Example 1.

Values of 1.2 and .8 mean that all weights must be within 20% of the average weight.

hh_seed <- data_frame(
  pid = c(1, 2, 3, 4),
  siz = c(1, 2, 2, 1),
  weight = c(1, 1, 1, 1),
  geo_cluster = c(1, 1, 2, 2)
)

hh_targets <- list()
hh_targets$siz <- data_frame(
  geo_cluster = c(1, 2),
  `1` = c(75, 100),
  `2` = c(25, 150)
)

result <- ipu(hh_seed, hh_targets, max_iterations = 10,
              max_ratio = 1.2, min_ratio = .8)

Consider the effect on geo_cluster 1. With a total target of 100 households and two records in the seed table, the average weight is 50. This means that the weights must be between 40 and 60. The algorithm does not have enough flexibility to meet the controls.

result$primary_comp

A second problem can arrise from capping weights based on the average weight. In the example below, I change the targets so that, for geo_cluster 1, they are very unbalanced. Cluster 1 now has 100,000 1-person households but only 5 2-person households.

hh_targets <- list()
hh_targets$siz <- data_frame(
  geo_cluster = c(1, 2),
  `1` = c(100000, 100),
  `2` = c(5, 150)
)

result <- ipu(hh_seed, hh_targets, max_iterations = 10,
              max_ratio = 5, min_ratio = .2)

result$primary_comp

Even with reasonable values for the weight caps, the minimum allowable weight is much higher than 5. This is an extreme example, and is unlikely to be an issue in applications related to housing and population - the targets are generally on the same scale. However, when expanding a through-trip table, it is common to have some external stations with large targets and others with small. In these cases, it is advisable to leave the scale arguments at their default values.

IPU_NR

The function ipu_nr only differs from ipu in one significant way: the method used to balance primary and secondary targets.

hh_seed <- data_frame(
  geo_region = 1,
  pid = c(1:8),
  hhtype = c(1, 1, 1, 2, 2, 2, 2, 2)
)

per_seed <- data_frame(
  pid = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 7, 7, 7, 7, 7, 8, 8),
  pertype = c(1, 2, 3, 1, 3, 1, 1, 2, 1, 3, 3, 2, 2, 3, 1, 2, 1, 1, 2, 3, 3, 1, 2)
)

hh_targets <- list()
hh_targets$hhtype <- data_frame(
  geo_region = 1,
  `1` = 35,
  `2` = 65
)

per_targets <- list()
per_targets$pertype <- data_frame(
  geo_region = 1,
  `1` = 91,
  `2` = 65,
  `3` = 104
)

As in the more detailed ipu example above, we modify the Arizona example (which is balanced) to double the person targets. This creates a significant imbalance that standard approahces struggle with.

per_targets$pertype <- per_targets$pertype %>%
  mutate_at(
    .vars = vars(`1`, `2`, `3`),
    .funs = funs(. * 2)
  )

While ipu balances the secondary targets directly using secondary_importance, ipu_nr uses an iterative approach and the target_priority argument.

By default, all target tables have an equally high priority, which means that the algorithm will attempt to match all targets exactly. However, target_priority can be modified in several ways. In the code below, a data frame is used to assign the hhtype target a higher priority. (If using a data frame, the column names must be target and priority.) A simple named list can also be used (both options shown below).

# Option 1: a data frame
target_priority <- data_frame(
  target = c("hhtype", "pertype"),
  priority = c(10000, 10)
)

# Options 2: use a named list
target_priority <- list()
target_priority$hhtype <- 10000
target_priority$pertype <- 10

result <- ipu_nr(hh_seed, hh_targets, per_seed, per_targets, max_iterations = 30,
              target_priority = target_priority)

As ipu_nr runs, it relaxes the target constraints on pertype much faster than on hhtype. As a result, the final weights will match the household type much closer. The two methods generally match targets to the same degree, but often lead to very different distributions of weight ratios. In addition, ipu tends to reach convergence levels around .1 %RMSE faster than ipu_nr, but for levels below that, ipu_nr tends to be faster.

result


pbsag/ipfr documentation built on May 24, 2019, 10:38 p.m.