View source: R/swc_get_mapping.R

swc_get_mapping    R Documentation

Compute a matching table between two lists of municipality IDs

Description

For two lists of Swiss municipality IDs at any two points in time, this function creates a data frame with two columns where each row represents a match between municipality IDs. This can be used as an intermediate table for merging two data sets with municipality identifiers taken at different, possibly unknown, points in time.

Usage

swc_get_mapping(ids_from, ids_to)

Arguments

ids_from

A list of "source" municipality IDs, preferably a factor

ids_to

A list of "target" municipality IDs, preferably a factor

Details

It is advisable to pass the municipality IDs as factors. That way, comparisons and merges of municipality IDs are automatically checked for compatibility.
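The compatibility check mentioned above can be illustrated with base R alone. The following is a minimal sketch with made-up IDs (not taken from any real data set), assuming that the check relies on R's standard behavior of refusing to compare factors whose level sets differ:

```r
# Hypothetical municipality IDs from two data sets (made up for illustration).
ids_a <- factor(c("1", "2", "3"))
ids_b <- factor(c("1", "2", "4"))

# Comparing factors with different level sets raises an error instead of
# silently returning a result, which flags the incompatibility early.
factor_result <- tryCatch(
  ids_a == ids_b,
  error = function(e) "error"
)

# The same comparison on plain character vectors succeeds without complaint,
# so the mismatch between "3" and "4" is easy to overlook.
character_result <- as.character(ids_a) == as.character(ids_b)
```

Here factor_result records that an error was raised, while character_result is an ordinary logical vector.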

Note that the "from" list must be from an earlier time than the "to" list. Trying to compute the mapping the other way round results in an error. This is intentional: because municipalities are usually merged, it makes sense to use the most recent data set as the target for the mapping. The target can also be a file with suitable geometries to allow for visualization.
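To see why the older-to-newer direction is convenient, consider a toy case in which two older municipalities were merged into one newer municipality. The sketch below uses only base R; all IDs, counts, and the names toy_mapping, old_data, id_from and id_to are hypothetical and are not produced by swc_get_mapping:

```r
# Hypothetical mapping: old municipalities "10" and "11" were merged into the
# newer municipality "20"; "12" was left unchanged (all IDs are made up).
toy_mapping <- data.frame(
  id_from = c("10", "11", "12"),
  id_to   = c("20", "20", "12")
)

# Hypothetical counts recorded under the older IDs.
old_data <- data.frame(
  id    = c("10", "11", "12"),
  count = c(100, 50, 70)
)

# Attach the newer ID to each old record, then sum within each newer ID.
joined <- merge(old_data, toy_mapping, by.x = "id", by.y = "id_from")
new_totals <- aggregate(count ~ id_to, data = joined, FUN = sum)
```

Mapping several older IDs onto one newer ID only requires an aggregation such as a sum. The reverse direction would require splitting one value across several municipalities, which the data alone cannot determine.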

For two lists of municipalities, we construct a mapping from the first list to the second. First, the most probable mutation number in the "municipality mutations" data set is computed.

Value

A data frame with columns prefixed by from. and to. that represents the computed match. The municipality IDs are stored in the columns from.mId and to.mId. The columns from.MergeType and to.MergeType contain "valid" if the municipality is contained in both the input and the mapping table, "missing" if the municipality is missing from the input, and "extra" if the municipality is in the input but not in the mapping table; most columns are NA for such rows. In addition, the column MergeType summarizes the "from" and "to" status: rows with values other than "valid" or "missing" should be examined.
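The status values can be inspected with ordinary data frame tools. The following sketch works on a made-up table rather than on real output of swc_get_mapping. It assumes the column names mId.from and MatchType that appear in the Examples section, and the IDs and statuses are purely illustrative:

```r
# Hypothetical mapping table; IDs and statuses are made up for illustration.
# Column names follow the Examples section and are an assumption here.
toy_mapping <- data.frame(
  mId.from  = c("1", "2", "3", "4"),
  MatchType = c("valid", "valid", "missing", "extra")
)

# Count rows per status.
status_counts <- table(toy_mapping$MatchType)

# Rows with a status other than "valid" or "missing" deserve a closer look.
to_examine <- subset(toy_mapping, !(MatchType %in% c("valid", "missing")))
```

In this toy table, only the row with the status "extra" ends up in to_examine.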

Examples

library(dplyr)
data(SwissPop)
data(SwissBirths)

# Show mismatch of municipality IDs:
ids_from <- with(SwissPop, MunicipalityID)
ids_to <- with(SwissBirths, MunicipalityID)
setdiff(ids_from, ids_to)
setdiff(ids_to, ids_from)

# Compute mapping and count non-matching municipality IDs:
mapping <- swc_get_mapping(ids_from = ids_from, ids_to = ids_to)
with(mapping, sum(mIdAsNumber.from != mIdAsNumber.to))

# Communes that are "missing" are mostly lakes and other special communes:
subset(mapping, MatchType == "missing")[, c("mIdAsNumber.from", "mShortName.from")]

# These should be looked at in some detail, and fixed manually:
subset(mapping, !(MatchType %in% c("valid", "missing")))

# Test for injectivity. The result shows that the mapping is almost injective:
# only one "from" commune is mapped to more than one "to" commune.
# This situation requires further examination.
mapping.dupes <- subset(mapping, duplicated(mIdAsNumber.from))
(noninjective.mapping <- subset(
  mapping, mIdAsNumber.from %in% mapping.dupes$mIdAsNumber.from
))

# Simple treatment (just for this example): Remove duplicates, and use only
# valid matches:
cleaned.mapping <- subset(
  mapping,
  !duplicated(mIdAsNumber.from) & MatchType == "valid"
)

# Now merge the two datasets based on the mapping table:
SwissPop.1970 <- subset(SwissPop, Year == "1970")
SwissPopMapping.1970 <- merge(SwissPop.1970,
  cleaned.mapping[, c("mId.from", "mId.to")],
  by.x = "MunicipalityID", by.y = "mId.from"
)

# Datasets from the "from" table must be suitably aggregated.  For the given
# case of population totals we use the sum.
SwissPopMapping.1970.agg <- group_by(
  SwissPopMapping.1970,
  mId.to,
  HouseholdSize
) %>%
  summarize(Households = sum(Households))
with(SwissPopMapping.1970.agg, stopifnot(
  length(unique(mId.to)) * length(levels(HouseholdSize)) ==
    length(mId.to)
))

# The aggregated "from" dataset can now be merged with the "to" dataset:
SwissBirths.1970 <- subset(SwissBirths, Year == "1970")
SwissPopBirths.1970 <- merge(SwissPopMapping.1970.agg, SwissBirths.1970,
  by.x = "mId.to", by.y = "MunicipalityID"
)

# Some more communes are still missing from the 1970 statistics, although
# the matches are valid:
subset(mapping, mIdAsNumber.to %in% setdiff(
  SwissPopMapping.1970.agg$mId.to, SwissBirths.1970$MunicipalityID
))[
  ,
  c("mId.from", "mShortName.from", "MatchType")
]

# The "from" list must be from an earlier time than the "to" list.
try(swc_get_mapping(ids_from = ids_to, ids_to = ids_from))

cynkra/munch documentation built on Dec. 15, 2024, 6:06 a.m.