quickmatch: Derive generalized full matchings

quickmatch R Documentation

Derive generalized full matchings

Description

quickmatch constructs near-optimal generalized full matchings. The function expects the user to provide distances measuring the similarity of units and a set of matching constraints. It then constructs a matching so that units assigned to the same group are as similar as possible while satisfying the matching constraints.

Usage

quickmatch(
  distances,
  treatments,
  treatment_constraints = NULL,
  size_constraint = NULL,
  target = NULL,
  caliper = NULL,
  ...
)

Arguments

distances

distances object or a numeric vector, matrix or data frame. The parameter describes the similarity of the units to be matched. It can either be preprocessed distance information using a distances object, or raw covariate data. When called with covariate data, Euclidean distances are calculated unless otherwise specified.

treatments

factor specifying the units' treatment assignments.

treatment_constraints

named integer vector with the treatment constraints. If NULL, the function ensures that each matched group contains at least one unit from each treatment condition.

size_constraint

integer with the required total number of units in each group. Must be greater than or equal to the sum of treatment_constraints. If NULL, no constraints other than the treatment constraints are imposed.

target

units to target the matching for. All units indicated by target are guaranteed to be assigned to a matched group (apart from the effect of any caliper setting). Units not indicated by target could be left unassigned if they are not necessary to satisfy the matching constraints. If NULL, quickmatch targets the complete sample and ensures that all units are assigned to a group. If target is a logical vector with the same length as the sample size, units indicated with TRUE will be targeted. If target is an integer vector, the units with indices in target are targeted. Indices start at 1 and target must be sorted. If target is a character vector, it should contain treatment labels, and the corresponding units (as given by treatments) will be targeted.

caliper

restrict the maximum within-group distance.

...

additional parameters to be sent either to the distances function when the distances parameter contains covariate data, or to the underlying sc_clustering function.

Details

The treatment_constraints parameter should be a named vector with treatment-specific constraints. For example, in a sample with treatment conditions "A", "B" and "C", the vector c("A" = 1, "B" = 2, "C" = 0) specifies that each matched group should contain at least one unit with treatment "A", at least two units with treatment "B" and any number of units with treatment "C". Treatments not specified in the vector default to zero. For example, the vector c("A" = 1, "B" = 2) is identical to the previous one. When treatment_constraints is NULL, the function requires at least one unit for each treatment in each group. In our current example, NULL would be shorthand for c("A" = 1, "B" = 1, "C" = 1).
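As a minimal sketch of these constraints, the following simulates a sample with conditions "A", "B" and "C" and tabulates the composition of each matched group. The simulated data, the seed, and the names covs, treat and counts are illustrative assumptions, not part of the package. The sketch also assumes that the returned qm_matching object can be coerced to an integer vector of group labels with as.integer.

```r
library(quickmatch)

# Hypothetical example data: 30 units in each of three conditions
set.seed(1)
n <- 90
covs <- data.frame(x1 = runif(n), x2 = runif(n))
treat <- factor(sample(rep(c("A", "B", "C"), each = 30)))

# "C" is omitted from the vector, so its constraint defaults to zero
m <- quickmatch(covs, treat, treatment_constraints = c("A" = 1, "B" = 2))

# Rows are matched groups, columns are treatment conditions
counts <- table(group = as.integer(m), treatment = treat)
print(counts)

# Smallest per-group count for the two constrained conditions
min(counts[, "A"])
min(counts[, "B"])
```

If the constraints are honored, the two minima should be at least 1 and 2 respectively, while the "C" column may contain zeros.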

The size_constraint parameter can be used to constrain the matched groups to contain at least a certain number of units in total (independently of treatment assignment). For example, if treatment_constraints = c("A" = 1, "B" = 2) and size_constraint = 4, each matched group will contain at least one unit assigned to "A", at least two units assigned to "B" and at least four units in total, where the fourth unit can be from any treatment condition.
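The size constraint from this example can be checked in the same way. The sketch below uses simulated data (the names covs, treat and group_sizes are hypothetical) and assumes that as.integer returns the group label of each unit.

```r
library(quickmatch)

# Hypothetical example data: 30 units in each of three conditions
set.seed(2)
n <- 90
covs <- data.frame(x1 = runif(n), x2 = runif(n))
treat <- factor(sample(rep(c("A", "B", "C"), each = 30)))

# At least one "A", two "B" and four units in total per group
m <- quickmatch(covs,
                treat,
                treatment_constraints = c("A" = 1, "B" = 2),
                size_constraint = 4)

# Total number of units in each matched group
group_sizes <- table(as.integer(m))
min(group_sizes)
```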

The target parameter can be used to control which units are included in the matching. When target is NULL (the default), all units will be assigned to a matched group. When not NULL, the parameter indicates that some units must be assigned to a matched group and that the remaining units can safely be ignored. This can be useful, for example, when one is interested in estimating treatment effects only for a certain type of units (e.g., the average treatment effect for the treated, ATT). It is particularly useful when units of interest are not represented in the whole covariate space (i.e., a one-sided overlap problem). Without the target parameter, the function would in such cases try to assign every unit to a group, including units in sparse regions that we are not interested in. This could lead to unnecessarily large and diverse matched groups. By specifying that some units are of interest only insofar as they help us satisfy the matching constraints (i.e., setting the target parameter to the appropriate value), we can avoid such situations.

Consider, as an example, a study with two treatment conditions, "A" and "B". Units assigned to "B" are more numerous and tend to have more extreme covariate values. We are, however, only interested in estimating the treatment effect for units assigned to "A". By specifying target = "A", the function ensures that all "A" units are assigned to matched groups. Some units assigned to treatment "B" – in particular the units with extreme covariate values – will be left unassigned. However, as those units are not of interest, they can safely be ignored, and we avoid groups of poor quality.
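This scenario can be sketched with simulated data. The sample sizes, the normal distributions and the names covs, treat and assigned are assumptions chosen for illustration. Unassigned units are assumed to appear as NA in the returned object.

```r
library(quickmatch)

# Hypothetical data: "B" units are more numerous and more dispersed
set.seed(3)
n_a <- 30
n_b <- 120
covs <- data.frame(x1 = c(rnorm(n_a, sd = 1), rnorm(n_b, sd = 3)),
                   x2 = c(rnorm(n_a, sd = 1), rnorm(n_b, sd = 3)))
treat <- factor(rep(c("A", "B"), c(n_a, n_b)))

# Target only the units assigned to "A"
m_att <- quickmatch(covs, treat, target = "A")

# Cross-tabulate treatment against whether the unit was matched
assigned <- !is.na(as.integer(m_att))
table(treatment = treat, assigned = assigned)
```

Every "A" unit should appear in the assigned column. How many "B" units are left out depends on the data and on the settings discussed next.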

Even if some of the units that can be ignored are not needed to satisfy the matching constraints, it is rarely beneficial to discard them blindly; they can occasionally provide useful information. The default behavior when target is non-NULL is to assign as many of the ignorable units as possible given that the within-group distances do not increase too much (using secondary_unassigned_method = "estimated_radius"). This behavior might, however, reduce covariate balance in some instances. If called with secondary_unassigned_method = "ignore", units not specified in target will be discarded unless they are absolutely needed to satisfy the matching constraints. This tends to reduce bias since the within-group distances are minimized, but it could increase variance since we ignore potentially useful information in the sample. An intermediate alternative is to specify an aggressive caliper for the ignorable units, which is done with the secondary_radius parameter. (These parameters are part of the sc_clustering function that quickmatch calls. The target parameter corresponds to the primary_data_points parameter in that function.)
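A rough way to see the difference between the two behaviors is to count how many units end up in a matched group under each setting. The sketch below uses simulated data with hypothetical names (covs, treat, d, n_assigned) and passes secondary_unassigned_method as a literal string, as in the text above.

```r
library(quickmatch)

# Hypothetical data: "B" units are more numerous and more dispersed
set.seed(4)
n_a <- 30
n_b <- 120
covs <- data.frame(x1 = c(rnorm(n_a, sd = 1), rnorm(n_b, sd = 3)),
                   x2 = c(rnorm(n_a, sd = 1), rnorm(n_b, sd = 3)))
treat <- factor(rep(c("A", "B"), c(n_a, n_b)))
d <- distances(covs)

# Default handling of the ignorable ("B") units
m_default <- quickmatch(d, treat, target = "A")

# Discard ignorable units unless needed for the constraints
m_ignore <- quickmatch(d, treat, target = "A",
                       secondary_unassigned_method = "ignore")

# Number of units assigned to some matched group under each setting
n_assigned <- c(default = sum(!is.na(as.integer(m_default))),
                ignore = sum(!is.na(as.integer(m_ignore))))
n_assigned
```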

The caliper parameter constrains the maximum distance between units assigned to the same matched group. This is implemented by restricting the edge weight in the graph used to construct the matched groups (see sc_clustering for details). As a result, the caliper will affect all groups in the matching and, in general, make it harder for the function to find good matches even for groups where the caliper is not binding. In particular, an overly tight caliper can lead to discarded units that otherwise would be assigned to a group satisfying both the matching constraints and the caliper. For this reason, it is recommended to set the caliper value quite high and only use it to avoid particularly poor matches. It is strongly recommended to use the caliper parameter only when primary_unassigned_method = "closest_seed" in the underlying sc_clustering function (which is the default behavior).

quickmatch calls sc_clustering with seed_method = "inwards_updating". The seed_method parameter governs how the seeds are selected in the nearest neighborhood graph that is used to construct the matched groups (see sc_clustering for details). The "inwards_updating" option generally works well and is safe with most data sets. Using seed_method = "exclusion_updating" often leads to better performance (in the sense of matched groups with more similar units), but it may increase run time. With discrete data (or, more generally, when units tend to be at equal distance to many other units), run time is particularly poor with this option. If the data set has at least one continuous covariate, "exclusion_updating" is typically reasonably quick. A third option is seed_method = "lexical", which decreases the run time relative to "inwards_updating" (sometimes considerably) at the cost of performance. quickmatch passes parameters on to sc_clustering, so to change seed_method, call quickmatch with the parameter specified as usual: quickmatch(..., seed_method = "exclusion_updating").
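The three seed methods mentioned here can be compared on a small simulated sample. In the sketch below, the data and the helper max_within are hypothetical. The helper is not part of quickmatch; it recomputes Euclidean distances with base R's dist to report the largest within-group distance.

```r
library(quickmatch)

# Hypothetical data with two continuous covariates
set.seed(5)
n <- 200
covs <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
treat <- factor(sample(rep(c("T", "C"), each = n / 2)))
d <- distances(covs)

# One call per seed method, each passed as a literal string
m_lexical <- quickmatch(d, treat, seed_method = "lexical")
m_inwards <- quickmatch(d, treat)  # default: "inwards_updating"
m_exclusion <- quickmatch(d, treat, seed_method = "exclusion_updating")

# Hypothetical helper: largest Euclidean distance within any group
max_within <- function(m) {
  g <- as.integer(m)
  groups <- split(seq_along(g), g)
  max(sapply(groups, function(idx) max(dist(covs[idx, ]))))
}

# Compare the worst within-group distance across seed methods
c(lexical = max_within(m_lexical),
  inwards_updating = max_within(m_inwards),
  exclusion_updating = max_within(m_exclusion))
```

The ranking can differ between data sets, so this is best read as a diagnostic rather than a benchmark.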

Value

Returns a qm_matching object with the matched groups.

References

Sävje, Fredrik, Michael J. Higgins and Jasjeet S. Sekhon (2017), ‘Generalized Full Matching’, arXiv 1703.03882. https://arxiv.org/abs/1703.03882

See Also

See sc_clustering for the underlying function used to construct the matched groups.

Examples

# Construct example data
my_data <- data.frame(y = rnorm(100),
                      x1 = runif(100),
                      x2 = runif(100),
                      treatment = factor(sample(rep(c("T1", "T2", "C"), c(25, 25, 50)))))

# Make distances
my_distances <- distances(my_data, dist_variables = c("x1", "x2"))

# Make matching with at least one unit from "T1", "T2" and "C" in each matched group
quickmatch(my_distances, my_data$treatment)

# Require at least two "C" in each group
quickmatch(my_distances,
           my_data$treatment,
           treatment_constraints = c("T1" = 1, "T2" = 1, "C" = 2))

# Require groups with at least six units in total
quickmatch(my_distances,
           my_data$treatment,
           treatment_constraints = c("T1" = 1, "T2" = 1, "C" = 2),
           size_constraint = 6)

# Focus the matching on units assigned to "T1" and "T2" (i.e., all
# units assigned to "T1" or "T2" will be assigned to a matched group).
# Units assigned to treatment "C" will be assigned to groups so as to
# ensure that each group contains at least one unit of each treatment
# condition. Remaining "C" units could be left unassigned.
quickmatch(my_distances,
           my_data$treatment,
           target = c("T1", "T2"))

# Impose caliper
quickmatch(my_distances,
           my_data$treatment,
           caliper = 0.25)

# Call `quickmatch` directly with covariate data (i.e., not pre-calculating distances)
quickmatch(my_data[c("x1", "x2")], my_data$treatment)

# Call `quickmatch` directly with covariate data using Mahalanobis distances
quickmatch(my_data[c("x1", "x2")],
           my_data$treatment,
           normalize = "mahalanobize")


fsavje/quickmatch documentation built on Dec. 11, 2023, 5:09 a.m.