Generate a Distance Matrix

Description

The gendistance function creates an (N+K)x(N+K) distance matrix from an NxP covariate matrix, where N is the number of subjects, P is the number of covariates, and K is the number of phantom subjects requested (see the ndiscard option). Provided the covariance matrix of the covariates is invertible, the distances computed are Mahalanobis distances or, if covariate weights are provided, reweighted Mahalanobis distances (see the weights option and Greevy et al., Pharmacoepidemiology and Drug Safety, 2012).
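
As a rough illustration of the unweighted case, the sketch below builds a pairwise Mahalanobis distance matrix using only base R. It is not the package's internal code: the toy matrix X and the names S, D2 and D are illustrative, and whether gendistance applies the square root or any further scaling is not stated on this page.

```r
set.seed(1)
# Toy N x P covariate matrix (N = 5 subjects, P = 2 covariates)
X <- cbind(val1 = rnorm(5), val2 = rnorm(5))

# Sample covariance matrix of the covariates; it must be invertible
S <- cov(X)

# stats::mahalanobis returns SQUARED distances from each row of X to `center`.
# Column i of D2 holds the squared distances from every subject to subject i.
D2 <- sapply(seq_len(nrow(X)), function(i) mahalanobis(X, center = X[i, ], cov = S))

# N x N matrix of pairwise Mahalanobis distances
D <- sqrt(D2)
```

The result is symmetric with a zero diagonal, which is the shape a matching routine expects before any phantom rows and columns are appended.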

Usage

gendistance(covariate, idcol = NULL, weights = NULL, prevent = NULL,
  force = NULL, rankcols = NULL, missing.weight = 0.1, ndiscard = 0,
  singular.method = "solve", talisman = NULL, prevent.res.match = NULL,
  ...)

Arguments

covariate

A data.frame object, containing the covariates of the data set.

idcol

An integer or column name, providing the index of the column containing row IDs.

weights

A numeric vector whose length should match the number of columns. Each value determines how much weight is given to the corresponding column when generating the distance matrix.

prevent

A vector of integers or column names, providing the index of columns that should be used to prevent matches. When generating the distance matrix, elements that match on these columns are given a maximum distance.

force

An integer or column name, providing the index of the column containing information used to force pairs to match.

rankcols

A vector of integers or column names, providing the index of columns that should have the rank function applied to them before generating the distance matrix.

missing.weight

A numeric value, or vector, used to generate the weight of missingness indicator columns. Missingness indicator columns are created if there is missing data within the data set. Defaults to 0.1. If a single value is supplied, weights are generated by multiplying this value by the original column's weight. If a vector is supplied, its length should match the number of columns with missing data, and the weights are used as is.

ndiscard

An integer, providing the number of elements that should be allowed to match phantom values. The default value is 0.

singular.method

A character string, indicating the function to use when encountering a singular matrix. By default, solve is called. The alternative is to call ginv from the MASS package.

talisman

An integer or column name, providing the location of the talisman column. The talisman column should contain only values of 0 and 1. Records with zero will match phantoms perfectly, while other records will match phantoms at the maximum distance.

prevent.res.match

An integer or column name, providing the location of the column containing assigned treatment groups. This is useful in some settings, such as trickle-in randomized trials. When set, non-NA values from this column are replaced with the value 1. This prevents records with previously assigned treatments (the ‘reservoir’) from matching each other.

...

Additional arguments, not used at this time.

Details

Given a data.frame of covariates, generate a distance matrix. Missing values are imputed with fill.missing. For each column with missing data, a missingness indicator column will be added. Phantoms are fake elements that perfectly match all elements. They can be used to discard a certain number of elements.
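
The steps described above can be sketched in base R. This is a sketch under stated assumptions, not the package's implementation: column-mean imputation stands in for fill.missing (whose method is not described here), Euclidean distance stands in for the Mahalanobis distance, and the large phantom-to-phantom distance is an assumption made so that phantoms do not pair with each other. All object names are hypothetical.

```r
# Toy covariates with one missing value (names are illustrative only)
covs <- data.frame(val1 = c(0.5, NA, 1.5, 2.0), val2 = c(1, 2, 3, 4))

# 1. Add a 0/1 missingness indicator for each column that contains NA
has.na <- names(covs)[sapply(covs, anyNA)]
for (nm in has.na) {
  covs[[paste0(nm, ".missing")]] <- as.integer(is.na(covs[[nm]]))
}

# 2. Stand-in imputation: column mean (an assumption, not fill.missing itself)
for (nm in has.na) {
  covs[[nm]][is.na(covs[[nm]])] <- mean(covs[[nm]], na.rm = TRUE)
}

# 3. Pad an N x N distance matrix with K phantoms at distance 0 from every real row
d <- as.matrix(dist(covs[, c("val1", "val2")]))  # Euclidean, for illustration only
K <- 2
N <- nrow(d)
padded <- matrix(0, N + K, N + K)
padded[seq_len(N), seq_len(N)] <- d

# Assumption: phantoms get a large distance to each other
phantom <- N + seq_len(K)
padded[phantom, phantom] <- max(d) * 10
diag(padded) <- 0
```

With K = 2 phantoms, an optimal matching can pair the two hardest-to-match subjects with phantoms at zero cost, which is how those subjects end up discarded.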

Value

A list object with the following elements:

dist

generated distance matrix

cov

covariate matrix used to generate distances

ignored

ignored columns from original covariate matrix

weights

weights applied to each column in covariate matrix

prevent

columns used to prevent matches

mates

index of rows that should be forced to match

rankcols

index of columns that should use rank

missing.weight

weight to apply to missingness indicator columns

ndiscard

number of elements that will match phantoms

Author(s)

Cole Beck

See Also

distancematrix

Examples

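# gendistance is provided by the nbpMatching package
library(nbpMatching)

set.seed(1)
df <- data.frame(id=LETTERS[1:25], val1=rnorm(25), val2=rnorm(25))
# add some missing data
df[sample(seq_len(nrow(df)), ceiling(nrow(df)*0.1)), 2] <- NA
# default distances, with two phantoms so that two subjects can be discarded
df.dist <- gendistance(df, idcol=1, ndiscard=2)
# up-weight the second column
df.weighted <- gendistance(df, idcol=1, weights=c(1,2,1), ndiscard=2, missing.weight=0.25)
# make column 3 an exact multiple of column 2, so the covariance matrix is singular
df[,3] <- df[,2]*2
# compare the two settings of singular.method on the singular matrix
df.sing.solve <- gendistance(df, idcol=1, ndiscard=2)
df.sing.ginv <- gendistance(df, idcol=1, ndiscard=2, singular.method="ginv")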
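
The last two calls differ only in singular.method. The sketch below shows, outside the package, how the two underlying functions behave on an exactly singular matrix. It assumes the MASS package is installed; the matrix S and the other names are illustrative, and how gendistance itself recovers when solve cannot invert the covariance matrix is not described on this page.

```r
# A 2 x 2 matrix whose second column is twice the first, hence singular
S <- matrix(c(1, 2, 2, 4), nrow = 2)

# base::solve cannot invert an exactly singular matrix and signals an error
solve.failed <- inherits(try(solve(S), silent = TRUE), "try-error")

# MASS::ginv returns the Moore-Penrose generalized inverse instead
G <- MASS::ginv(S)

# Defining property of a generalized inverse: S %*% G %*% S reproduces S
reproduces.S <- isTRUE(all.equal(S %*% G %*% S, S))
```

A generalized inverse always exists, so the "ginv" setting yields distances even when covariates are perfectly collinear, as val1 and val2 are in the example above.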