atlas: Association testing by combining several matching thresholds
In ludic: Linkage Using Diagnosis Codes


Computes association test p-values from a generalized linear model at each threshold considered, then combines them across all thresholds into a single p-value using Fisher's method with perturbation resampling.
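
As a rough illustration of the combination step, the sketch below computes Fisher's statistic, -2 * sum(log(p)), for the observed p-values and compares it with the same statistic computed on each set of perturbed p-values. The helpers `fisher_stat` and `combine_pvals` are hypothetical names, not `ludic` functions, and the independent uniform draws are only a toy stand-in for the perturbed p-values that `atlas` produces; the package's actual implementation may differ.

```r
# Fisher's combination statistic for a vector of p-values
fisher_stat <- function(p) -2 * sum(log(p))

# Hypothetical helper (not part of ludic): resampling p-value for the combination.
# p_obs:       observed p-values, one per threshold
# p_perturbed: matrix of perturbed p-values, thresholds as rows,
#              perturbations as columns
combine_pvals <- function(p_obs, p_perturbed) {
  stat_obs <- fisher_stat(p_obs)
  # one Fisher statistic per perturbation (i.e. per column)
  stat_null <- apply(p_perturbed, 2, fisher_stat)
  # proportion of perturbed statistics at least as extreme as the observed one
  mean(stat_null >= stat_obs)
}

set.seed(1)
p_obs <- c(0.01, 0.02, 0.04)
# toy stand-in: 3 thresholds, 200 perturbations
p_perturbed <- matrix(stats::runif(3 * 200), nrow = 3)
p_comb <- combine_pvals(p_obs, p_perturbed)
p_comb
```

Resampling is used here instead of the usual chi-squared reference distribution because p-values obtained at different thresholds from the same data are not independent.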

atlas(
  match_prob,
  y,
  x,
  covar = NULL,
  thresholds = seq(from = 0.1, to = 0.9, by = 0.2),
  nb_perturb = 200,
  dist_family = c("gaussian", "binomial"),
  impute_strategy = c("weighted average", "best")
)

`match_prob`	matching probabilities matrix (e.g. obtained through `recordLink`) of dimensions `n1 x n2`.
`y`	response variable of length `n1`, either continuous or binary depending on `dist_family`.
`x`	a `matrix` or a `data.frame` of predictors of dimensions `n2 x p`. An intercept is automatically added within the function.
`covar`	a `matrix` or a `data.frame` of covariates to adjust for in the test, of dimensions `n3 x p`. Default is `NULL`, in which case there is no adjustment.
`thresholds`	a vector (possibly of length `1`) containing the different thresholds used to call a match. Default is `seq(from = 0.1, to = 0.9, by = 0.2)`.
`nb_perturb`	the number of perturbations used for the p-value combination. Default is 200.
`dist_family`	a character string indicating the distribution family for the glm. Currently, only `'gaussian'` and `'binomial'` are supported. Default is `'gaussian'`.
`impute_strategy`	a character string indicating which strategy to use to impute `x` from the matching probabilities `match_prob`. Either `"best"` (in which case the most probable match above the threshold is imputed) or `"weighted average"` (in which case a weighted mean is imputed for each individual who has at least one match with a posterior probability above the threshold). Default is `"weighted average"`.
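
The two imputation strategies can be sketched in base R as follows. This is only an illustration under stated assumptions, not the code used by `atlas`: `impute_x` is a hypothetical helper, "above the threshold" is taken as a strict inequality, and the weights are assumed to be the matching probabilities of the retained candidates, renormalized to sum to one.

```r
# Hypothetical helper (not part of ludic): impute x for each row of match_prob
impute_x <- function(match_prob, x, threshold,
                     strategy = c("weighted average", "best")) {
  strategy <- match.arg(strategy)
  x <- as.matrix(x)
  out <- matrix(NA_real_, nrow = nrow(match_prob), ncol = ncol(x))
  for (i in seq_len(nrow(match_prob))) {
    p <- match_prob[i, ]
    keep <- which(p > threshold)   # candidate matches (assumed strict inequality)
    if (length(keep) == 0) next    # no match called: row stays NA
    if (strategy == "best") {
      # take the single most probable candidate
      out[i, ] <- x[keep[which.max(p[keep])], ]
    } else {
      # assumed weighting: probabilities renormalized over retained candidates
      w <- p[keep] / sum(p[keep])
      out[i, ] <- colSums(x[keep, , drop = FALSE] * w)
    }
  }
  out
}

match_prob <- rbind(c(0.60, 0.35, 0.05),
                    c(0.20, 0.20, 0.20),
                    c(0.10, 0.10, 0.80))
x <- matrix(c(1, 2, 3), ncol = 1)
x_best <- impute_x(match_prob, x, threshold = 0.3, strategy = "best")
x_wavg <- impute_x(match_prob, x, threshold = 0.3, strategy = "weighted average")
```

In this toy example the second row has no probability above 0.3, so no value is imputed for it under either strategy; the two strategies differ only for the first row, which has two candidates above the threshold.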

a list containing the following:

`influencefn_pvals`	p-values obtained from influence function perturbations, with the covariates as columns and the thresholds as rows, and an additional row at the top for the combination.
`wald_pvals`	a matrix containing the p-values obtained from the Wald test, with the covariates as columns and the thresholds as rows.
`ptbed_pvals`	a list containing, for each covariate, a matrix with the `nb_perturb` perturbed p-values, with the different thresholds as rows.
`theta_impute`	a matrix of the estimated coefficients from the glm when imputing the weighted average for covariates (as columns), with the thresholds as rows.
`sd_theta`	a matrix of the estimated SD (from the influence function) of the coefficients from the glm when imputing the weighted average for covariates (as columns), with the thresholds as rows.
`ptbed_theta_impute`	a list containing, for each covariate, a matrix with the `nb_perturb` perturbed estimated coefficients from the glm when imputing the weighted average for covariates, with the different thresholds as rows.
`impute_strategy`	a character string indicating which imputation strategy was used (either `"weighted average"` or `"best"`).

Zhang HG, Hejblum BP, Weber G, Palmer N, Churchill S, Szolovits P, Murphy S, Liao KP, Kohane I and Cai T, ATLAS: An automated association test using probabilistically linked health records with application to genetic studies, JAMIA, in press (2021). doi: 10.1101/2021.05.02.21256490.

library(ludic)

# Number of simulated datasets (increase, e.g. to 5000, for a stable estimate)
n_sims <- 1

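# Simulate one dataset in which y is generated independently of x, and
# return the influence-function p-values computed by atlas()
mysim <- function(i){
  # predictors: 99 records, 2 variables
  x <- matrix(stats::rnorm(n = 99*2), nrow = 99, ncol = 2)
  # matching probabilities between the 103 records of y and the 99 records of x
  match_prob <- matrix(stats::rbeta(n = 103*99, 1, 2), nrow = 103, ncol = 99)

  # Gaussian alternative:
  #y <- stats::rnorm(n = 103, mean = 1, sd = 0.5)
  #return(atlas(match_prob, y, x, dist_family = "gaussian")$influencefn_pvals)
  # binary response
  y <- stats::rbinom(n = 103, size = 1, prob = 0.5)
  return(atlas(match_prob, y, x, dist_family = "binomial")$influencefn_pvals)
}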
# Run the simulations (parallel alternative, requires the pbapply package):
#res <- pbapply::pblapply(1:n_sims, mysim, cl = parallel::detectCores()-1)
res <- lapply(seq_len(n_sims), mysim)

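# Proportion of simulations with p < 0.05, for each row of the output;
# the last two columns of the output are dropped
n_cols <- ncol(res[[1]])
size <- sapply(seq_len(n_cols - 2),
               FUN = function(i){
                 rowMeans(sapply(res, function(m){m[, i] < 0.05}), na.rm = TRUE)
               }
)
rownames(size) <- rownames(res[[1]])
colnames(size) <- colnames(res[[1]])[seq_len(n_cols - 2)]
size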