ps_match_strata: Propensity Score Matching of strata (two or more classes,...
In PropensitySub: Treatment Effect Estimate in Strata with Missing Data

Description Usage Arguments Value Note Examples

This function perfroms propensity score matching of two or more strata.
It could be used when arm1 has 2 or more strata, while strata information is unknown in arm2.
The function will fit a logistic regression (when 2 classes) or multinomial logistic regression (when > 2 classes) based on strata labels in arm1 (model: label~features), then predict strata labels in both arm1 and arm2 based on the fitted model. The predicted probability of being stratum X will be used for propensity score matching. The matching results will then be used to estimate treatment difference of two arms (Hazard ratio for survival endpoint; response rate difference for binary endpoint). When ties are allowed, weights from the ties will be used to calculate the HR or response rate difference.

ps_match_strata(
  data.in,
  formula,
  indicator.var = "indicator",
  ties = TRUE,
  class.of.int = NULL,
  tte = "AVAL",
  event = "event",
  trt = "trt",
  response = NULL,
  caliper = NULL,
  model = "plain",
  weights = NULL,
  multinom.maxit = 100,
  return.data = TRUE
)

`data.in`	(`data.frame`) input data
`formula`	(`formula`) to input to the logistic or multinomial logistic model (in the form of strata~features)
`indicator.var`	(`string`) column name of the strata indicator variable which must be numeric. Assume arm1 has strata labeling and arm2 does not have strata labeling. pts without strata labeling should be indicated as -1 (e.g. pts in the arm1, or pts in arm2 but with missing label). within arm1 (the arm with strata labeling), subclasss should be indicated as 0,1,2...
`ties`	(`logical`) `TRUE` allows for ties. ties is `TRUE`, all samples in the tie will be included. When calculating summary statistics, samples in ties will be assigned a smaller weight. for example, if two samples ties, these two samples will both be included in the summary statistics calculation with weight 0.5. ties is `FALSE`, one random sample will be draw from the tied samples when calculating summary statistics. In this case, it is recommended to run `ps_match_strata` multiple times with different seeds and take the average or median summary statistics from multiple runs. Note when ties is `FALSE`, codes were tested less thoroughly and extra caution may be needed.
`class.of.int`	(`list`) classes (stratum) of interest. Request to be in list format. It could be subset of classes in arm1; it could also define combined classes. For example: class.of.int = list("class1"=0, "class2"=1, "class3"=2, "class2or3"="c(1,2)"). for "class2or3", Prob(class 2 or 3) will be calculated as Prob(class2) + Prob(class3)
`tte`	(`string`) column name of the time to event variable
`event`	(`string`) column name of the event variable (1: event, 0: censor)
`trt`	(`string`) column name of the treatment variable. The variable is expected to be in `factor` format and the first level will be considered as reference (control) arm when calculating summary statistics.
`response`	(`string`) column name of the response variable. 1 indicates responder and 0 indicates non responder. if response is not NULL, tte and event will be ignored and the function will assume binary outcome.
`caliper`	A scalar or vector denoting the caliper(s) which should be used when matching. A caliper is the distance which is acceptable for any match. Observations which are outside of the caliper are dropped. If a scalar caliper is provided, this caliper is used for all covariates in `X`. If a vector of calipers is provided, a caliper value should be provided for each covariate in `X`. The caliper is interpreted to be in standardized units. For example, `caliper=.25` means that all matches not equal to or within .25 standard deviations of each covariate in `X` are dropped. Note that dropping observations generally changes the quantity being estimated.
`model`	(`string`) one of (`plain`, `dwc`, `wri`). "plain" when 2 levels are specified in indicator variable, a binomial glm will be fitted; when more than 2 levels are specified, a multinomial glm will be fitted; "dwc" Doubly weighted control: Two separated models will be fitted: one is binomial glm of 2 vs. (1, 0), the other one is bionomial glm of 1 vs 0. The probability of being each class is then calculated by aggregating these two models. Note this is similar to the plain method but with different (less rigid) covariance assumptions. "wri" Weight regression imputation: the current status is going to be learned from the next status. Indicator of the next status should be specified using indicator.next. Currently "wri" only support the case where there are only two non-missing strata. In indicator variable, the two nonmissing strata should be coded as 0 and 1, the missing group should be coded as 2.
`weights`	(`numeric`) weights of each subject. If not `NULL`, the estimated probabilities will be reweightsed to ensure sum(probability) of a subject = the subject's weights. If weights is not `NULL`, quasibinomial model will be used.
`multinom.maxit`	see parameter `maxit` in nnet::multinom, default is 100
`return.data`	(`logical`) whether to return data with estimated probabilities.

return a list containing the following components:

stat a matrix with rows as strata and columns as Estimate and CIs.
converged logical to indicate whether model converges.
any_warning_glm logical to indicate whether there's warning from glm model.
warning.msg a list to capture any warning message from the modeling process.
models a list to capture the glm model results.
roc.list a list to capture information about Area under the curve from glm model.
data a data.frame which is the original input data plus predicted probabilities.

Different from the original version, iter is no longer a parameter if tie = FALSE is specified, user need to run for loops of sapply outside of this function to get results from multiple seeds. Three elements in the output list - the data element is a data frame that contains input data and estimated probabilities. The stat element contains estimated treatment difference between 2 arms, in each of the strata of interest. The converge element indicates whether the model converged (taking from $converged from glm and $convergency from multinom) if return.data is FALSE, data won't be returned. model = "wri" is not supported in ps_match_strata

 
library(dplyr)
 # example 1:  Impute NA as one stratum in experimental arm; default model 
clinical_1 <- clinical %>% mutate( 
  indicator = case_when(
    STRATUM == "strata_1" ~ 0, 
    STRATUM == "strata_2" ~ 1,
    is.na(STRATUM) & ARM == "experimental" ~ 1,
    TRUE ~ -1 
  ),
  ARM  = factor(ARM, levels = c("control","experimental")),
  BNLR = case_when(
    is.na(BNLR) ~ median(BNLR, na.rm = TRUE),
    TRUE ~ BNLR
  )
)
ps_res1 <- ps_match_strata(
  data.in = clinical_1, formula = indicator ~ BECOG + SEX + BNLR,
  indicator.var = "indicator", tte = "OS_MONTH", event = "OS_EVENT", trt = "ARM",
  class.of.int = list("strata_1" = 1, "strata_2" = 0)
 )
 ## Weighted HRs
 ps_res1$stat
 
 # example 2: "doubly weighted control" model
clinical_2 <- clinical %>% mutate( 
  indicator = case_when(
    STRATUM == "strata_1" ~ 0, 
    STRATUM == "strata_2" ~ 1,
    is.na(STRATUM) & ARM == "experimental" ~ 2,
    TRUE ~ -1
  ),
  ARM  = factor(ARM, levels = c("control","experimental")),
  BNLR = case_when(
    is.na(BNLR) ~ median(BNLR, na.rm = TRUE),
    TRUE ~ BNLR
  )
)

ps_res2 <- ps_match_strata(
  data.in = clinical_2, formula = indicator ~ BECOG + SEX + BNLR, model = "dwc",
  indicator.var = "indicator", tte = "OS_MONTH", event = "OS_EVENT", trt = "ARM",
  class.of.int = list("strata_1" = 0, "strata_2" = 1, "missing" = 2)
 ) 
 ps_res2$stat
 ps_res2$converged