ICP: Invariant Causal Prediction
In Laksafoss/ICPSurv: Invariant Causal Prediction for Survival Data



A method for finding causal predictors of a target variable described by either a linear, generalized linear or hazard model. The methodology uses heterogeneous data to make causal inference.

ICP(
  Y,
  X,
  E = NULL,
  model = "lm",
  method = "EnvirRel",
  level = 0.05,
  gof = max(0.01, level),
  maxNoVariables = 8,
  fullAnalysis = FALSE,
  progress = FALSE,
  ...
)

`Y`	The response or target variable of interest. Either a numeric vector or `survival` object.
`X`	A matrix (or data frame) with the predictor variables.
`E`	Indicator of the experiment or the intervention type an observation belongs to. Can be a vector of the same length as `Y` with at least two unique values.
`model`	A character indicating how to model the distribution of the target variable given covariates. Possible choices are `lm` : Linear Model. `glm` : Generalized Linear Model. When using `model = "glm"` a `family` must also be specified in function options. `ph` : Proportional Hazard Model. `ah` : Additive Hazard Model. `hazard` : Hazard Model. When using `model = "hazard"` a `dist` must be specified.
`method`	A character indicating which method to use. Possible values are `EnvirRel` : Environment Relevance Test. `CR` : Intersecting Confidence Regions Test. Using this method the user may also specify the `solver` (see details about "The Invariance Test Methods"). `TimeVar` : Time Variations Test. Using this method the user may also specify `nonparamtest` (see details about "The Invariance Test Methods"). See details for more guidance on methods.
`level`	Numerical value between 0 and 1 denoting the significance level used when testing. If not specified the algorithm will only calculate the p-values of the null hypotheses (H_0,S) and draw no conclusions based on these values.
`gof`	If no set of variables (including the empty set) leads to a p-value larger than the goodness-of-fit cutoff `gof`, the whole model will be rejected. If the model is correct, this will happen with a probability of `gof`. This option protects against making statements when the model is obviously not suitable for the data.
`maxNoVariables`	The maximal subset size (choosing smaller values saves computational resources but increases approximation error).
`fullAnalysis`	If `TRUE` p-values for all null hypotheses will be found. If `FALSE` it will often be possible to save computation time: this depends on the method.
`progress`	If `TRUE` a progress bar will be printed.
`...`	Additional arguments carried to the lower level functions.

The ICP function implements several concrete methods within the methodology of Invariant Causal Prediction, which was first described in Peters et al. (2016) (see references below). This implementation is well suited when the distribution of the target variable may be described by a linear model, generalized linear model or hazard model. Three methods for testing invariance are implemented in ICP (EnvirRel, CR and TimeVar), and each is described below under "The Invariance Test Methods".

As input the ICP function takes a target variable Y, which is either a numeric vector or a survival object, a matrix or data.frame of covariates X and possibly, depending on the method, a vector of environments E. The ICP function computes a p-value for each member of the following family of null hypotheses:

H_0,S : (Y_i ∣ X_i^S = x) = (Y_j ∣ X_j^S = x) in distribution for all indices i, j and all x, for every S ⊆ {1,...,p} (where we have assumed that X encodes p covariates). The results of these hypothesis tests may be found in model.analysis.
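The family of hypotheses is indexed by subsets of the covariates, so the number of tests grows quickly with p. The following base R sketch enumerates the candidate sets S for a small p, reading `maxNoVariables` at face value as a cap on the subset size. The labelling scheme (`"Empty"`, `"X1+X2"`) and the variable names are illustrative assumptions, not the package's internal representation.

```r
# Illustrative sketch only: enumerate candidate sets S of size <= maxNoVariables.
p <- 3               # number of covariates (hypothetical)
maxNoVariables <- 2  # assumed here to cap the subset size

sizes <- seq_len(min(p, maxNoVariables))
non_empty <- unlist(
  lapply(sizes, function(k) combn(p, k, simplify = FALSE)),
  recursive = FALSE
)
subsets <- c(list(integer(0)), non_empty)  # the empty set is also a candidate

# Hypothetical labels for printing
labels <- vapply(
  subsets,
  function(S) if (length(S)) paste0("X", S, collapse = "+") else "Empty",
  character(1)
)
print(labels)
```

With p covariates and no cap there are 2^p candidate sets, which is why limiting the subset size saves computation.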

If level is specified (a subset of) the causal predictors is estimated using the formula (see Peters et al. (2016) for details):

A = ∩_{S : H_0,S accepted} S. The set A is returned under the name accepted.model. This computation is done by the function model_analysis, which can also be called directly.
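The intersection formula can be illustrated with a small base R sketch. The p-values below are made up for illustration, and the table layout (a `model` column with `+`-separated variable names) is an assumption rather than the exact format produced by the package.

```r
# Illustrative sketch only: intersect all sets S whose hypothesis is accepted.
pvals <- data.frame(
  model   = c("Empty", "X1", "X2", "X1+X2"),
  p.value = c(0.001, 0.40, 0.002, 0.35),  # hypothetical p-values
  stringsAsFactors = FALSE
)
level <- 0.05

# H_0,S is accepted when its p-value exceeds the level
accepted <- pvals$model[pvals$p.value > level]

# Turn each label into a set of variable names; "Empty" becomes the empty set
accepted_sets <- lapply(
  strsplit(accepted, "+", fixed = TRUE),
  function(s) setdiff(s, "Empty")
)

# A = intersection of all accepted sets
A <- Reduce(intersect, accepted_sets)
print(A)
```

Here the accepted sets are {X1} and {X1, X2}, so the intersection is {X1}. If the empty set were accepted as well, the intersection would be empty.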

Moreover, if both level is specified and fullAnalysis = TRUE then the function variable_analysis will calculate the significance of each individual variable in X. This significance table is returned under the name variable.analysis.

The gof parameter protects against making statements when the model is obviously not suitable for the data. If no model reaches the threshold gof significance level, i.e. the p-values for (H_0,S) are all smaller than gof, we report that there is no evidence for individual variables, as there is no evidence for an invariant set.
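Since the default is gof = max(0.01, level), the cutoff only differs from level when level is below 0.01. A minimal base R sketch of the check, using made-up p-values and hypothetical variable names:

```r
# Illustrative sketch only: the goodness-of-fit guard with a strict level.
pvals <- c(Empty = 0.001, X1 = 0.004, X2 = 0.002, "X1+X2" = 0.006)  # hypothetical
level <- 0.001
gof   <- max(0.01, level)  # default rule from the usage section

# Some hypotheses would be accepted at this level ...
some_accepted <- any(pvals > level)

# ... but no p-value is larger than gof, so the whole model is rejected
model_rejected <- !any(pvals > gof)

print(c(gof = gof, some_accepted = some_accepted, model_rejected = model_rejected))
```

In this situation no statement about individual variables would be made, even though several sets pass the test at the chosen level.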

The Invariance Test Methods

Three different invariance test methods have been implemented:

method = "EnvirRel" : The invariance test method of Environment Relevance is the standard method and can be applied to data from all model types (lm, glm & hazard). This method requires environments E as input.

method = "CR" : The invariance test method of Intersecting Confidence Regions can be applied to data from all model types (lm, glm & hazard). This method requires environments E as input. Moreover, a solution within the CR method framework may be found in three different ways: the standard is solver = "QC", which is usually also the slowest solver. If computational time is an issue the user may need to use the approximate solvers solver = "pairwise" or solver = "marginal".

method = "TimeVar" : The invariance test method of Time Variability can only be applied to data from "ph" or "ah" type models. This method does not require environment information, as it uses time as environment. The "TimeVar" method has three different concrete nonparamtests: a Kolmogorov–Smirnov type test denoted "sup", a Cramér–von Mises criterion type test denoted "int", or simply both tests denoted "test".
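To give some intuition for what an environment-based invariance test can look like in the linear case, the sketch below compares a model using only the covariates in S with a model that also lets the environment enter. This is one plausible reading of "environment relevance", written with base R only; it is an assumption for illustration and not the package's implementation. The helper `envir_pvalue` is hypothetical.

```r
# Illustrative sketch only, NOT the ICPSurv implementation.
set.seed(1)
n <- 500
E <- sample(5L, n, replace = TRUE)
X <- data.frame(X1 = rnorm(n, E, 1), X2 = rnorm(n, 3 * (E %in% c(1, 5)), 1))
Y <- rnorm(n, X$X1, 1)  # X1 is the true parent
dat <- data.frame(Y = Y, X, E = factor(E))

# Hypothetical helper: F-test of "E adds nothing once we condition on S"
envir_pvalue <- function(S, dat) {
  base_terms <- if (length(S)) S else "1"
  full_terms <- c(if (length(S)) S, "E", if (length(S)) paste0(S, ":E"))
  fit0 <- lm(reformulate(base_terms, response = "Y"), data = dat)
  fit1 <- lm(reformulate(full_terms, response = "Y"), data = dat)
  anova(fit0, fit1)[2, "Pr(>F)"]
}

p_empty <- envir_pvalue(character(0), dat)  # S = empty set
p_X1    <- envir_pvalue("X1", dat)          # S = {X1}
print(c(Empty = p_empty, X1 = p_X1))
```

For the empty set the environment clearly explains variation in Y, so the p-value is tiny and the set is rejected. For S = {X1} the null hypothesis holds by construction, so the p-value behaves like a draw from a uniform distribution.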

The ICP function returns an object of class ICP. Such an object contains the following:

`model.analysis`	A data frame listing the different models tested in the first column and the found p-values in the second column.
`call`	The matched call.
`level`	The significance level. If not specified this is `NULL`.
`method`	The method object used for the model fitting and hypothesis testing.
`accepted.model`	The estimated causal predictors. Only returned if `level` is specified in the input.
`empty.message`	If the empty set is returned as `accepted.model` then `empty.message` will give details. Only returned if `level` is specified in the input.
`variable.analysis`	A data.frame with each predictor variable's significance as a causal predictor. Will only be returned if `fullAnalysis = TRUE` in the input options.

Jonas Peters, Peter Bühlmann, and Nicolai Meinshausen. Causal inference by using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 78.5 (2016): 947-1012.

model_analysis calculates the accepted model.

variable_analysis calculates the individual variables' significance.

# ===========================================================================
# An example with normal distributions
n <- 500
E <- sample(5L, n, replace = TRUE)
X <- data.frame(X1 = rnorm(n, E, 1), X2 = rnorm(n, 3 * (E %in% c(1,5)), 1))
Y <- rnorm(n, X$X1, 1) # X1 is the true parent

# Environment Relevance Test:
ICP(Y, X, E)

# Intersecting Confidence Region Test, Quadratically Constrained Solver:
ICP(Y, X, E, method = "CR")

# Intersecting Confidence Region Test, Pairwise Solver:
ICP(Y, X, E, method = "CR", solver = "pairwise")

# Intersecting Confidence Region Test, Marginal Solver:
ICP(Y, X, E, method = "CR", solver = "marginal")


# ===========================================================================
# An example with a poisson distribution
Y <- rpois(n, exp(X$X1)) # the true causal predictor is X1

# Environment Relevance Test
ICP(Y, X, E, model = "glm", family = "poisson")

# Intersecting Confidence Region Test, Quadratically Constrained Solver:
ICP(Y, X, E, model = "glm", family = "poisson", method = "CR")

# Intersecting Confidence Region Test, Pairwise Solver:
ICP(Y, X, E, model = "glm", family = "poisson",
    method = "CR", solver = "pairwise")

# Intersecting Confidence Region Test, Marginal Solver:
ICP(Y, X, E, model = "glm", family = "poisson",
    method = "CR", solver = "marginal")


# ===========================================================================
# An example with right censored survival times
Y <- rexp(n, exp(- 0.5 * X$X1))
C <- rexp(n, exp(- 1.5))
time <- pmin(Y, C) # the true causal predictor is X1
status <- time == Y

# Environment Relevance Test
ICP(survival::Surv(time, status), X, E, model = "ph")

# The user may also define their own link functions, see
# ?survival::survreg.distributions
my_dist <- survival::survreg.distributions$exponential
my_dist$trans <- function(y) log(y / 365)
my_dist$dtrans <- function(y) 1 / y
my_dist$itrans <- function(y) 365 * exp(y)
ICP(survival::Surv(time, status), X, E, model = "hazard", dist = my_dist)
# this example is simply a reparametrization and therefore
# gives the same results as above.

# Intersecting Confidence Regions Test, Quadratically Constrained Solver:
ICP(survival::Surv(time, status), X, E, model = "ph", method = "CR")

# Intersecting Confidence Regions Test, Pairwise Solver:
ICP(survival::Surv(time, status), X, E, model = "ph",
    method = "CR", solver = "pairwise")

# Intersecting Confidence Regions Test, Marginal Solver:
ICP(survival::Surv(time, status), X, E, model = "ph",
    method = "CR", solver = "marginal")

# Non-parametric Tests of Time Varying Effect
ICP(survival::Surv(time, status), X, E, model = "ph", method = "TimeVar")

# Non-parametric Tests of Time Varying Effect with n.sim = 1000
ICP(survival::Surv(time, status), X, E, model = "ph",
    method = "TimeVar", n.sim = 1000)