seqICP: Sequential Invariant Causal Prediction

Description

Estimates the causal parents S of the target variable Y using invariant causal prediction and fits a linear model of the form
Y = a X^S + N.

Usage

seqICP(X, Y, test = "decoupled", par.test = list(grid = c(0,
  round(nrow(X)/2), nrow(X)), complements = FALSE, link = sum, alpha = 0.05, B =
  100, permutation = FALSE), model = "iid", par.model = list(pknown = FALSE,
  p = 0, max.p = 10), max.parents = ncol(X), stopIfEmpty = TRUE,
  silent = TRUE)

Arguments

X

matrix of predictor variables. Each column corresponds to one predictor variable.

Y

vector of target variable, with length(Y)=nrow(X).

test

string specifying the hypothesis test used to test for invariance of a parent set S (i.e. the null hypothesis H0_S). The following tests are available: "decoupled", "combined", "trend", "variance", "block.mean", "block.variance", "block.decoupled", "smooth.mean", "smooth.variance", "smooth.decoupled" and "hsic".

par.test

parameters specifying the hypothesis test. The following parameters are available: grid, complements, link, alpha, B and permutation. The parameter grid is an increasing vector of grid points used to construct environments for change-point-based tests. If the parameter complements is 'TRUE', each environment is compared against its complement; if it is 'FALSE', all environments are compared pairwise. The parameter link specifies how to combine the pairwise test statistics; generally this is either max or sum. The parameter alpha is a numeric value in (0,1) indicating the significance level of the hypothesis test. The parameter B is an integer specifying the number of Monte Carlo samples (or permutations) used to approximate the null distribution. If the parameter permutation is 'TRUE', a permutation-based approach is used to approximate the null distribution; if it is 'FALSE', the scaled residuals approach is used.

model

string specifying the underlying model class. Either "iid" if Y consists of independent observations or "ar" if Y has a linear time dependence structure.

par.model

parameters specifying the model. The following parameters are available: pknown, p and max.p. If pknown is 'FALSE', the number of lags is determined by comparing all fits with up to max.p lags using the AIC. If pknown is 'TRUE', the procedure fits p lags.

max.parents

integer specifying the maximum size of admissible parent sets. Reducing this below the number of predictor variables saves computational time but means that the confidence intervals lose their coverage property.

stopIfEmpty

if 'TRUE', the procedure will stop computing confidence intervals once the empty set has been accepted (and hence no variable can have a significant causal effect). Setting this to 'TRUE' saves computational time in these cases, but means that the confidence intervals lose their coverage properties for values different from 0.

silent

if 'FALSE', the procedure will output progress notifications consisting of the currently computed set S together with the p-value resulting from the null hypothesis H0_S.
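
To make the grid parameter of par.test more concrete, the following base R sketch shows how an increasing vector of grid points can split the observation indices into blocks. It assumes that consecutive grid points delimit one block each, i.e. block k covers observations grid[k] + 1 to grid[k + 1]; how seqICP combines such blocks into environments internally may differ. The names n, block, fine.grid and fine.block are illustrative only.

```r
n <- 220                          # number of observations, i.e. nrow(X)
grid <- c(0, round(n / 2), n)     # default grid from the Usage section

# Assumption: block k covers observations grid[k] + 1, ..., grid[k + 1].
block <- cut(seq_len(n), breaks = grid, labels = FALSE)
table(block)

# A finer grid with ten equally sized blocks, as used in the Examples section.
fine.grid <- seq(0, n, n / 10)
fine.block <- cut(seq_len(n), breaks = fine.grid, labels = FALSE)
table(fine.block)
```

A finer grid gives more, but shorter, blocks, so each block contains fewer observations for the tests to work with.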

Details

The function can be applied to two types of models
(1) a linear model (model="iid")
Y_i = a X_i^S + N_i
with iid noise N_i and
(2) a linear autoregressive model (model="ar")
Y_t = a_0 X_t^S + ... + a_p (Y_(t-p),X_(t-p)) + N_t
with iid noise N_t.
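
For model (2), the par.model argument states that the number of lags is chosen by AIC when pknown is 'FALSE'. The base R sketch below illustrates that idea on simulated data; it is not the package's internal code. The simulated coefficients and the names rows, aic and p.hat are assumptions made for illustration.

```r
set.seed(1)
n <- 300
x <- rnorm(n)
y <- numeric(n)
# Simulate Y_t = 0.8 X_t + 0.5 Y_(t-1) + N_t, i.e. one lag of Y.
for (t in 2:n) y[t] <- 0.8 * x[t] + 0.5 * y[t - 1] + 0.1 * rnorm(1)

max.p <- 4
rows <- (max.p + 1):n  # common sample so that AIC values are comparable

# Fit a regression with p lags of (Y, X) for p = 0, ..., max.p.
aic <- sapply(0:max.p, function(p) {
  d <- data.frame(y = y[rows], x0 = x[rows])
  for (k in seq_len(p)) {
    d[[paste0("y", k)]] <- y[rows - k]
    d[[paste0("x", k)]] <- x[rows - k]
  }
  AIC(lm(y ~ ., data = d))
})

p.hat <- (0:max.p)[which.min(aic)]  # lag order with the smallest AIC
p.hat
```

All fits use the same rows so that the AIC values refer to the same observations.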

For both models the invariant prediction procedure is applied using the hypothesis test specified by the test parameter to determine whether a candidate model is invariant. For further details see the references.
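
The acceptance step can be illustrated with the p-values printed in the Example output section. This sketch assumes the usual invariant causal prediction construction, in which a set is accepted when its p-value is at least alpha and the estimated parent set is the intersection of all accepted sets. The variable names are illustrative, and the enumeration order is an assumption chosen to match the printed progress output.

```r
d <- 3            # number of predictors in the example
max.parents <- 3  # largest candidate set size considered here

# Enumerate the empty set and all subsets of {1, ..., d} up to size max.parents.
sets <- c(list(integer(0)),
          unlist(lapply(seq_len(max.parents),
                        function(k) combn(d, k, simplify = FALSE)),
                 recursive = FALSE))
length(sets)  # 8 candidate sets

# p-values copied from the progress output in the Example output section.
pvals <- c(0.02, 0.02, 0.02, 0.02, 0.02, 0.32, 0.02, 0.2)
alpha <- 0.05

accepted <- sets[pvals >= alpha]          # {1, 3} and {1, 2, 3}
parent.set <- Reduce(intersect, accepted)
parent.set                                # 1 3
```

The number of candidate sets grows quickly with the number of predictors, which is why lowering max.parents saves computation.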

Value

object of class 'seqICP' consisting of the following elements

parent.set

vector of the estimated causal parents.

test.results

matrix containing the result from each individual test as rows.

S

list of all the sets that were tested. The position within the list corresponds to the index in the first column of the test.results matrix.

p.values

vector of p-values, one per variable, for the hypothesis that the variable is not included in the set of true causal parents. (If a p-value is smaller than alpha, the corresponding variable is a member of parent.set.)

coefficients

vector of coefficients resulting from a regression based on the estimated parent set.

stopIfEmpty

a boolean value indicating whether computations stop as soon as the intersection of accepted sets is empty.

modelReject

a boolean value indicating if the whole model was rejected (the p-value of the best fitting model is too low).

pknown

a boolean value indicating whether the number of lags in the model was known. Only relevant if model was set to "ar".

alpha

significance level at which the hypothesis tests were performed.

n.var

number of predictor variables.

model

either "iid" or "ar" depending on which model was selected.
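
The per-variable p.values can be related to the per-set p-values in test.results. The sketch below assumes that a variable's p-value is the largest p-value among the tested sets that do not contain it. This follows standard invariant causal prediction and reproduces the numbers in the Example output section, but it is an illustration rather than the package's code. The names sets, pvals and var.p are hypothetical.

```r
# Tested sets and their p-values, copied from the Example output section.
sets <- list(integer(0), 1, 2, 3, c(1, 2), c(1, 3), c(2, 3), c(1, 2, 3))
pvals <- c(0.02, 0.02, 0.02, 0.02, 0.02, 0.32, 0.02, 0.2)

# For each variable j, take the maximum p-value over all sets excluding j.
var.p <- sapply(1:3, function(j) {
  excludes.j <- !vapply(sets, function(S) j %in% S, logical(1))
  max(pvals[excludes.j])
})
names(var.p) <- c("X1", "X2", "X3")
var.p  # X1 = 0.02, X2 = 0.32, X3 = 0.02, matching the summary table
```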

Author(s)

Niklas Pfister and Jonas Peters

References

Pfister, N., P. Bühlmann and J. Peters (2017). Invariant Causal Prediction for Sequential Data. ArXiv e-prints (1706.08058).

Peters, J., P. Bühlmann, and N. Meinshausen (2016). Causal inference using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society, Series B (with discussion) 78 (5), 947–1012.

See Also

The function seqICP.s can be used to perform the hypothesis test for individual sets S. For non-linear models, the functions seqICPnl and seqICPnl.s can be used.

Examples

set.seed(1)

# environment 1
na <- 140
X1a <- 0.3*rnorm(na)
X3a <- X1a + 0.2*rnorm(na)
Ya <- -.7*X1a + .6*X3a + 0.1*rnorm(na)
X2a <- -0.5*Ya + 0.5*X3a + 0.1*rnorm(na)

# environment 2
nb <- 80
X1b <- 0.3*rnorm(nb)
X3b <- 0.5*rnorm(nb)
Yb <- -.7*X1b + .6*X3b + 0.1*rnorm(nb)
X2b <- -0.5*Yb + 0.5*X3b + 0.1*rnorm(nb)

# combine environments
X1 <- c(X1a,X1b)
X2 <- c(X2a,X2b)
X3 <- c(X3a,X3b)
Y <- c(Ya,Yb)
Xmatrix <- cbind(X1, X2, X3)

# Y follows the same structural assignment in both environments
# a and b (cf. the lines Ya <- ... and Yb <- ...).
# The direct causes of Y are X1 and X3.
# A linear model considers X1, X2 and X3 as significant.
# All these variables are helpful for the prediction of Y.
summary(lm(Y~Xmatrix))

# apply seqICP to the same setting
seqICP.result <- seqICP(X = Xmatrix, Y,
  par.test = list(grid = seq(0, na + nb, (na + nb)/10), complements = FALSE,
    link = sum, alpha = 0.05, B = 100),
  max.parents = 4, stopIfEmpty = FALSE, silent = FALSE)
summary(seqICP.result)
# seqICP is able to infer that X1 and X3 are causes of Y

Example output

Call:
lm(formula = Y ~ Xmatrix)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.205831 -0.061317 -0.001113  0.057515  0.266640 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.001799   0.005980   0.301    0.764    
XmatrixX1   -0.583158   0.027397 -21.285  < 2e-16 ***
XmatrixX2   -0.379482   0.047765  -7.945 1.06e-13 ***
XmatrixX3    0.687121   0.018082  38.000  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.08813 on 216 degrees of freedom
Multiple R-squared:  0.904,	Adjusted R-squared:  0.9027 
F-statistic: 678.1 on 3 and 216 DF,  p-value: < 2.2e-16

Currently fitting set S = {}
p-value: 0.02
Currently fitting set S = {1}
p-value: 0.02
Currently fitting set S = {2}
p-value: 0.02
Currently fitting set S = {3}
p-value: 0.02
Currently fitting set S = {1, 2}
p-value: 0.02
Currently fitting set S = {1, 3}
p-value: 0.32
Currently fitting set S = {2, 3}
p-value: 0.02
Currently fitting set S = {1, 2, 3}
p-value: 0.2

 Invariant Linear Causal Regression at level 0.05
 Variables X1, X3 show a significant causal effect
 
           coefficient lower bound upper bound  p-value  
intercept         0.0    -0.05900      0.0179       NA  
X1               -0.7    -0.75200     -0.5292     0.02 *
X2                0.0     0.00000      0.0000     0.32  
X3                0.6     0.57000      0.7228     0.02 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

seqICP documentation built on May 2, 2019, 5:51 a.m.
