Getting started with ghcm

\newcommand{\independent}{\mbox{${}\perp\mkern-11mu\perp{}$}} \newcommand{\notindependent}{\mbox{${}\not\!\perp\mkern-11mu\perp{}$}} \newcommand{\cond}{\,|\,} \newcommand{\ind}{\mathbf{1}}

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.align="center",
  dpi=125,
  fig.width=6,
  fig.height=4,
  out.width="100%"
)
options(rmarkdown.html_vignette.check_title = FALSE) 
library(ghcm)

Introduction

ghcm is an R package used to perform conditional independence tests for functional data.

This vignette gives an overview of the usage of the ghcm package. We briefly present the idea behind the GHCM and the conditions under which the test is valid, and then provide several examples of the usage of the ghcm package by analysing a simulated data set.

The Generalised Hilbertian Covariance Measure (GHCM)

In this section we briefly describe the idea behind the GHCM. For the full technical details and theoretical results, see [@Lundborg2021].

Let $X$, $Y$ and $Z$ be random variables of which we are given $n$ i.i.d. observations $(X_1, Y_1, Z_1), \dots, (X_n, Y_n, Z_n)$; here $X$, $Y$ and $Z$ can be either scalar or functional. Existing methods, such as the GCM [@GCM] implemented in the GeneralisedCovarianceMeasure package [@GCM-package], can deal with most cases where both $X$ and $Y$ are scalar, hence our primary interest is in the cases where at least one of $X$ and $Y$ is functional.

The GHCM estimates the squared Hilbert--Schmidt norm of the expected conditional covariance of $X$ and $Y$ given $Z$, $\| \mathscr{K} \|_{\mathrm{HS}}^2$, and rejects the hypothesis $X \independent Y \cond Z$ if this quantity is too large. We denote by $\langle f, g \rangle$ the inner product between $f$ and $g$. We compute the GHCM as follows.

  1. Regress $X$ on $Z$ and $Y$ on $Z$ yielding residuals $\hat{\varepsilon}$ and $\hat{\xi}$, respectively.
  2. Compute the test statistic $$ T = \frac{1}{n} \sum_{i=1}^n \sum_{j=1}^n \langle \hat{\varepsilon}_i, \hat{\varepsilon}_j \rangle \langle \hat{\xi}_i, \hat{\xi}_j \rangle. $$
  3. Compute the $n \times n$ matrix $\Gamma$ by setting $$ \Gamma_{ij} := \langle \hat{\varepsilon}_i, \hat{\varepsilon}_j \rangle \langle \hat{\xi}_i, \hat{\xi}_j \rangle. $$
  4. Compute the $d \leq n-1$ non-zero eigenvalues $(\lambda_k)_{k=1}^d$ of $$ C := \frac{1}{n-1} (\Gamma - J \Gamma - \Gamma J + J \Gamma J), $$ where $J$ is the $n \times n$ matrix with all entries equal to $1/n$.
  5. Produce a $p$-value by setting $$ p := \mathbb{P}\left(\sum_{k=1}^d \lambda_k \zeta_k^2 > T\right), $$ where $\zeta_1, \dots, \zeta_d$ are i.i.d. standard Gaussian random variables.
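
To make the recipe above concrete, here is a minimal sketch of steps 2--5 in plain R. It assumes the residuals are given as matrices eps and xi whose rows are observations on a common grid, so that Euclidean inner products apply, and it approximates the tail probability in step 5 by Monte Carlo simulation; the ghcm_test function itself may compute this probability differently.

# Minimal sketch of the GHCM test statistic and p-value (steps 2-5).
# eps, xi: n x p matrices of residuals, rows are observations.
ghcm_sketch <- function(eps, xi, n_sim = 1e4) {
  n <- nrow(eps)
  # Gamma_ij = <eps_i, eps_j> <xi_i, xi_j>, via the two Gram matrices.
  Gamma <- tcrossprod(eps) * tcrossprod(xi)
  T_stat <- sum(Gamma) / n
  J <- matrix(1 / n, n, n)
  C <- (Gamma - J %*% Gamma - Gamma %*% J + J %*% Gamma %*% J) / (n - 1)
  lambda <- eigen(C, symmetric = TRUE, only.values = TRUE)$values
  lambda <- lambda[lambda > 1e-10 * max(lambda)]  # keep the non-zero eigenvalues
  # Monte Carlo approximation of P(sum_k lambda_k zeta_k^2 > T).
  sims <- replicate(n_sim, sum(lambda * rnorm(length(lambda))^2))
  list(statistic = T_stat, p_value = mean(sims > T_stat))
}

For residual matrices on a common grid, ghcm_sketch(eps, xi) should roughly agree with the package's own test up to Monte Carlo error and implementation details.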

Assuming that the regression methods perform sufficiently well, the GHCM has approximately uniformly distributed $p$-values when the null is true. It should be noted that there are situations where $X \notindependent Y \cond Z$ but the GHCM is unable to detect this dependence for any sample size, since $\mathscr{K}$ can be zero in this case. For a simple scalar example with no $Z$, take $Y = S|X|$ for a random sign $S$ independent of $X$: then $X$ and $Y$ are dependent, yet their covariance (and hence $\mathscr{K}$) is zero.

The GHCM as implemented in the \code{ghcm_test} function computes the inner products above differently depending on the format of the given residuals. For residuals from scalar or multivariate variables, or from functional variables observed on a constant, fixed grid, the inner products are the usual Euclidean inner products and no further preprocessing is done. For residuals coming from irregularly observed functional variables, or from functions on a fixed grid with missing values, the inner products are $L^2$ inner products computed from spline interpolants of the curves. See the following sections for examples of both types of use.
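
As an illustration of the second case, the following sketch (not the package's actual implementation; the function name and arguments are ours) computes an approximate $L^2$ inner product of two irregularly observed curves from spline interpolants.

# Sketch: approximate L2 inner product of two irregularly observed curves,
# each given by argument values t and function values y, over `limits`.
l2_inner_spline <- function(t_f, y_f, t_g, y_g, limits = c(0, 1)) {
  f <- splinefun(t_f, y_f)  # cubic spline interpolant of the first curve
  g <- splinefun(t_g, y_g)  # cubic spline interpolant of the second curve
  integrate(function(t) f(t) * g(t), lower = limits[1], upper = limits[2])$value
}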

Example applications on a simulated dataset

The GHCM with regular functional data

To give concrete examples of the usage of the package, we perform conditional independence tests on a simulated data set consisting of both functional and scalar variables. The functional variables are observed on a common equidistant grid of $101$ points on $[0, 1]$.

library(ghcm)
set.seed(111)
data(ghcm_sim_data)
grid <- seq(0, 1, length.out=101)
colnames(ghcm_sim_data)

ghcm_sim_data consists of 500 observations of the scalar variables $Y_1$ and $Y_2$ and the functional variables $X$, $Z$ and $W$. The individual curves and the pointwise mean curve of each functional variable are shown in Figures \@ref(fig:plot-X), \@ref(fig:plot-Z) and \@ref(fig:plot-W).

matplot(grid, t(ghcm_sim_data$X), type="l",
        col=rgb(0,0,0,0.1), lty=1, xlab="",
        ylab="")
lines(grid, colMeans(ghcm_sim_data$X), col=2,
      lty=2, lwd=2)
title("X")
matplot(grid, t(ghcm_sim_data$Z), type="l",
        col=rgb(0,0,0,0.1), lty=1, xlab="",
        ylab="")
lines(grid, colMeans(ghcm_sim_data$Z), col=2,
      lty=2, lwd=2)
title("Z")
matplot(grid, t(ghcm_sim_data$W), type="l",
        col=rgb(0,0,0,0.1), lty=1, xlab="",
        ylab="")
lines(grid, colMeans(ghcm_sim_data$W), col=2,
      lty=2, lwd=2)
title("W")

In all of the upcoming examples we will use functions from the refund R-package [@refund] to perform regressions. The simulated data is generated so that the regressions we perform are well-specified, hence we do not have to worry about the performance of our regression methods here. However, in actual applications of the GHCM, it is critical that the regression methods employed estimate the conditional expectations $\mathbb{E}(X \cond Z)$ and $\mathbb{E}(Y \cond Z)$ sufficiently well for the $p$-values to be valid. Any use of the GHCM should therefore be prefaced by an analysis of the performance of the regression methods in use, for instance along the lines sketched below.
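
One possible (hypothetical) way to carry out such an analysis is a calibration check: simulate data resembling the application with the null enforced, run the full regression-plus-test pipeline, and check that the resulting $p$-values look approximately uniform. Here simulate_null_data is a placeholder the user would have to supply, and we assume the $p$-value is stored in the p element of the object returned by ghcm_test.

# Hypothetical calibration check: simulate_null_data() is a user-supplied
# placeholder generating data like the application, but with the null
# (here, Y_1 independent of Y_2 given Z) enforced.
library(refund)  # for pfr/lf
pvals <- replicate(100, {
  sim <- simulate_null_data()
  m_1 <- pfr(Y_1 ~ lf(Z), data = sim)
  m_2 <- pfr(Y_2 ~ lf(Z), data = sim)
  ghcm_test(resid(m_1), resid(m_2))$p
})
# With well-performing regressions the histogram should look roughly uniform.
hist(pvals, breaks = 10)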

Testing independence of scalar variables given functional variables

We first test whether $Y_1$ and $Y_2$ are conditionally independent given the functional variables. This is relevant if, say, we're trying to predict $Y_1$ and we want to know whether including $Y_2$ as a predictor would be helpful. A naive correlation-based approach would suggest that $Y_2$ could be relevant since:

cor(ghcm_sim_data$Y_1, ghcm_sim_data$Y_2)

To perform the conditional independence test, we need a scalar-on-function regression method; we use the pfr function from the refund package [@refund] with lf terms. We run the test in the following code:

library(refund)
m_1 <- pfr(Y_1 ~ lf(X) + lf(Z) + lf(W), data=ghcm_sim_data)
m_2 <- pfr(Y_2 ~ lf(X) + lf(Z) + lf(W), data=ghcm_sim_data)
test <- ghcm_test(resid(m_1), resid(m_2))
print(test)

We obtain a $p$-value of 0.958. Note that, since the asymptotic distribution of the test statistic depends on the underlying data distribution, the $p$-value cannot be determined from the value of the test statistic alone; for this reason the test statistic is not reported.

Testing independence of a scalar and functional variable given a functional variable

We now test whether $Y_1 \independent X \cond Z$. This is relevant if we're interested in modelling $Y_1$ and want to determine whether $X$ should be included in a model that already includes $Z$. We can plot $X$ and $Z$ and color the curves based on the value of $Y_1$ as can be seen in Figures \@ref(fig:Z-Y-plot) and \@ref(fig:X-Y-plot) below.

library(ggplot2)
library(reshape2)

Z_df <- as.data.frame(ghcm_sim_data$Z)
colnames(Z_df) <- grid
Z_df$Y_1 <- ghcm_sim_data$Y_1
Z_df$id <- 1:nrow(ghcm_sim_data)
tmp <- melt(Z_df, id.vars=c("id", "Y_1"), variable.name="grid")
tmp$grid <- as.numeric(levels(tmp$grid))[tmp$grid]  # factor labels back to numeric


ggplot(tmp, aes(x=grid, y=value, group=id)) + geom_line(aes(color=Y_1)) +
  theme_bw() + scale_x_continuous(name="") + scale_y_continuous(name="") +
  scale_color_gradient(name=expression(Y[1]))

library(dplyr)
library(tidyr)

X_df <- as.data.frame(ghcm_sim_data$X)
colnames(X_df) <- grid
X_df$Y_1 <- ghcm_sim_data$Y_1
X_df$id <- 1:nrow(ghcm_sim_data)
tmp <- melt(X_df, id.vars=c("id", "Y_1"), variable.name="grid")
tmp$grid <- as.numeric(levels(tmp$grid))[tmp$grid]  # factor labels back to numeric


ggplot(tmp, aes(x=grid, y=value, group=id)) + geom_line(aes(color=Y_1)) +
  theme_bw() + scale_x_continuous(name="") + scale_y_continuous(name="") +
  scale_color_gradient(name=expression(Y[1]))

It appears that both of the functional variables contain information about $Y_1$. To use the GHCM for this test we need, in addition to the scalar-on-function regression employed in the previous section, a function-on-function regression method; we use the pffr function in the refund package [@refund] with ff terms. We run the test in the following code:

m_1 <- pfr(Y_1 ~ lf(Z), data = ghcm_sim_data)
m_X <- pffr(X ~ ff(Z), data = ghcm_sim_data, chunk.size = 31000)
test <- ghcm_test(resid(m_X), resid(m_1))
print(test)

We obtain a $p$-value of 0.865.

Testing independence of functional variables given a functional variable

Finally, we test whether $X \independent W \cond Z$, which could be relevant in creating prediction models for $X$ or $W$ or in simply ascertaining the relationships between the functional variables. We run the test in the following code:

m_X <- pffr(X ~ ff(Z), data=ghcm_sim_data, chunk.size=31000)
m_W <- pffr(W ~ ff(Z), data=ghcm_sim_data, chunk.size=31000)
test <- ghcm_test(resid(m_X), resid(m_W))
print(test)

We obtain a $p$-value of $0.6304$.

The GHCM with irregular functional data

To illustrate the GHCM on irregular functional data, we load a version of the simulated data set from before in which only a random subset of the points on each $X$ and $W$ curve is observed. Although $Z$ remains regularly observed in the examples below, the GHCM will still control the type I error rate provided the functional regression method employed can handle irregular predictors.

data(ghcm_sim_data_irregular)

ghcm_sim_data_irregular is a list containing $Y_1$, $Y_2$ and $Z$ as before. ghcm_sim_data_irregular also contains subsamples of $X$ and $W$ in a particular "melted" format:

head(ghcm_sim_data_irregular$X)

The \code{.obs} column denotes which curve the observation comes from, the \code{.index} column denotes the function argument and the \code{.value} column denotes the function value. We can compare the fully observed curves with their irregularly observed counterparts in Figures \@ref(fig:plot-X-irregular) and \@ref(fig:plot-W-irregular).
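
Before doing so, here is a small made-up example of the melted format to make it concrete:

# Toy example of the melted format: curve 1 observed at three points,
# curve 2 at two points (values are made up for illustration).
toy_irregular <- data.frame(
  .obs   = c(1, 1, 1, 2, 2),
  .index = c(0.1, 0.5, 0.9, 0.2, 0.8),
  .value = c(0.3, -0.1, 0.7, 1.2, 0.4)
)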

regular_X <- melt(ghcm_sim_data$X, varnames = c(".obs", ".index"), value.name = ".value")
regular_X$.index <- grid[regular_X$.index]
plot_df <- left_join(regular_X, ghcm_sim_data_irregular$X %>% mutate(subsampled = TRUE)) %>%
  replace_na(list(subsampled = FALSE))
plot_df %>% subset(.obs <= 5) %>%
  ggplot(aes(x = .index, y = .value, color = as.factor(.obs))) +
  geom_point(aes(shape = subsampled, size = subsampled)) + geom_line(alpha = 0.3) +
  scale_size_manual(values = c(0.5, 2)) + guides(color = "none", size = "none", shape = "none") +
  scale_x_continuous(name = NULL) + scale_y_continuous(name = NULL) + theme_bw()
regular_W <- melt(ghcm_sim_data$W, varnames = c(".obs", ".index"), value.name = ".value")
regular_W$.index <- grid[regular_W$.index]
plot_df <- left_join(regular_W, ghcm_sim_data_irregular$W %>% mutate(subsampled = TRUE)) %>%
  replace_na(list(subsampled = FALSE))
plot_df %>% subset(.obs <= 5) %>%
  ggplot(aes(x = .index, y = .value, color = as.factor(.obs))) +
  geom_point(aes(shape = subsampled, size = subsampled)) + geom_line(alpha = 0.3) +
  scale_size_manual(values = c(0.5, 2)) + guides(color = "none", size = "none", shape = "none") +
  scale_x_continuous(name = NULL) + scale_y_continuous(name = NULL) + theme_bw()

Testing independence of a scalar and functional variable given a functional variable

We now repeat the analysis from earlier on the subsampled data and test whether $Y_1 \independent X \cond Z$. We still use the pffr function from the refund package to do the function-on-function regression, although the specification is different now that $X$ is no longer regularly observed. Since the residuals of $X$ are irregularly observed, we also supply the X_limits argument to ghcm_test, which specifies the endpoints of the domain over which the spline-based $L^2$ inner products are computed. We run the test in the following code:

n <- nrow(ghcm_sim_data_irregular$Z)
Z_df <- data.frame(.obs=1:n)
Z_df$Z <- ghcm_sim_data_irregular$Z
m_1 <- pfr(Y_1 ~ lf(Z), data = ghcm_sim_data_irregular)
m_X <- pffr(X ~ ff(Z), ydata = ghcm_sim_data_irregular$X, data=Z_df, chunk.size=31000)
test <- ghcm_test(resid(m_X), resid(m_1), X_limits=c(0, 1))
print(test)

We obtain a $p$-value of $0.883$, which is very similar to the $p$-value of $0.865$ obtained from the regularly observed data.

Testing independence of functional variables given a functional variable

Finally, we repeat the test of whether $X \independent W \cond Z$ on the irregular data. Since both sets of residuals are now irregularly observed, we supply both X_limits and Y_limits. We run the test in the following code:

n <- nrow(ghcm_sim_data_irregular$Z)
Z_df <- data.frame(.obs=1:n)
Z_df$Z <- ghcm_sim_data_irregular$Z

m_X <- pffr(X ~ ff(Z), ydata = ghcm_sim_data_irregular$X, data=Z_df, chunk.size=31000)
m_W <- pffr(W ~ ff(Z), ydata = ghcm_sim_data_irregular$W, data=Z_df, chunk.size=31000)
test <- ghcm_test(resid(m_X), resid(m_W), X_limits=c(0, 1), Y_limits=c(0, 1))
print(test)

We obtain a $p$-value of $0.592$, which is again similar to the $p$-value of $0.6304$ obtained from the regularly observed data.

Bibliography


