knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
Using a library of base kernels, CVEK learns the generating function from data by directly minimizing the ensemble model's error, and tests whether the data is generated by the RKHS under the null hypothesis. Part I presents a simple example to conduct Gaussian process regression and hypothesis testing using cvek function on simulated data. Part II shows a real-world application where we use CVEK to understand whether the per capita crime rate impacts the relationship between the local socioeconomic status and the housing price at Boston, MA, U.S.A. Finally, Part III provides code showing how to visualize the interaction effect from a cvek model.
Download the package from CRAN or GitHub and then install and load it.
library(CVEK) library(ggplot2) library(ggrepel)
We generate a simulated dataset using the $"linear"$ kernel, and set the relative interaction strength to be $0.2$. The outcome $y_i$ is generated as, \begin{align} y_i=h_1(\mathbf{x}{i, 1})+h_2(\mathbf{x}{i, 2})+0.2 * h_{12}(\mathbf{x}{i, 1}, \mathbf{x}{i, 2})+\epsilon_i, \end{align} where $h_1$, $h_2$, $h_{12}$ are sampled from RKHSs $\textit{H}1$, $\textit{H}_2$, $\textit{H}{12}$, generated using the corresponding linear kernel. We standardize all sampled functions to have unit form, so that $0.2$ represents the strength of interaction relative to the main effect.
set.seed(0726) n <- 60 # including training and test d <- 4 int_effect <- 0.2 data <- matrix(rnorm(n * d), ncol = d) Z1 <- data[, 1:2] Z2 <- data[, 3:4] kern <- generate_kernel(method = "linear") w <- rnorm(n) w12 <- rnorm(n) K1 <- kern(Z1, Z1) K2 <- kern(Z2, Z2) K1 <- K1 / sum(diag(K1)) # standardize kernel K2 <- K2 / sum(diag(K2)) h0 <- K1 %*% w + K2 %*% w h0 <- h0 / sqrt(sum(h0 ^ 2)) # standardize main effect h1_prime <- (K1 * K2) %*% w12 # interaction effect # standardize sampled functions to have unit norm, so that 0.2 # represents the interaction strength relative to main effect Ks <- svd(K1 + K2) len <- length(Ks$d[Ks$d / sum(Ks$d) > .001]) U0 <- Ks$u[, 1:len] h1_prime_hat <- fitted(lm(h1_prime ~ U0)) h1 <- h1_prime - h1_prime_hat h1 <- h1 / sqrt(sum(h1 ^ 2)) # standardize interaction effect Y <- h0 + int_effect * h1 + rnorm(1) + rnorm(n, 0, 0.01) data <- as.data.frame(cbind(Y, Z1, Z2)) colnames(data) <- c("y", paste0("z", 1:d)) data_train <- data[1:40, ] data_test <- data[41:60, ]
The resulting data look as follows.
knitr::kable(head(data_train, 5))
Now we can apply the cvek function to conduct Gaussian process regression. Below table is a detailed list of all the arguments of the function cvek.
knitr::include_graphics("table1.pdf", auto_pdf = TRUE)
Suppose we want our kernel library to contain three kernels: $"linear"$, $"polynomial"$ with $p=2$, and $"rbf"$ with $l=1$ (the effective parameter for $"polynomial"$ is $p$ and the effective parameter for $"rbf"$ is $l$, so we can set anything to $l$ for $"polynomial"$ kernel and $p$ for $"rbf"$ kernel). We then first apply define_library.
kern_par <- data.frame(method = c("linear", "polynomial", "rbf"), l = rep(1, 3), p = 1:3, stringsAsFactors = FALSE) # define kernel library kern_func_list <- define_library(kern_par)
The null model is then $y \sim z1 + z2 + k(z3, z4)$.
formula <- y ~ z1 + z2 + k(z3, z4)
With all these parameters specified, we can conduct Gaussian process regression.
est_res <- cvek(formula, kern_func_list = kern_func_list, data = data_train) est_res$lambda est_res$u_hat
We can see that the ensemble weight assigns $0.99$ to the $"linear"$ kernel, which is the true kernel. This illustrates the accuracy and efficiency of the CVEK method.
We next specify the testing procedure. Note that we can use the same function cvek to perform hypothesis testing, as we did for estimation, but we need to provide $formula_test$, which is the user-supplied formula indicating the additional alternative effect (e.g., interactions) to test for. Specifically, we will first show how to conduct the classic score test by specifying $test="asymp"$, followed by a bootstrap test where we specify $test="boot"$, and the number of bootstrap samples $B=200$.
formula_test <- y ~ k(z1, z2):k(z3, z4) cvek(formula, kern_func_list = kern_func_list, data = data_train, formula_test = formula_test, mode = "loocv", strategy = "stack", beta_exp = 1, lambda = exp(seq(-10, 5)), test = "asymp", alt_kernel_type = "ensemble", verbose = FALSE)$pvalue cvek(formula, kern_func_list = kern_func_list, data = data_train, formula_test = formula_test, mode = "loocv", strategy = "stack", beta_exp = 1, lambda = exp(seq(-10, 5)), test = "boot", alt_kernel_type = "ensemble", B = 200, verbose = FALSE)$pvalue
Both tests come to the same conclusion. At the significance level $0.05$, we reject the null hypothesis that there's no interaction effect, which matches our data generation mechanism. Additionally, we can prediction new outcomes based on estimation result est_res.
y_pred <- predict(est_res, data_test[, 2:5]) data_test_pred <- cbind(y_pred, data_test)
knitr::kable(head(data_test_pred, 5))
In this part, we show an example of using cvek test to detect nonlinear interactions between socioeconomic factors that contribute to housing price in the city of Boston, Massachusetts, USA. We consider the Boston dataset (available in the MASS package), which is collected by the U.S Census Service about the median housing price ($medv$) in Boston, along with additional variables describing local socioeconomic information such as per capita crime rate, proportion of non-retail business, number of rooms per household, etc. Below table lists the $14$ variables.
knitr::include_graphics("table2.pdf", auto_pdf = TRUE)
Here we use cvek to study whether the per capita crime rate ($crim$) impacts the relationship between the local socioeconomic status ($lstat$) and the housing price. The null model is, \begin{align} medv \sim \mathbf{x}^\top \boldsymbol{\beta} + k(crim) + k(lstat), \end{align} where $\mathbf{x}^\top=(1, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, black)$, and $k()$ is specified as a semi-parametric model with a model library that includes linear and rbf kernels with $l=1$. This inclusion of nonlinearity (i.e., the rbf kernel) is important, since per classic results in the macroeconmics literature, the crime rates and socioeconomic status of local community are known to have nonlinear association with the local housing price harrison_hedonic_1978.
kern_par <- data.frame(method = c("linear", "rbf"), l = rep(1, 2), p = 1:2, stringsAsFactors = FALSE) # define kernel library kern_func_list <- define_library(kern_par)
To this end, the hypothesis regarding whether the crime rate ($crim$) impacts the association between local socioeconomic status ($lstat$) and the housing price ($medv$) is equivalent to testing whether there exists a nonlinear interaction between $crim$ and $lstat$ in predicting $medv$, i.e., \begin{align} \mathcal{H}_0: &\; medv \sim \mathbf{x}^\top \boldsymbol{\beta} + k(crim) + k(lstat), \ \mathcal{H}_a: &\; medv \sim \mathbf{x}^\top \boldsymbol{\beta} + k(crim) + k(lstat) + k(crim):k(lstat). \end{align}
To test this hypothesis using cvek, we specify the null model using formula, and specify the additional interaction term ($k(crim):k(lstat)$) in the alternative model using formula_test, as shown below:
formula <- medv ~ zn + indus + chas + nox + rm + age + dis + rad + tax + ptratio + black + k(crim) + k(lstat) formula_test <- medv ~ k(crim):k(lstat) fit_bos <- cvek(formula, kern_func_list = kern_func_list, data = Boston, formula_test = formula_test, lambda = exp(seq(-3, 5)), test = "asymp")
Given the fitted object (fit_bos), the p-value of the cvek test can be extracted as below:
fit_bos$pvalue
Since $p<0.05$, we reject the null hypothesis that there's no $crim:lstat$ interaction, and conclude that the data does suggest an impact of the crime rate on the relationship between the local socioeconomic status and the housing price.
A versatile and sometimes the most intepretable method for understanding interaction effects is via plotting. Next we show the Boston example of how to visualize the fitted interaction from a cvek model. We visualize the interaction effects by creating five datasets: Fix all confounding variables to their means, vary $lstat$ in a reasonable range (i.e., from $12.5$ to $17.5$, since the original range of $lstat$ in Boston dataset is $(1.73, 37.97)$), and respectively set $crim$ value to its $5\%, 25\%, 50\%, 75\%$ and $95\%$ quantiles.
# first fit the alternative model formula_alt <- medv ~ zn + indus + chas + nox + rm + age + dis + rad + tax + ptratio + black + k(crim):k(lstat) fit_bos_alt <- cvek(formula = formula_alt, kern_func_list = kern_func_list, data = Boston, lambda = exp(seq(-3, 5))) # mean-center all confounding variables not involved in the interaction # so that the predicted values are more easily interpreted pred_name <- c("zn", "indus", "chas", "nox", "rm", "age", "dis", "rad", "tax", "ptratio", "black") covar_mean <- apply(Boston, 2, mean) pred_cov <- covar_mean[pred_name] pred_cov_df <- t(as.data.frame(pred_cov)) lstat_list <- seq(12.5, 17.5, length.out = 100) crim_quantiles <- quantile(Boston$crim, probs = c(.05, .25, .5, .75, .95)) # crim is set to its 5% quantile data_test1 <- data.frame(pred_cov_df, lstat = lstat_list, crim = crim_quantiles[1]) data_test1_pred <- predict(fit_bos_alt, data_test1) # crim is set to its 25% quantile data_test2 <- data.frame(pred_cov_df, lstat = lstat_list, crim = crim_quantiles[2]) data_test2_pred <- predict(fit_bos_alt, data_test2) # crim is set to its 50% quantile data_test3 <- data.frame(pred_cov_df, lstat = lstat_list, crim = crim_quantiles[3]) data_test3_pred <- predict(fit_bos_alt, data_test3) # crim is set to its 75% quantile data_test4 <- data.frame(pred_cov_df, lstat = lstat_list, crim = crim_quantiles[4]) data_test4_pred <- predict(fit_bos_alt, data_test4) # crim is set to its 95% quantile data_test5 <- data.frame(pred_cov_df, lstat = lstat_list, crim = crim_quantiles[5]) data_test5_pred <- predict(fit_bos_alt, data_test5) # combine five sets of prediction data together medv <- rbind(data_test1_pred, data_test2_pred, data_test3_pred, data_test4_pred, data_test5_pred) data_pred <- data.frame(lstat = rep(lstat_list, 5), medv = medv, crim = rep(c("5% quantile", "25% quantile", "50% quantile", "75% quantile", "95% quantile"), each = 100)) data_pred$crim <- factor(data_pred$crim, levels = c("5% quantile", "25% quantile", "50% quantile", "75% quantile", "95% quantile")) data_label <- data_pred[which(data_pred$lstat == 17.5), ] data_label$value <- c("0.028%", "0.082%", "0.257%", "3.677%", "15.789%") data_label$value <- factor(data_label$value, levels = c("0.028%", "0.082%", "0.257%", "3.677%", "15.789%")) ggplot(data = data_pred, aes(x = lstat, y = medv, color = crim)) + geom_point(size = 0.1) + geom_text_repel(aes(label = value), data = data_label, color = "black", size = 3.6) + scale_colour_manual(values = c("firebrick1", "chocolate2", "darkolivegreen3", "skyblue2", "purple2")) + geom_line() + theme_set(theme_bw()) + theme(panel.grid = element_blank(), axis.title.x = element_text(size = 12), axis.title.y = element_text(size = 12), legend.title = element_text(size = 12, face = "bold"), legend.text = element_text(size = 12)) + labs(x = "percentage of lower status", y = "median value of owner-occupied homes ($1000)", col = "per capita crime rate")
The figure above shows the $medv$ - $lstat$ relationship under different levels of $crim$. Numbers at the end of each curves indicate the actual values of $crim$ rate (per capita crime rate by town) at the corresponding quantiles. From the figure we see that crime rate does impact the relationship between the local socioeconomic status v.s. housing price. Building on this code, user can continue to refine the visualization (e.g., by adding in confidence levels) and use it to improve the the model fit based on domain knowledge (e.g., by experimenting different kernels / hyper-parameters).
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.