In this article, we use the student performance dataset to demonstrate how to perform subgroup analysis and variable selection simultaneously via the package RSAVS.
NOTE: This article was written with RSAVS version 0.1.2.
library(ggplot2)
library(dplyr)
library(skimr)
remotes::install_github("fenguoerbian/RSAVS", ref = "v0.1.2")
library(RSAVS)
This student performance dataset is about the grade evaluations of 1044 Portuguese students from two core classes (Math and Portuguese). There are 395 observations in the math data frame mat_df and 649 observations in the Portuguese data frame por_df. The full dataset full_df is simply the combination of these two data frames, resulting in 1044 observations. You can find details about these datasets via ?full_df.
In the later analysis, we will use the full dataset full_df. The basic information of this dataset is shown below:
skimr::skim(full_df)
Our interest is G3, the final grade, and we want to understand how it is related to the covariates. Note that although there seem to be several numeric variables in this dataset, most of them are actually binary or nominally coded.
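For example, a quick illustrative check (added here for exposition, not part of the original analysis) is to count the distinct values in each numeric column; small counts point to binary or nominal coding rather than a truly continuous scale.

# count the distinct values of each numeric column
# (columns with very few distinct values are binary / nominal in disguise)
full_df %>%
  summarise(across(where(is.numeric), n_distinct)) %>%
  t()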
First of all, we project the data onto the space spanned by most of the binary and nominal variables in the dataset. This gives us a basic understanding of how the average final grade is distributed among the different observed subgroups.
fit1 <- lm(G3 ~ school + address + Pstatus + famsize + schoolsup + famsup +
             activities + paid + internet + nursery + higher + romantic +
             Mjob + Fjob + guardian + reason,
           data = full_df)
plot(density(fit1$fitted.values))
It looks like most of the grades are centered between 10 and 14, with a heavy tail at the lower end.
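A quick numeric summary (added here for illustration) confirms this impression:

# five-number summary (plus mean) of the subgroup-average grades
summary(fit1$fitted.values)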
Next, we explore the relationship between the individual-level covariates and the average grade of the observed subgroups via OLS.
fit2 <- lm(fit1$fitted.values ~ sex + age + Medu + Fedu + traveltime +
             studytime + failures + famrel + freetime + goout + Dalc +
             Walc + health + absences,
           data = full_df)
plot(density(fit2$fitted.values))
The fitted values do not tell us much, but the residual plot is more revealing:
ggplot(fit2, aes(fit2$residuals)) +
  geom_density() +
  xlab(NULL)
The residuals are skewed and potentially multimodal, which indicates that a robust regression, along with a latent subgroup structure, would be appropriate. Thus, in this study we analyze the dataset with the method proposed in this package, using the L1 loss and the SCAD penalty.
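For readers unfamiliar with SCAD, the sketch below plots the penalty function of Fan and Li (2001) with the conventional choice a = 3.7. This is purely illustrative: RSAVS has its own internal implementation, and the helper scad_penalty is our own name, not part of the package.

# illustrative SCAD penalty (Fan & Li, 2001); not part of RSAVS
scad_penalty <- function(theta, lambda, a = 3.7){
  theta <- abs(theta)
  ifelse(theta <= lambda,
         lambda * theta,    # linear (lasso-like) near zero
         ifelse(theta <= a * lambda,
                (2 * a * lambda * theta - theta ^ 2 - lambda ^ 2) / (2 * (a - 1)),    # quadratic transition
                lambda ^ 2 * (a + 1) / 2))    # constant for large |theta|, hence nearly unbiased
}
curve(scad_penalty(x, lambda = 1), from = -5, to = 5,
      xlab = expression(theta), ylab = "SCAD penalty")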
Before the analysis, we randomly sample 100 observations as the test dataset, denoted $D_2$, and use the remaining 944 observations as the training dataset $D_1$.
# prepare the full dataset
y_vec_full <- fit1$fitted.values
x_mat_full <- stats::model.matrix(fit2)[, -1]    # do not include the intercept term!
# sample 100 observations as the test set
test_num <- 100
set.seed(1024)
test_idx <- sample(1 : nrow(full_df), size = test_num, replace = FALSE)
# get the training set
x_mat <- x_mat_full[-test_idx, , drop = FALSE]
y_vec <- y_vec_full[-test_idx]
n <- nrow(x_mat)
p <- ncol(x_mat)
# get the test set
y_vec_test <- y_vec_full[test_idx]
x_mat_test <- x_mat_full[test_idx, , drop = FALSE]
Set some basic variables for the proposed algorithm
lam1_len <- 40    # length of lam1_vec
lam2_len <- 30    # length of lam2_vec
phi <- 5    # constant needed for mBIC
Then we can apply our method to the training set. By default, it uses the L1 loss function coupled with the SCAD penalty. You can use ?RSAVS_LargeN to find out more about this function.
# NOTE: this will take some time!
res <- RSAVS_LargeN(y_vec = y_vec, x_mat = x_mat,
                    lam1_length = lam1_len, lam2_length = lam2_len,
                    phi = phi, tol = 0.001, max_iter = 100)
The proposed algorithm performs subgroup analysis and variable selection simultaneously via regularization. It searches over the grid spanned by lam1 and lam2, then chooses the best solution according to a modified BIC (mBIC) criterion.
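As a quick illustrative diagnostic, we can look at how the mBIC values spread over the solution plane. The only assumption here is that res$bic_mat stores one mBIC value per (lam1, lam2) grid point, which is consistent with how it is used later in this article.

# distribution of the mBIC values over the (lam1, lam2) grid
# (assumes res$bic_mat holds one mBIC value per grid point)
hist(as.vector(res$bic_mat), breaks = 30,
     xlab = "mBIC", main = "mBIC over the solution plane")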
We can check the covariate effects with
best_ind <- res$best_ind
mu_new <- res$mu_improve_mat[best_ind, ]
beta_new <- res$w_mat[best_ind, ]
names(beta_new) <- colnames(x_mat)
beta_new
and the estimated subgroup information with
table(mu_new)
We can also visualize the density plot of G3 for the identified subgroups:
plot(density(y_vec))
for(i in 1 : length(unique(mu_new))){
  id <- which(mu_new == unique(mu_new)[i])
  abline(v = mean(y_vec[id]))    # solid line: within-group mean of the response
  abline(v = unique(mu_new)[i], lty = 2)    # dashed line: estimated subgroup effect
}
As we can see, the algorithm gives us a reasonable partition of the observations, but the estimated subgroup effects mu_new are nearly indistinguishable from each other because the penalty functions shrink them together.
We can further improve these estimates by running an ordinary robust regression without any penalty functions, given the parameter structure estimated above. We can perform this post-selection estimation over the whole solution plane. (Since there are both lam1 and lam2, the regular solution path becomes a solution plane.)
mu_further_improve_mat <- matrix(0, nrow = lam1_len * lam2_len, ncol = n)
beta_further_improve_mat <- matrix(0, nrow = lam1_len * lam2_len, ncol = p)
bic_vec <- rep(0, lam1_len * lam2_len)
max_bic <- max(res$bic_mat)
for(i in 1 : (lam1_len * lam2_len)){
  tmp <- RSAVS_Further_Improve(y_vec = y_vec, x_mat = x_mat,
                               mu_vec = res$mu_improve_mat[i, ],
                               beta_vec = res$w_mat[i, ])
  mu_new <- as.vector(tmp$mu_vec)
  beta_new <- tmp$beta_vec
  mu_further_improve_mat[i, ] <- mu_new
  beta_further_improve_mat[i, ] <- beta_new
  bic_vec[i] <- RSAVS:::RSAVS_Compute_BIC(y_vec, x_mat, beta_new, mu_new,
                                          loss_type = "1", phi = phi)
  # update bic according to the complexity upper bound
  current_group_num <- length(unique(mu_new))
  current_active_num <- sum(beta_new != 0)
  if((current_active_num + current_group_num) >= 11){
    bic_vec[i] <- max_bic
  }
}
best_id <- which.min(bic_vec)
if(best_id == 1){    # skip the first grid point if it happens to be selected
  best_id <- which.min(bic_vec[-1]) + 1
}
mu_new <- mu_further_improve_mat[best_id, ]
beta_new <- beta_further_improve_mat[best_id, ]
names(beta_new) <- colnames(x_mat)
After this further-improvement procedure, the estimated covariate effects are
beta_new
and the identified subgroups are
table(mu_new)
We can visualize the estimated subgroups. First we just need to prepare a data.frame for ggplot.
Mu_to_ID <- function(mu_vec, mu_unique){
  # Convert the original mu vector to a vector of the same length,
  # with each entry replaced by its index in unique(mu_vec),
  # or in the provided mu_unique
  if(missing(mu_unique)){
    mu_unique <- unique(mu_vec)
  }
  mu_vec <- sapply(mu_vec, FUN = function(x, unique_vec){
    return(which(unique_vec == x))
  }, unique_vec = mu_unique)
  return(mu_vec)
}

rsavs_res <- data.frame(target = y_vec,
                        fitted = mu_new + x_mat %*% beta_new,
                        residual = y_vec - mu_new - x_mat %*% beta_new,
                        mu_new = mu_new,
                        mu_id = as.factor(Mu_to_ID(mu_new, unique(mu_new))),
                        course = c(rep("por", nrow(por_df)), rep("mat", nrow(mat_df)))[-test_idx])
Now we can show the density plot of G3 given the estimated latent subgroups, with the dashed lines marking the median of each group.
rsavs_res %>%
  group_by(mu_id) %>%
  mutate(subgroup_median = median(fitted)) %>%
  ggplot(aes(target, fill = mu_id, color = mu_id, group = mu_id)) +
  geom_density(alpha = 0.2) +
  geom_vline(aes(xintercept = subgroup_median, color = mu_id), linetype = "dashed") +
  xlab("Response") +
  labs(color = "Subgroup", fill = "Subgroup")
Based on the results from the training set, we can propose a model of the form
$$
y_i = \alpha + \delta d_i + \text{Medu}_i \times \beta_1 + \text{Age}_i \times \beta_2 + \varepsilon_i
$$
where $y_i$ refers to the G3
of the i-th subject and $d_i = 0$ if the i-th subject belongs to the subgroup 1 and otherwise $d_i = 0$. $\delta$ is the difference between these 2 subgroups.
We use the regular K-means method to partition the dataset $D_2$ into two clusters with respect to the adjusted response, i.e., the response with the covariate effects removed, using the coefficients estimated by our method on the training dataset $D_1$:
# residual_test is the adjusted response with the covariate effects removed
residual_test <- y_vec_test - x_mat_test %*% beta_new
# cluster residual_test into 2 groups with regular k-means
tmp_kmeans <- stats::kmeans(residual_test, centers = 2)
group_indicator <- tmp_kmeans$cluster - 1
# form the simplified covariate matrix with only the active covariates
x_mat_test_active <- x_mat_test[, which(beta_new != 0), drop = FALSE]
We can make inference about the parameters of this model with a simplified robust regression:
rqfit <- quantreg::rq(y_vec_test ~ group_indicator + x_mat_test_active)
summary(rqfit, alpha = 0.05)
At the 95% confidence level, the results imply that younger students may outperform older ones, and higher G3 scores can also be expected for children whose mothers received higher levels of education.
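To see the interval estimates behind this statement, one can inspect the coefficient table of the rq summary directly (for a sample of this size, summary.rq reports rank-inversion confidence bounds by default):

# coefficient table with 95% confidence bounds
summary(rqfit, alpha = 0.05)$coefficients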