In this article, we use the student performance dataset to demonstrate how to perform subgroup analysis and variable selection simultaneously via the package RSAVS.
NOTE: This article was written with RSAVS version 0.1.2.
library(ggplot2)
library(dplyr)
library(skimr)
remotes::install_github("fenguoerbian/RSAVS", ref = "v0.1.2")
library(RSAVS)
This student performance dataset is about the grade evaluations of 1044 Portuguese students from two core classes (Math and Portuguese). There are 395 observations in the math data frame mat_df and 649 observations in the Portuguese data frame por_df. The full dataset full_df is simply the combination of these two data frames, resulting in 1044 observations. You can find details about these datasets via ?full_df.
In the later analysis, we will use the full dataset full_df. The basic information of this dataset is shown below:
skimr::skim(full_df)
Our interest is G3, the final grade, and we want to understand how it is related to the covariates. Note that although there seem to be several numeric variables in this dataset, most of them are actually binary or nominally coded.
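For example, a quick illustrative check (added here for exposition, not part of the original analysis) is to count the distinct values in each numeric column; small counts point to binary or nominal coding rather than a truly continuous scale.

# count the distinct values of each numeric column
# (columns with very few distinct values are binary / nominal in disguise)
full_df %>%
  summarise(across(where(is.numeric), n_distinct)) %>%
  t()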
First of all, we project the data onto the space spanned by most of the binary and nominal variables in the dataset. This gives us a basic understanding of how the average final grade is distributed among the different observed subgroups.
fit1 <- lm(G3 ~ school + address + Pstatus + famsize + schoolsup + famsup +
             activities + paid + internet + nursery + higher + romantic +
             Mjob + Fjob + guardian + reason,
           data = full_df)
plot(density(fit1$fitted.values))
It looks like most of the grades are centered between 10 and 14, with a heavy tail at the lower end.
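A quick numeric summary (added here for illustration) confirms this impression:

# five-number summary (plus mean) of the subgroup-average grades
summary(fit1$fitted.values)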
Next, we explore the relationship between the individual-level covariates and the average grade of the observed subgroups via OLS.
fit2 <- lm(fit1$fitted.values ~ sex + age + Medu + Fedu + traveltime +
             studytime + failures + famrel + freetime + goout + Dalc +
             Walc + health + absences,
           data = full_df)
plot(density(fit2$fitted.values))
The fitted values do not tell us much, but the residual plot is more revealing:
ggplot(fit2, aes(fit2$residuals)) +
  geom_density() +
  xlab(NULL)
The residuals are skewed and potentially multimodal, which indicates that a robust regression, along with a latent subgroup structure, would be appropriate. Thus, in this study we analyze the dataset with the method proposed in this package, using the L1 loss and the SCAD penalty.
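For readers unfamiliar with SCAD, the sketch below plots the penalty function of Fan and Li (2001) with the conventional choice a = 3.7. This is purely illustrative: RSAVS has its own internal implementation, and the helper scad_penalty is our own name, not part of the package.

# illustrative SCAD penalty (Fan & Li, 2001); not part of RSAVS
scad_penalty <- function(theta, lambda, a = 3.7){
  theta <- abs(theta)
  ifelse(theta <= lambda,
         lambda * theta,    # linear (lasso-like) near zero
         ifelse(theta <= a * lambda,
                (2 * a * lambda * theta - theta ^ 2 - lambda ^ 2) / (2 * (a - 1)),    # quadratic transition
                lambda ^ 2 * (a + 1) / 2))    # constant for large |theta|, hence nearly unbiased
}
curve(scad_penalty(x, lambda = 1), from = -5, to = 5,
      xlab = expression(theta), ylab = "SCAD penalty")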
Before the analysis, we randomly sample 100 observations as the test dataset, denoted $D_2$, and use the remaining 944 observations as the training dataset $D_1$.
# prepare the full dataset
y_vec_full <- fit1$fitted.values
x_mat_full <- stats::model.matrix(fit2)[, -1]    # do not include the intercept term!
# sample 100 observations as the test set
test_num <- 100
set.seed(1024)
test_idx <- sample(1 : nrow(full_df), size = test_num, replace = FALSE)
# get the training set
x_mat <- x_mat_full[-test_idx, , drop = FALSE]
y_vec <- y_vec_full[-test_idx]
n <- nrow(x_mat)
p <- ncol(x_mat)
# get the test set
y_vec_test <- y_vec_full[test_idx]
x_mat_test <- x_mat_full[test_idx, , drop = FALSE]
Set some basic variables for the proposed algorithm
lam1_len <- 40    # length of lam1_vec
lam2_len <- 30    # length of lam2_vec
phi <- 5    # constant needed for mBIC
Then we can apply our method to the training set. By default, it uses the L1 loss function coupled with the SCAD penalty. You can use ?RSAVS_LargeN to find out more about this function.
# NOTE: this will take some time!
res <- RSAVS_LargeN(y_vec = y_vec, x_mat = x_mat,
                    lam1_length = lam1_len, lam2_length = lam2_len,
                    phi = phi, tol = 0.001, max_iter = 100)
The proposed algorithm performs subgroup analysis and variable selection simultaneously via regularization. It searches over the grid spanned by lam1 and lam2, then chooses the best solution according to a modified BIC (mBIC) criterion.
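As a quick illustrative diagnostic, we can look at how the mBIC values spread over the solution plane. The only assumption here is that res$bic_mat stores one mBIC value per (lam1, lam2) grid point, which is consistent with how it is used later in this article.

# distribution of the mBIC values over the (lam1, lam2) grid
# (assumes res$bic_mat holds one mBIC value per grid point)
hist(as.vector(res$bic_mat), breaks = 30,
     xlab = "mBIC", main = "mBIC over the solution plane")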
We can check the covariate effects with
best_ind <- res$best_ind
mu_new <- res$mu_improve_mat[best_ind, ]
beta_new <- res$w_mat[best_ind, ]
names(beta_new) <- colnames(x_mat)
beta_new
and the estimated subgroup information with
table(mu_new)
We can also visualize the density plot of G3 for the identified subgroups:
plot(density(y_vec))
for(i in 1 : length(unique(mu_new))){
  id <- which(mu_new == unique(mu_new)[i])
  abline(v = mean(y_vec[id]))    # solid line: within-group mean of the response
  abline(v = unique(mu_new)[i], lty = 2)    # dashed line: estimated subgroup effect
}
As we can see, the algorithm gives us a reasonable partition of the observations, but the estimated subgroup effects mu_new are nearly indistinguishable from each other because the penalty functions shrink them together.
We can further improve these estimates by running an ordinary robust regression without any penalty functions, given the parameter structure estimated above. We can perform this post-selection estimation over the whole solution plane. (Since there are both lam1 and lam2, the regular solution path becomes a solution plane.)
mu_further_improve_mat <- matrix(0, nrow = lam1_len * lam2_len, ncol = n)
beta_further_improve_mat <- matrix(0, nrow = lam1_len * lam2_len, ncol = p)
bic_vec <- rep(0, lam1_len * lam2_len)
max_bic <- max(res$bic_mat)
for(i in 1 : (lam1_len * lam2_len)){
  tmp <- RSAVS_Further_Improve(y_vec = y_vec, x_mat = x_mat,
                               mu_vec = res$mu_improve_mat[i, ],
                               beta_vec = res$w_mat[i, ])
  mu_new <- as.vector(tmp$mu_vec)
  beta_new <- tmp$beta_vec
  mu_further_improve_mat[i, ] <- mu_new
  beta_further_improve_mat[i, ] <- beta_new
  bic_vec[i] <- RSAVS:::RSAVS_Compute_BIC(y_vec, x_mat, beta_new, mu_new,
                                          loss_type = "1", phi = phi)
  # update bic according to the complexity upper bound
  current_group_num <- length(unique(mu_new))
  current_active_num <- sum(beta_new != 0)
  if((current_active_num + current_group_num) >= 11){
    bic_vec[i] <- max_bic
  }
}
best_id <- which.min(bic_vec)
if(best_id == 1){    # skip the first grid point if it happens to be selected
  best_id <- which.min(bic_vec[-1]) + 1
}
mu_new <- mu_further_improve_mat[best_id, ]
beta_new <- beta_further_improve_mat[best_id, ]
names(beta_new) <- colnames(x_mat)
After this further-improvement procedure, the estimated covariate effects are
beta_new
and the identified subgroups are
table(mu_new)
We can visualize the estimated subgroups. First we just need to prepare a data.frame for ggplot.
Mu_to_ID <- function(mu_vec, mu_unique){
  # Convert the original mu vector to a vector of the same length,
  # with each entry replaced by its index in unique(mu_vec),
  # or in the provided mu_unique
  if(missing(mu_unique)){
    mu_unique <- unique(mu_vec)
  }
  mu_vec <- sapply(mu_vec, FUN = function(x, unique_vec){
    return(which(unique_vec == x))
  }, unique_vec = mu_unique)
  return(mu_vec)
}

rsavs_res <- data.frame(target = y_vec,
                        fitted = mu_new + x_mat %*% beta_new,
                        residual = y_vec - mu_new - x_mat %*% beta_new,
                        mu_new = mu_new,
                        mu_id = as.factor(Mu_to_ID(mu_new, unique(mu_new))),
                        course = c(rep("por", nrow(por_df)), rep("mat", nrow(mat_df)))[-test_idx])
Now we can show the density plot of G3 given the estimated latent subgroups, with the dashed lines marking the median of each group.
rsavs_res %>%
  group_by(mu_id) %>%
  mutate(subgroup_median = median(fitted)) %>%
  ggplot(aes(target, fill = mu_id, color = mu_id, group = mu_id)) +
  geom_density(alpha = 0.2) +
  geom_vline(aes(xintercept = subgroup_median, color = mu_id), linetype = "dashed") +
  xlab("Response") +
  labs(color = "Subgroup", fill = "Subgroup")
Based on the results from the training set, we can propose a model of the form
$$
y_i = \alpha + \delta d_i + \text{Medu}_i \times \beta_1 + \text{Age}_i \times \beta_2 + \varepsilon_i
$$
where $y_i$ refers to the G3
of the i-th subject and $d_i = 0$ if the i-th subject belongs to the subgroup 1 and otherwise $d_i = 0$. $\delta$ is the difference between these 2 subgroups.
We use the regular K-means method to partition the dataset $D_2$ into two clusters with respect to the adjusted response, i.e., the response with the covariate effects removed, using the coefficients estimated by our method on the training dataset $D_1$:
# residual_test is the adjusted response with the covariate effects removed
residual_test <- y_vec_test - x_mat_test %*% beta_new
# cluster residual_test into 2 groups with regular k-means
tmp_kmeans <- stats::kmeans(residual_test, centers = 2)
group_indicator <- tmp_kmeans$cluster - 1
# form the simplified covariate matrix with only the active covariates
x_mat_test_active <- x_mat_test[, which(beta_new != 0), drop = FALSE]
We can make inference about the parameters of this model with a simplified robust regression:
rqfit <- quantreg::rq(y_vec_test ~ group_indicator + x_mat_test_active)
summary(rqfit, alpha = 0.05)
At the 95% confidence level, the results imply that younger students may outperform older ones, and higher G3 scores can also be expected for children whose mothers received higher levels of education.
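To see the interval estimates behind this statement, one can inspect the coefficient table of the rq summary directly (for a sample of this size, summary.rq reports rank-inversion confidence bounds by default):

# coefficient table with 95% confidence bounds
summary(rqfit, alpha = 0.05)$coefficients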