In chensyustc/SC19027: The R package for StatComp Course by 19027

Overview

SC19027 is a simple R package developed for the 'Statistical Computing' course. It includes a data generation function datageneration (generates a dataset of linear model) and a linear regression function for scaled MCP or SCAD scalescad_mcp (implements the scaled penalized methods with the MCP or SCAD penalty). The Rcpp function resultC provides the bias of estimated sigma and average model size. We can compare the results of two methods.

This package also includes all my homework answers.

The data generation

The datageneration function generates a dataset of linear model.

The source R code for datageneration is as follows:

datageneration<-function(n,p,r0){
  S <- matrix(0,p,p)+1*diag(p)
  for(i in 1:50){
    for(j in 1:50){
      if(i!=j) S[i,j]<-r0
    }
  }
  R <- chol(S) #t(R)%*%R=S
  X <- scale(matrix(rnorm(n*p),n,p)%*%R)*sqrt(n-1)/sqrt(n)
  attributes(X) <- NULL
  X <- matrix(X,n,p)
  true_beta<-c(rep(1/sqrt(3),3),rep(0,p-3))
  Y <- X%*%true_beta+rnorm(n)
  return(list(X=X, Y=c(Y), true_beta = true_beta))
}

The dataset generated by this function includes design matrix X,response Y and coefficient beta. X has independent and identically distributed Gaussian rows with marginal distribution N(0,1),and correlated first 50 columns with correlation coefficient r0. The first three coefficients are nonzero,1/sqrt(3). The model has a normal noise level.

Set n=200,p=2000,r0=0.5.

library(SC19027)
n<-200;p<-2000
r0<-0.5
data<-datageneration(n,p,r0)
X<-data$X
Y<-data$Y
beta<-data$true_beta
round(X[1:10,1:10],3)
round(Y[1:10],3)
round(beta[1:10],3)

The scaled MCP and SCAD methods

The scalescad_mcp function implements the scaled penalized methods with the MCP or SCAD penalty.This method is put forward by Sun, T. and Zhang, C.-H. (2012). Scaled sparse linear regression.Biometrika, 99:879–898. It chooses an equilibrium with a sparse regression method by iteratively estimating the noise level via the mean residual square and scaling the penalty in proportion to the estimated noise level. The iterative algorithm costs little beyond the computation of a path or grid of the sparse regression estimator for penalty levels above a proper threshold.

The source R code for scalescad_mcp is as follows:

scalescad_mcp<-function(X,y,lam0,method){
  X <- as.matrix(X)
  y <- as.numeric(y)
  nX=dim(X)[1]; pX=dim(X)[2]

  #calculation of gamma
  m1<-t(X)%*%X
  m<-m1-diag(diag(m1))
  gamma=2/(1-max(abs(m))/nX) 

  #iterative algorithm
  obj=picasso(X,y,method=method,gamma=gamma,intercept = FALSE)
  sigmaint=0.1; sigmanew=5; flag=0
  while(abs(sigmaint-sigmanew)>0.0001 & flag <= 100){
    flag=flag+1
    sigmaint=sigmanew; lam=lam0*sigmaint
    fit<-picasso(X,y,lambda = lam,method = method,gamma=gamma,intercept = FALSE)
    hy=as.numeric(X %*% fit$beta[,1])
    sigmanew=sqrt(mean((y-hy)^2))
  }
  hsigma=sigmanew; hlam=lam
  hbeta=picasso(X,y,lambda = lam,method = method,gamma=gamma,intercept = FALSE)$beta[,1]
  #results
  return(list(hsigma=hsigma,coefficients=hbeta))
}

The input needed is design matrix X,response y,initial penalty level lam0 and the penalty method we choose. The output values include the estimated noise level and the estimated coefficients. We can compare the estimated results with the true values of sigma and beta. Let's take an example.

library(SC19027)
n<-200;p<-2000
set.seed(123)
data<-datageneration(n,p,0.5)
X<-data$X
Y<-data$Y
lam0<-sqrt(2*log(p)/n)
mcp<-scalescad_mcp(X,Y,lam0,method="mcp")
sigma<-mcp$hsigma
beta<-mcp$coefficients
print(sigma)
print(beta[1:10])

Comparison

The Rcpp function resultC gives the bias of estimated sigma and average model size for the estimated linear model.

The source C++ code for resultC is as follows:

NumericVector resultC(double hsigma, NumericVector hbeta) {
      NumericVector out(2);
      out[0]=hsigma/1-1;
      int n=hbeta.size();
      int k=0;
      for(int i = 0; i < n; ++i) {
            if(hbeta[i]>0)   k=k+1;
       }
      out[1]=k;
     return out;
}

According to this function,we can compare two scaled methods.

n<-200;p<-2000
set.seed(124)
data<-datageneration(n,p,0)
X<-data$X
Y<-data$Y
lam0<-sqrt(2*log(p)/n)
mcp<-scalescad_mcp(X,Y,lam0,method="mcp")
scad<-scalescad_mcp(X,Y,lam0,method="scad")
r1<-resultC(mcp$hsigma,mcp$coefficients)
r2<-resultC(scad$hsigma,scad$coefficients)
a<-as.matrix(rbind(r1,r2),ncol=2)
colnames(a)<-c("bias","AMS")
rownames(a)<-c("scaled MCP","scaled SCAD")
a