library(knitr)
opts_chunk$set(echo=TRUE, fig.align='center', fig.width=4, fig.height=3,
               cache=TRUE, autodep=TRUE, cache.comments=FALSE,
               message=FALSE, warning=FALSE)
  1. Generate data from a linear model without an intercept. Let $X_{ij}$ and $\epsilon_i$ be iid standard Gaussian. Set $n=100$ and $p=30$. Set $\beta_j = 1$ for $j=1,\ldots,5$ and $\beta_j = 0$ for the rest.
library(glmnet)
set.seed(20181023)
n=100
p=30
X = matrix(rnorm(n*p), n)        # n x p design with iid N(0,1) entries
beta = c(rep(1,5), rep(0,p-5))   # first 5 coefficients equal 1, the rest are 0
Y = c(X %*% beta + rnorm(n))     # response from the no-intercept linear model
  1. Fit the LASSO to this data set. Compute CV, AIC, and BIC along the solution path. One can show that an unbiased estimator of the LASSO degrees of freedom is the number of selected covariates (Zou, Hastie, and Tibshirani, 2007). Use the "unknown variance" version of the criteria, based on $\log(RSS)$.
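Concretely, writing $\widehat{df}_\lambda$ for the number of nonzero coefficients at a given $\lambda$, the rescaled unknown-variance criteria implemented below are

$$\mathrm{AIC}(\lambda) = \log\left(\frac{RSS_\lambda}{n}\right) + \frac{2(\widehat{df}_\lambda + 1)}{n}, \qquad \mathrm{BIC}(\lambda) = \log\left(\frac{RSS_\lambda}{n}\right) + \frac{(\widehat{df}_\lambda + 1)\log n}{n},$$

where the $+1$ accounts for the estimated error variance.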
rss <- function(cvglmnetobj, Y, X){
  # mean squared residuals, RSS/n, at each lambda on the fitted path
  colMeans( (Y - predict(cvglmnetobj$glmnet.fit, newx=X))^2 )
}

aic <- function(cvglmnetobj, Y, X){
  like = rss(cvglmnetobj, Y, X)
  # df estimate: number of selected covariates at each lambda
  df = colSums(abs(cvglmnetobj$glmnet.fit$beta) > 0)
  log(like) + 2*(df+1)/length(Y)
}

bic <- function(cvglmnetobj, Y, X){
  n = length(Y)
  like = rss(cvglmnetobj, Y, X)
  df = colSums(abs(cvglmnetobj$glmnet.fit$beta) > 0)
  log(like) + log(n)*(df+1)/n  # BIC penalizes each parameter by log(n), not 2
}
out = cv.glmnet(X,Y, intercept=FALSE)
bestmodels = list(
  AIC = out$glmnet.fit$beta[,which.min(aic(out, Y, X))],
  BIC = out$glmnet.fit$beta[,which.min(bic(out, Y, X))],
  CV = c(coef(out, s='lambda.min')[-1,])  # drop the (zero) intercept row
)
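As a quick check (an optional aside, not part of the lab), one can compare the $\lambda$ value each criterion selects; a smaller $\lambda$ corresponds to a larger model.

# lambda chosen by each criterion (smaller lambda = larger model)
c(AIC = out$lambda[which.min(aic(out, Y, X))],
  BIC = out$lambda[which.min(bic(out, Y, X))],
  CV  = out$lambda.min)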
  1. For each model selection criterion, how many of the correct coefficients do you select? How many incorrect?
errs <- function(b, beta){
  TP = sum(abs(b) > 0 & abs(beta) > 0)  # true positives: selected and truly nonzero
  FP = sum(abs(b) > 0 & beta == 0)      # false positives: selected but truly zero
  c(TP, FP)
}
tab = sapply(bestmodels,errs,beta=beta)
rownames(tab) = c('true positives','false positives')
kable(tab)
  1. Repeat the same exercise, but with $p=300$.
library(glmnet)
set.seed(20181023)
n=100
p=300
X = matrix(rnorm(n*p),n)
beta = c(rep(1,5),rep(0,p-5))
Y = c(X %*% beta + rnorm(n))
out = cv.glmnet(X,Y, intercept=FALSE)
bestmodels = list(
  AIC = out$glmnet.fit$beta[,which.min(aic(out, Y, X))],
  BIC = out$glmnet.fit$beta[,which.min(bic(out, Y, X))],
  CV = c(coef(out, s='lambda.min')[-1,])
)
tab = sapply(bestmodels,errs,beta=beta)
rownames(tab) = c('true positives','false positives')
kable(tab)
  1. What do you notice?

There's not a huge difference in low dimensions: AIC and CV are essentially the same, picking slightly larger models than BIC. With $p=300$, AIC drastically over-selects (it chooses the largest model available on the computed path). BIC does quite well, while CV slightly over-selects. In all cases the true coefficients are included, since the signal-to-noise ratio is fairly large.
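To verify the over-selection numerically, one can count the nonzero coefficients each criterion retains in the $p=300$ fit (a minimal sketch using `bestmodels`, `out`, and `aic` as defined above; the true model has 5 nonzero coefficients), and check whether AIC's choice sits at the very end of the $\lambda$ path.

# number of selected covariates per criterion (truth: 5)
sapply(bestmodels, function(b) sum(abs(b) > 0))
# does AIC pick the last (largest) model on the path?
which.min(aic(out, Y, X)) == length(out$lambda)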


