knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(extendedglmnetGroup7)

Introduction

The extendedglmnetGroup7 package was written by Victoria Hamilton and Scott Nelson as part of a final project for AMS 597 at Stony Brook University in Spring 2022. The primary purpose of the package is to extend the glmnet package to include random lasso as described in Wang, S., Nan, B., Rosset, S., & Zhu, J. (2011). Random lasso. The Annals of Applied Statistics, 5(1), 468-485.

The extendedglmnetGroup7 package also includes functions for basic linear and logistic regression, which are not included in the glmnet package, as well as functions that by default perform ridge and lasso regression. In total, the extendedglmnetGroup7 package can fit the following types of models: linear regression, logistic regression, ridge regression, lasso regression, and random lasso regression. Each is described in its own section below.

Linear Regression

Linear regression models a linear relationship between a response variable and a set of predictor variables. These models take the form: $$y_i = \beta_0 + \beta_1x_{i1} + \cdots + \beta_px_{ip}+\epsilon_i$$ where $y_i$ is the $i$th response, $\beta_0,\beta_1,\cdots,\beta_p$ is the parameter vector with intercept $\beta_0$ and coefficients for the various predictor variables, $x_{i1},\cdots,x_{ip}$ are the observed predictor values for the $i$th data point, and $\epsilon_i$ is random noise. Given a set of observations and responses, the goal of linear regression is to find the coefficients that minimize an error term (often the sum of squared errors).
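
As a quick illustration with toy data (not part of the package), the least-squares coefficients can be computed directly from the normal equations and match what lm() returns:

set.seed(1)
x <- cbind(1, rnorm(20))          # design matrix: intercept column plus one predictor
y <- 2 + 3 * x[, 2] + rnorm(20)
solve(t(x) %*% x, t(x) %*% y)     # closed-form least-squares solution
coef(lm(y ~ x[, 2]))              # the same values from lm()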

A linear regression model can be fit using the function extended.lm(). This function is a wrapper for the base lm() function. Unlike lm(), extended.lm() takes two arguments and does not require a formula. X is an $n\times p$ matrix where $n$ is the number of observations and $p$ is the number of predictors. Y is a vector of $n$ continuous response values. The remaining lm() arguments are kept at their default values. For more information on lm() see the package documentation.

Example

First we create some data.

n <- 1000
# true coefficients
B0 <- 3
B1 <- 5
B2 <- -2
# predictors
X1 <- rnorm(n)
X2 <- rnorm(n)
# response with added noise
Y <- B0 + B1*X1 + B2*X2 + rnorm(n, 0, 10)
X <- as.matrix(cbind(X1 = X1, X2 = X2))

Next, we can fit the model using extended.lm() and check the estimated coefficients.

fit <- extended.lm(X,Y)
coef(fit)

As expected, the fitted coefficients are close to the true coefficients B0, B1, and B2.

Logistic Regression

Logistic regression is a type of generalized linear model for binary outcomes. Generalized linear models are characterized by a response distribution (binomial, in the case of logistic regression) and a link function $g$ which relates the linear predictor to the mean of the response distribution: $$ E(Y) = \mu = g^{-1}(X\beta) $$

In regression, we want to explain a response variable $Y$ as a function of predictors $X$. For logistic regression with a binary response we need to restrict $Y$ to be between 0 and 1. This is accomplished with the following model: $$ E(Y) = \mu = \frac{\exp(X\beta)}{1+\exp(X\beta)} $$

Finally, the link function is the logit function, defined as: $$ X\beta = \log\left(\frac{\mu}{1-\mu}\right) $$
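
As a quick aside (not part of the package), base R provides these two transformations directly, which can be used to verify the relationship above:

mu <- 0.8
qlogis(mu)           # logit: log(mu / (1 - mu))
plogis(qlogis(mu))   # inverse logit recovers mu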

A logistic regression model can be fit using the function extended.glm(). This function is a wrapper for the base glm() function. Unlike glm(), extended.glm() takes two arguments and does not require a formula. X is an $n\times p$ matrix where $n$ is the number of observations and $p$ is the number of predictors. Y is a vector of $n$ binary response values. The function sets family=binomial("logit"). The remaining glm() arguments are kept at their default values. For more information on glm() see the package documentation.

Example

We can use the same X and B values from the data created in the linear regression section. All we need to change is how Y is derived.

Z <- B0 + B1*X1 + B2*X2
pr <- exp(Z) / (1 + exp(Z))   # inverse logit gives success probabilities
Y2 <- rbinom(n, 1, pr)        # binary responses drawn from those probabilities

Next, we can fit the model using extended.glm() and check the estimated coefficients.

fit <- extended.glm(X,Y2)
coef(fit)

As expected, the fitted coefficients are close to the true coefficients B0, B1, and B2.

Ridge Regression

In linear regression, the coefficients are determined by minimizing the residual sum of squares. That is, it finds a vector $\hat{\beta}$ that minimizes: $$ RSS = \sum_{i=1}^n(y_i-\beta_0-\sum_{j=1}^p\beta_jx_{ij})^2 $$ Ridge regression extends this by adding a penalty term, controlled by a non-negative tuning parameter $\lambda$, on the squared non-intercept coefficients. Ridge regression therefore minimizes: $$ RSS + \lambda\sum_{j=1}^p\beta_j^2 $$ As $\lambda$ goes to 0, ridge regression becomes ordinary linear regression. As $\lambda$ goes to positive infinity, the $\hat{\beta}$ coefficients approach 0. The advantage of ridge regression is that it reduces variance, at the cost of slightly increasing bias in the model.
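
As a toy illustration (not from the package), a larger $\lambda$ shrinks the ridge coefficients toward zero:

set.seed(42)
x.toy <- matrix(rnorm(200 * 5), nrow = 200)
y.toy <- as.numeric(x.toy %*% c(3, -2, 1, 0, 0)) + rnorm(200)
fit.small <- glmnet::glmnet(x.toy, y.toy, alpha = 0, lambda = 0.01)
fit.large <- glmnet::glmnet(x.toy, y.toy, alpha = 0, lambda = 100)
cbind(small = coef(fit.small)[, 1], large = coef(fit.large)[, 1])   # compare shrinkage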

A ridge regression model can be fit using the function extended.ridge(). This function is a wrapper for the function glmnet::glmnet() with the argument alpha=0 hard coded in. The alpha parameter for glmnet::glmnet() indicates the type of penalty to be used in the model. 0 is the ridge penalty. Like extended.lm() and extended.glm(), extended.ridge() takes separate inputs for the predictors and responses. X is an $n\times p$ matrix where $n$ is the number of observations and $p$ is the number of predictors. Y is a vector of length $n$ of either binary or continuous response values.

extended.ridge() can also take a user-provided lambda value. By default this argument is set to NULL. If a lambda value is not provided, extended.ridge() automatically determines a value using glmnet::cv.glmnet() with alpha=0 and nfolds=5. This is the suggested method for users. The final argument to extended.ridge() is ytype, which indicates whether the response variable is continuous or binary. The possible values of this argument are "continuous" and "binary", with "continuous" as the default. These control the family argument of the glmnet::glmnet() function: if ytype="continuous" then family="gaussian", and if ytype="binary" then family="binomial". All other glmnet::glmnet() arguments are kept at their default values. For more information on glmnet::glmnet() see the package documentation or vignette.
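
A minimal sketch of this kind of automatic lambda selection (illustrative only, on assumed toy data; not the package's internal code) might look like:

set.seed(1)
x.demo <- matrix(rnorm(100 * 10), nrow = 100)
y.demo <- rnorm(100)
cv <- glmnet::cv.glmnet(x.demo, y.demo, alpha = 0, nfolds = 5, family = "gaussian")
cv$lambda.min   # lambda value chosen by 5-fold cross-validation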

Example

For this example we will use the leukemia gene data set from http://www.ams.sunysb.edu/~pfkuan/Teaching/AMS597/Data/leukemiaDataSet.txt. This is saved as part of the extendedglmnetGroup7 package as dat. The data contains 72 observations with a binary response and 3571 predictors.

X <- dat[,2:ncol(dat)]
Y <- dat[,1]

fit <- extended.ridge(X,Y,ytype="binary")
coef(fit)[1:21]

We can look at the intercept and the next 20 coefficients and see that they are all small, but nonzero.

Let's use the function create.regr.data() for a second example. We will make the data set the same size as the gene data set above. This time, we will also provide our own lambda value.

dd <- create.regr.data(72,3571)$Data
X2 <- dd[,1:(ncol(dd)-1)]
Y2 <- dd[,c(ncol(dd))]

fit <- extended.ridge(X2,Y2,lambda=100,ytype="continuous")
coef(fit)[1:21]

We once again look at the intercept and the next 20 coefficients and see that many are small, but nonzero.

Lasso Regression

Lasso regression is similar to ridge regression. The only difference is how the penalty term is calculated. In ridge regression, $\lambda$ is multiplied by the square of each $\beta$ coefficient, while in lasso regression, $\lambda$ is multiplied by the absolute value of each $\beta$ coefficient during the minimization process. Therefore, lasso regression minimizes: $$RSS + \lambda\sum_{j=1}^p|\beta_j|$$ This can cause some coefficients to shrink exactly to zero. Lasso regression can therefore be used as a form of predictor selection.
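
As a toy illustration (not from the package), with a moderate lambda the lasso fit typically sets the coefficients of irrelevant predictors exactly to zero:

set.seed(7)
x.toy <- matrix(rnorm(200 * 5), nrow = 200)
y.toy <- as.numeric(x.toy %*% c(3, -2, 1, 0, 0)) + rnorm(200)
fit.lasso <- glmnet::glmnet(x.toy, y.toy, alpha = 1, lambda = 0.5)
coef(fit.lasso)[, 1]   # the last two (truly irrelevant) coefficients are usually exactly 0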

A lasso regression model can be fit using the function extended.lasso(). This function is a wrapper for the function glmnet::glmnet() with the argument alpha=1 hard coded in. The alpha parameter for glmnet::glmnet() indicates the type of penalty to be used in the model. 1 is the lasso penalty. Like extended.lm() and extended.glm(), extended.lasso() takes separate inputs for the predictors and responses. X is an $n\times p$ matrix where $n$ is the number of observations and $p$ is the number of predictors. Y is a vector of length $n$ of either binary or continuous response values.

extended.lasso() can also take a user-provided lambda value. By default this argument is set to NULL. If a lambda value is not provided, extended.lasso() automatically determines a value using glmnet::cv.glmnet() with alpha=1 and nfolds=5. This is the suggested method for users. The final argument to extended.lasso() is ytype, which indicates whether the response variable is continuous or binary. The possible values of this argument are "continuous" and "binary", with "continuous" as the default. These control the family argument of the glmnet::glmnet() function: if ytype="continuous" then family="gaussian", and if ytype="binary" then family="binomial". All other glmnet::glmnet() arguments are kept at their default values. For more information on glmnet::glmnet() see the package documentation or vignette.

Example

For this example we will again use the leukemia gene data set from http://www.ams.sunysb.edu/~pfkuan/Teaching/AMS597/Data/leukemiaDataSet.txt. This is saved as part of the extendedglmnetGroup7 package as dat. The data contains 72 observations with a binary response and 3571 predictors.

X <- dat[,2:ncol(dat)]
Y <- dat[,1]

fit <- extended.lasso(X,Y,ytype="binary")
nonzero.c <- length(coef(fit)[coef(fit) != 0])
paste("There are",nonzero.c-1,"nonzero coefficients")

We can look to see how many coefficients are nonzero. Notice that with this method we drastically reduce the number of predictors that affect the outcome.

Let's use the function create.regr.data() for a second example. We will make the data set the same size as the gene data set above. This time, we will also provide our own lambda value.

dd <- create.regr.data(72,3571)$Data
X2 <- dd[,1:(ncol(dd)-1)]
Y2 <- dd[,c(ncol(dd))]

fit <- extended.lasso(X2,Y2,lambda=100,ytype="continuous")
nonzero.c <- length(coef(fit)[coef(fit) != 0])
paste("There are",nonzero.c-1,"nonzero coefficients")

We once again see that this method drastically reduces the number of predictors that affect the outcome.

Random Lasso Regression

Random lasso is an extension of lasso regression with two steps. In step one, lasso regression is performed on bootstrap samples, each time using a randomly selected subset of the predictors. This is done to create an importance measure for each predictor. In step two, lasso regression is once again performed on bootstrap samples with randomly selected subsets of predictors, but this time the probability of a predictor being chosen is weighted by its importance measure from step one. In regular lasso, if two or more predictors are heavily correlated, typically only one of them is selected. This is less likely to happen with random lasso.

extendedglmnetGroup7 has a custom implementation of random lasso regression based on the algorithm in Wang et al. (2011). The implementation uses glmnet::glmnet(alpha=1) for the lasso regression. The algorithm assumes numerical predictors that have been mean centered and therefore fits an intercept-free model. By default glmnet::glmnet() standardizes the predictors with the argument standardize=TRUE. An extra argument intercept=FALSE was included in the implementation to ensure that it fits an intercept-free model.
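
For readers who want to see the overall shape of the two-step procedure, the following is a condensed, illustrative sketch (a hypothetical helper with a fixed lambda and a continuous response only; the package's extended.randomlasso() described below is the actual implementation):

random.lasso.sketch <- function(X, Y, B = 50, q1 = nrow(X), q2 = nrow(X), lambda = 0.1) {
  X <- scale(as.matrix(X), center = TRUE, scale = FALSE)   # mean-center predictors
  n <- nrow(X); p <- ncol(X)
  # Step 1: bootstrap samples + random predictor subsets -> importance measures
  step1 <- matrix(0, nrow = B, ncol = p)
  for (b in 1:B) {
    idx  <- sample(n, n, replace = TRUE)   # bootstrap sample
    pred <- sample(p, q1)                  # random subset of q1 predictors
    fit  <- glmnet::glmnet(X[idx, pred, drop = FALSE], Y[idx],
                           alpha = 1, lambda = lambda, intercept = FALSE)
    step1[b, pred] <- coef(fit)[-1, 1]
  }
  importance <- abs(colMeans(step1))
  # Step 2: resample, choosing predictors with probability proportional to importance
  step2 <- matrix(0, nrow = B, ncol = p)
  for (b in 1:B) {
    idx  <- sample(n, n, replace = TRUE)
    pred <- sample(p, q2, prob = importance)
    fit  <- glmnet::glmnet(X[idx, pred, drop = FALSE], Y[idx],
                           alpha = 1, lambda = lambda, intercept = FALSE)
    step2[b, pred] <- coef(fit)[-1, 1]
  }
  colMeans(step2)   # final coefficients: average over the step-2 bootstrap fits
}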

Random lasso is implemented with the function extended.randomlasso(). The function takes the following arguments:

- X is an $n\times p$ matrix where $n$ is the number of observations and $p$ is the number of predictors.
- Y is a vector of length $n$ of either binary or continuous response values.
- lambda is a non-negative tuning parameter. The user may provide their own value, but it defaults to NULL. If the user does not provide a lambda value, glmnet::cv.glmnet() is used to find the best value using cross-validation.
- B is the number of bootstrap samples to use in steps 1 and 2 of the algorithm. It defaults to 200 since Wang et al. (p. 474) found this number to be sufficient for their purposes.
- q1 is the number of random predictors to sample in step 1 of the algorithm. It defaults to length(Y), which is equal to the number of observations $n$. This is the largest value that q1 can take. If the user provides a value larger than $n$, the function returns an error.
- q2 is the number of random predictors to sample in step 2 of the algorithm. It defaults to length(Y), which is equal to the number of observations $n$. This is the largest value that q2 can take. If the user provides a value larger than $n$, the function returns an error. q1 does not need to equal q2.
- ytype defines the type of response variable. The two options are "binary" and "continuous". It defaults to "continuous". If the user does not provide one of the two expected values, the function returns an error.

The function returns a list whose element B.coefs contains a coefficient value for each predictor.

Example

We start with the leukemia data set. No lambda value is provided and q1 and q2 remain the default values.

X <- dat[,2:ncol(dat)]
Y <- dat[,1]

fit <- extended.randomlasso(X,Y,ytype="binary")
nonzero.c <- length(fit$B.coefs[fit$B.coefs != 0])
paste("There are",nonzero.c,"nonzero coefficients")

The number of nonzero coefficients is once again much smaller than the number of predictors.

We return to the randomly generated data. This time we provide a q1 and q2 value.

dd <- create.regr.data(72,3571)$Data
X2 <- dd[,1:(ncol(dd)-1)]
Y2 <- dd[,c(ncol(dd))]

fit <- extended.randomlasso(X2,Y2,q1=10,q2=5,ytype="continuous")
nonzero.c <- length(fit$B.coefs[fit$B.coefs != 0])
paste("There are",nonzero.c,"nonzero coefficients")

Here we see that the smaller q values also influence the number of nonzero coefficients.

Auxiliary functions and data

Gene Data

One data set is included in the package. This is a leukemia gene data set taken from http://www.ams.sunysb.edu/~pfkuan/Teaching/AMS597/Data/leukemiaDataSet.txt. This data set is useful for lasso regression because the number of predictors (3571) is much larger than the number of observations (72). The response variable is binary, with the responses being ALL and AML.

This data is saved as a .rda file and is named dat. It is loaded automatically with the package.

Random Data Generator

The extendedglmnetGroup7 package also includes a function create.regr.data() that automatically creates random regression data. This can be useful for testing how different types of data are handled by different types of regression. The following parameters are included:

- n, the number of desired observations
- p, the number of desired predictors
- ytype, whether the response variable should be binary or continuous. It defaults to continuous.
- include.categorical, a boolean variable deciding whether or not to include categorical predictors

It returns a list with the true beta values used in data creation (True_Betas) and the data itself (Data).

Example

The following examples show different data generation settings: 1) n > p, continuous response, no categorical predictors; 2) n < p, binary response, categorical predictors; 3) n = p, continuous response, categorical predictors.

d1 <- create.regr.data(1000,2)$Data
d2 <- create.regr.data(50,100,ytype="binary",include.categorical = TRUE)$Data
d3 <- create.regr.data(100,100,include.categorical = TRUE)$Data

Train/Test Split

The extendedglmnetGroup7 package also includes a function train_test_split() that automatically splits a data set into a training and testing set. This can be useful for testing how well a regression model generalizes. The following parameters are included:

- data, a data frame or matrix containing both the predictor values and the response
- test_size, a real value in (0,1) that determines the size of the test set as a proportion of the data. Default is 0.2.

It returns a list of Train_data and Test_data.

Example

df <- create.regr.data(100,4)$Data
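
A split with the default test size of 0.2 might then look like this (using the Train_data and Test_data names described above):

split <- train_test_split(df, test_size = 0.2)
train <- split$Train_data
test  <- split$Test_data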

