Nothing
#' Generalized linear model constraint on hierarchical structure
#' by using overlapped group penalty
#'
#' \code{smog} fits a linear non-penalized phynotype (demographic) variables such as
#' age, gender, treatment, etc, and penalized groups of prognostic effect (main effect)
#' and predictive effect (interaction effect), by satisfying the hierarchy structure:
#' if a predictive effect exists, its prognostic effect must be in the model. It can deal
#' with continuous, binomial or multinomial, and survival response variables, underlying
#' the assumption of Gaussian, binomial (multinomial), and Cox proportional hazard models,
#' respectively. It can accept \code{\link[stats]{formula}}, and output coefficients table,
#' fitted.values, and convergence information produced in the algorithm iterations.
#'
#' @param x a model matrix, or a data frame of dimensions n by p,
#' in which the columns represents the predictor variables.
#' @param y response variable, corresponds to the family description.
#' When family is ''gaussian'' or ''binomial'', \code{y} ought to
#' be a numeric vector of observations of length n; when family
#' is ''coxph'', \code{y} represents the survival objects, containing the
#' survival time and the censoring status. See \code{\link[survival]{Surv}}.
#' @param g a vector of group labels for the predictor variables.
#' @param v a vector of binary values, represents whether or not the
#' predictor variables are penalized. Note that 1 indicates
#' penalization and 0 for not penalization.
#' @param label a character vector, represents the type of predictors in terms of treatment,
#' prognostic, and predictive effects by using ''t'', ''prog'', and ''pred'',
#' respectively.
#' @param lambda1 penalty parameter for the L2 norm of each group of prognostic and predictive effects.
#' @param lambda2 ridge penalty parameter for the squared L2 norm of each group of prognostic and predictive effects.
#' @param lambda3 penalty parameter for the L1 norm of predictive effects.
#' @param family a description of the distribution family for the response
#' variable variable. For continuous response variable,
#' family is ''gaussian''; for multinomial or binary response
#' variable, family is ''binomial''; for survival response
#' variable, family is ''coxph'', respectively.
#' @param subset an optional vector specifying a subset of observations to be
#' used in the model fitting. Default is \code{NULL}.
#' @param rho the penalty parameter used in the alternating direction method
#' of multipliers (ADMM) algorithm. Default is 10.
#' @param scale whether or not scale the design matrix. Default is \code{TRUE}.
#' @param eabs the absolute tolerance used in the ADMM algorithm. Default is 1e-3.
#' @param erel the reletive tolerance used in the ADMM algorithm. Default is 1e-3.
#' @param LL initial value for the Lipschitz continuous constant for
#' approximation to the objective function in the Majorization-
#' Minimization (MM) (or iterative shrinkage-thresholding algorithm
#' (ISTA)). Default is 1.
#' @param eta gradient stepsize for the backtrack line search for the Lipschitz
#' continuous constant. Default is 1.25.
#' @param maxitr the maximum iterations for convergence in the ADMM algorithm.
#' Default is 1000.
#' @param formula an object of class ''formula'': a symbolic description of the
#' model to be fitted. Should not include the intercept.
#' @param data an optional data frame, containing the variables in the model.
#' @param ... other relevant arguments that can be supplied to smog.
#'
#' @return \code{smog} returns an object of class inhering from ''smog''. The
#' generic accessor functions \code{coef}, \code{coefficients},
#' \code{fitted.value}, and \code{predict} can be used to extract
#' various useful features of the value returned by \code{smog}.
#' An object of ''smog'' is a list containing at least the following
#' components:
#' \subsection{}{
#' \describe{
#' \item{coefficients}{Data frame containing the nonzero predictor
#' variables' indexes, names, and estimates. When
#' family is ''binomial'', the estimates have K-1
#' columns, each column representing the weights for the
#' corresponding group. The last group behaves the
#' ''pivot''.}
#' \item{fitted.values}{The fitted mean values for the response variable,
#' for family is ''gaussian''. When family is
#' ''binomial", the fitted.values are the probabilies
#' for each class; when family is ''coxph'',
#' the fitted.values are risk scores.}
#' \item{residuals}{The residual is trivial for family = "gaussian".
#' For family = "binomial", Pearson residuals is returned; and
#' for family = "coxph", it yields deviance residuals, i.e.,
#' standardized martingale residuals.}
#' \item{model}{A list of estimates for the intercept, treatment effect,
#' and prognostic and predictive effects for the selectd
#' biomarkers.}
#' \item{weight}{The weight of predictors resulted from the penalty funciton,
#' is used to calculate the degrees of freedom.}
#' \item{DF}{the degrees of freedom. When family = ''gaussian'',
#' \eqn{DF = tr(x_{\lambda}'(x_{\lambda}'x_{\lambda}+W)x_{\lambda})}.
#' For other families, DF is approximated by \eqn{diag(1/(1+W))}.}
#' \item{criteria}{model selection criteria, including the correction Akaike's Information
#' Criterion (AIC), AIC, Bayesian Information Criterion (BIC), and the generalized
#' cross-validation score (GCV), respectively. See also \code{\link{cv.smog}}.}
#' \item{llikelihood}{the log-likelihood value for the converged model.}
#' \item{loglike}{the penalized log-likelihood values for each
#' iteration in the algorithm.}
#' \item{PrimalError}{the averged norms \eqn{||\beta-Z||/\sqrt{p}} for each iteration,
#' in the ADMM algorithm.}
#' \item{DualError}{the averaged norms \eqn{||Z^{t+1}-Z^{t}||/\sqrt{p}} for
#' each iteration, in the ADMM algorithm.}
#' \item{converge}{the number of iterations processed in the ADMM algorithm.}
#' \item{call}{the matched call.}
#' \item{formula}{the formula supplied.}
#' }
#' }
#'
#' @details The formula has the form \code{response ~ 0 + terms} where \code{terms} is
#' a series of predictor variables to be fitted for \code{response}. For \code{gaussian}
#' family, the response is a continuous vector. For \code{binomial} family,
#' the response is a factor vector, in which the last level denotes the ''pivot''.
#' For \code{coxph} family, the response is a \code{\link[survival]{Surv}}
#' object, containing the survival time and censoring status.
#'
#' @section Penalized regression model: The regression function contains the non-penalized predictor variables,
#' and many groups of prognostic and predictive terms, where in each group the prognostic term comes first,
#' followed by the predictive term.
#'
#' * `Penalty function`: Different hierachical structures within groups can result from adjusting
#' the penalty parameters in the penalty function:
#'
#' \deqn{\Omega(\mathbf{\beta}) = \lambda_1||\mathbf{\beta}|| + \lambda_2||\mathbf{\beta}||^2+\lambda_3|\beta_2|}
#'
#' Where \eqn{\mathbf{\beta}=(\beta_1,\beta_2)}. Note that \eqn{\beta_1} denotes the prognostic effect
#' (main effect), and \eqn{\beta_2} for the predictive effect (interactive effect), respectively.
#' When \eqn{\lambda_2 = 0} and \eqn{\lambda_3 = 0}, it indicates no structure within groups. When
#' \eqn{\lambda_2 \ne 0}, the penalty function honors the structure within groups such that:
#' predictive effect \eqn{\ne 0 \Longrightarrow} prognostic effect \eqn{\ne 0}.
#'
#' * `Tuning parameters`: \code{rho,eabs,erel,LL,eta} are the corresponding parameters used in the
#' itervative shrinkage-thresholding algorithm (ISTA) and the alternating direction method of
#' multipliers algorithm (ADMM).
#'
#'
#' @references \insertRef{ma2019structural}{smog}
#'
#' @examples
#'
#' n=100;p=20
#' set.seed(2018)
#' # generate design matrix x
#' s=10
#' x=matrix(0,n,1+2*p)
#' x[,1]=sample(c(0,1),n,replace = TRUE)
#' x[,seq(2,1+2*p,2)]=matrix(rnorm(n*p),n,p)
#' x[,seq(3,1+2*p,2)]=x[,seq(2,1+2*p,2)]*x[,1]
#'
#' g=c(p+1,rep(1:p,rep(2,p))) # groups
#' v=c(0,rep(1,2*p)) # penalization status
#' label=c("t",rep(c("prog","pred"),p)) # type of predictor variables
#'
#' # generate beta
#' beta=c(rnorm(13,0,2),rep(0,ncol(x)-13))
#' beta[c(2,4,7,9)]=0
#'
#' # generate y
#' data1=x%*%beta
#' noise1=rnorm(n)
#' snr1=as.numeric(sqrt(var(data1)/(s*var(noise1))))
#' y1=data1+snr1*noise1
#' lfit1=smog(x,y1,g,v,label,lambda1=8,lambda2=0,lambda3=8,family = "gaussian")
#'
#' ## generate binomial data
#' prob=exp(as.matrix(x)%*%as.matrix(beta))/(1+exp(as.matrix(x)%*%as.matrix(beta)))
#' y2=ifelse(prob<0.5,0,1)
#' lfit2=smog(x,y2,g,v,label,lambda1=0.03,lambda2=0,lambda3=0.03,family = "binomial")
#'
#' ## generate survival data
#' # Weibull latent event times
#' lambda = 0.01; rho = 1
#' V = runif(n)
#' Tlat = (- log(V) / (lambda*exp(x %*% beta)) )^(1/rho)
#' C = rexp(n, 0.001) ## censoring time
#' time = as.vector(pmin(Tlat, C))
#' status = as.numeric(Tlat <= C)
#' y3 = as.matrix(cbind(time = time, status = status))
#'
#' lfit3=smog(x,y3,g,v,label,lambda1=0.2,lambda2=0,lambda3=0.2,family = "coxph")
#'
#' @importFrom tidyr spread
#' @importFrom magrittr %>% %<>%
#' @importFrom stats model.matrix
#' @import dplyr
#'
#' @export
smog.default <- function(x, y, g, v, label, lambda1, lambda2, lambda3, family = "gaussian", subset = NULL, rho = 10,
scale = TRUE, eabs = 1e-3, erel = 1e-3, LL = 1, eta = 1.25, maxitr = 1000, ...){
lambda=c(lambda1,lambda2,lambda3)
hierarchy = ifelse(lambda2 == 0,
ifelse(lambda3 == 0, 0, 1),2)
if(!is.null(subset)){
x <- as.matrix(as.matrix(x)[subset,])
y <- as.matrix(as.matrix(y)[subset,])
}else{
x <- as.matrix(x)
y <- as.matrix(y)
}
g <- as.numeric(as.factor(g))
est <- glog(y,x,g,v,lambda,hierarchy,family,rho,
scale,eabs,erel,LL,eta,maxitr)
if(nrow(est$coefficients)){ # continue for some variables are selected
if(family == "gaussian"){
wx <- cbind(rep(1,nrow(x)),x)
est$fitted.value = as.vector(as.matrix(wx[,est$coefficients$Id+1])%*%
as.matrix(est$coefficients$Estimate))
est$residuals = as.vector(y - est$fitted.value)
if(!is.null(colnames(x))){
est$coefficients$Beta = c("Intercept",colnames(x))[est$coefficients$Id+1]
est$coefficients = est$coefficients[,c("Id","Beta","Estimate")]
}
}
if(family == "binomial"){
probTAB = exp(as.matrix(x[,est$coefficients$Id])%*%as.matrix(est$coefficients$Estimate))/
(1+exp(as.matrix(x[,est$coefficients$Id])%*%as.matrix(est$coefficients$Estimate)))
probTAB = cbind(1-rowSums(as.matrix(probTAB)),probTAB)
predClass = apply(probTAB,1,which.max)
predProb = apply(probTAB,1,max)
est$levels = sort(unique(y))
est$fitted.value = data.frame(Class = est$levels[predClass],
Prob = predProb)
# calculate the pearson residuals
y_mat = model.matrix( ~ 0 + as.factor(y)) %>% data.frame %>% as.matrix
if(length(est$levels) > 2){
est$residuals = rowSums((y_mat[,-1] - probTAB[,-1])/sqrt(probTAB[,-1]*(1 - probTAB[,-1])))
} else{
est$residuals = (y_mat[,-1] - probTAB[,-1])/sqrt(probTAB[,-1]*(1 - probTAB[,-1]))
}
if(!is.null(colnames(x))){
est$coefficients$Beta = colnames(x)[est$coefficients$Id]
est$coefficients = est$coefficients[,c(1,ncol(probTAB)+1,2:ncol(probTAB))]
}
}
if(family == "coxph"){
risk = as.vector(exp(as.matrix(x[,est$coefficients$Id]) %*% as.matrix(est$coefficients$Estimate)))
# calculate baseline hazards
y %<>% data.frame
y_names = colnames(y)
# sort the survival time
y_sum = cbind(y, risk) %>%
dplyr::arrange(get(y_names[1])) %>%
dplyr::filter(get(y_names[2]) != 0) %>%
dplyr::mutate(cumrisk = rev(cumsum(rev(risk))))
y_expected = sapply(y[,1], function(z) sum(1/y_sum$cumrisk[y_sum[,1] <= z]))*risk
martingale = y[,2] - y_expected
est$residuals = sign(martingale)*sqrt(-2*(martingale + y[,2]*log(y[,2] - martingale)))
est$fitted = risk
if(!is.null(colnames(x))){
est$coefficients$Beta = colnames(x)[est$coefficients$Id]
est$coefficients = est$coefficients[,c("Id","Beta","Estimate")]
}
}
}
# calculate the degrees of freedom
idx = est$coefficients$Id
est$model$intercept = ifelse(0 %in% idx,
est$coefficients$Estimate[idx==0], NA)
est$model$treatment = ifelse(1 %in% idx,
est$coefficients$Estimate[idx==1], NA)
GId = estimate = elabel = NULL
if(any(idx>1)){
est$model$biomarker = data.frame(GId = g[idx[idx>1]],
estimate = est$coefficients$Estimate[idx>1],
elabel = label[idx[idx>1]])
est$model$biomarker = tidyr::spread(est$model$biomarker,elabel,estimate)
est$model$biomarker = as.data.frame(apply(est$model$biomarker,1:2,
function(t) ifelse(is.na(t),0,t)))
if("prog" %in% colnames(est$model$biomarker) &
(!("pred" %in% colnames(est$model$biomarker)))){
est$model$biomarker$pred = 0
}
if("pred" %in% colnames(est$model$biomarker) &
(!("prog" %in% colnames(est$model$biomarker)))){
est$model$biomarker$prog = 0
}
est$weight = data.frame(t(apply(est$model$biomarker[,c("prog","pred")],1,
function(t) lambda[1]/sqrt(sum(t^2))+
c(0,ifelse(sign(t[2]),lambda[3]/abs(t[2]),0)))))
colnames(est$weight) = c("prog","pred")
if(!is.null(colnames(x))){
est$model$biomarker$marker = colnames(x)[label=="prog" & (1:ncol(x) %in% idx[idx>1])]
est$model$biomarker = est$model$biomarker[,c("GId","marker","prog","pred")]
est$weight = cbind(est$model$biomarker[,c("GId","marker")],est$weight)
}else{
est$model$biomarker = est$model$biomarker[,c("GId","prog","pred")]
est$weight = cbind(GId = est$model$biomarker$GId,est$weight)
}
Weight = sapply(idx[idx >= 1],function(t) ifelse(t == 1,0,
est$weight[est$weight$GId == g[t],label[t]]))
if(family == "gaussian"){
est$DF = sum(diag(solve(as.matrix(t(x[,idx[idx>=1]]))%*%as.matrix(x[,idx[idx >= 1]])+diag(Weight))
%*%as.matrix(t(x[,idx[idx>=1]]))%*%as.matrix(x[,idx[idx >= 1]])))
}else{
est$DF = sum(1/(1+Weight))
}
}else{
if(!is.null(colnames(x))){
est$model$biomarker = data.frame(GId = NA, marker = NA, prog = NA, pred = NA)
}else{
est$model$biomarker=data.frame(GId = NA, prog = NA, pred = NA)
}
est$weight = NULL
est$DF = 1
}
# AIC, BIC, GCV criteria by using whole data
n <- nrow(x)
k = est$DF
aic1 = tryCatch({n/2*log(abs(2*est$llikelihood)) + n/2*((1+k/n)/(1-k+2/n))},
error=function(e) NA)
aic2 = tryCatch({log(abs(2*est$llikelihood)/n) + 2*k/n}, error=function(e) NA)
bic = tryCatch({log(abs(2*est$llikelihood)/n) + log(n)*k/n}, error=function(e) NA)
gcv = tryCatch({abs(2*est$llikelihood)/(n*(1-k/n)^2)}, error=function(e) NA)
est$criteria = list(aic1,aic2,bic,gcv)
names(est$criteria) = c("cAIC","AIC","BIC","GCV")
est$call <- match.call()
class(est) <- "smog"
est
}
#' @rdname smog.default
#'
#' @seealso \code{\link{cv.smog}}, \code{\link{predict.smog}}, \code{\link{plot.smog}}.
#' @author Chong Ma, \email{chongma8903@@gmail.com}.
#'
#' @importFrom stats model.frame model.matrix model.response
#'
#' @export
smog.formula <- function(formula, data=list(), g, v, label, lambda1, lambda2, lambda3, ...){
mf <- model.frame(formula = formula, data = data)
x <- model.matrix(attr(mf,"terms"),data = mf)
y <- model.response(mf)
est <- smog.default(x,y,g,v,label,lambda1,lambda2,lambda3,...)
est$call <- match.call()
est$formula <- formula
est
}
#' predict method for objects of the class smog
#'
#' \code{predict.smog} can produce the prediction for user-given new data, based on the
#' provided fitted model (\code{object}) in the S3method of \code{smog}. If the \code{newdata} omitted,
#' it would output the prediction for the fitted model itself. The yielded result should
#' match with the family in the provided model. See \code{\link{smog}}.
#'
#' @param object a fitted object of class inheriting from smog.
#' @param newdata a data frame containing the predictor variables, which are
#' used to predict. If omitted, the fitted linear predictors
#' are used.
#' @param family a description of distribution family for which the response
#' variable is to be predicted.
#' @param ... additional arguments affecting the predictions produced.
#'
#' @details If \code{newdata = NULL}, the fitted.value based on the \code{object}
#' is used for the prediction. For family = "coxph", the returned prediction value
#' is the risk score.
#'
#' @return If \code{family} = "gaussian", a vector of prediction for the response is returned.
#' For \code{family} = "coxph", a vector of predicted risk score is returned.
#' When \code{family} = ''binomial'', it outputs a data frame containing the predicted
#' group labels and the corresponding probabilies.
#'
#' @seealso \code{\link{smog.default}}, \code{\link{smog.formula}}, \code{\link{cv.smog}}, \code{\link{plot.smog}}.
#' @author Chong Ma, \email{chongma8903@@gmail.com}.
#'
#' @references \insertRef{ma2019structural}{smog}
#'
#' @importFrom stats coef fitted
#'
#' @export
predict.smog <- function(object, newdata = NULL, family = "gaussian",...){
if(is.null(newdata)){
y <- fitted(object)
}else{
if(!is.null(object$formula)){
x = model.matrix(object$formula, newdata)
}else{
x = newdata
}
if(nrow(coef(object))){
if(family == "gaussian"){
wx = cbind(rep(1,nrow(x)),x)
y <- as.vector(as.matrix(wx[,coef(object)$Id+1]) %*% as.matrix(coef(object)$Estimate))
}
if(family == "binomial"){
probTAB = exp(as.matrix(x[,coef(object)$Id])%*%as.matrix(coef(object)$Estimate))/
(1+exp(as.matrix(x[,coef(object)$Id])%*%as.matrix(coef(object)$Estimate)))
probTAB = cbind(1-rowSums(probTAB),probTAB)
predClass = apply(probTAB,1,which.max)
predProb = apply(probTAB,1,max)
y = data.frame(Class = object$levels[predClass],
Prob = predProb)
}
if(family == "coxph"){
# risk score
y = round(exp(as.matrix(x[,coef(object)$Id])%*%as.matrix(coef(object)$Estimate)),2)
}
}else{
stop("the model is not appropriate for making predictions")
}
}
y
}
#' plot method for objects of `smog` class
#'
#' \code{plot.smog} can produce a panel of plots for the primal errors, dual errors,
#' and the penalized log-likelihood values, based on the provided fitted model
#' (\code{x}) in the S3method of \code{smog}.
#'
#' @param x a fitted object of class inheriting from smog.
#' @param type,xlab default line types and x axis labels for the panel of plots.
#' @param caption a list of y axes labels for the panel of plots.
#' @param ... additional arguments that could be supplied to \code{\link[graphics]{plot.default}}
#' and \code{\link[graphics]{par}}.
#'
#' @details For the panel of three plots, the \code{xlab} is ''iterations'' and the
#' \code{type} is ''l'', by default. The \code{ylab} are ''primal error'',
#' ''dual error'',''log-likelihood'', respectively. This panel of plots can
#' reflect the convergence performance for the algorithm used in \code{\link{smog}}.
#'
#' @seealso \code{\link[graphics]{par}}, \code{\link[graphics]{plot.default}},
#' \code{\link{predict.smog}}, \code{\link{smog.default}}, \code{\link{cv.smog}}.
#'
#' @author Chong Ma, \email{chongma8903@@gmail.com}.
#'
#' @references \insertRef{ma2019structural}{smog}
#'
#' @importFrom graphics plot par
#'
#' @export
plot.smog <- function(x,type = "l",xlab="iteration",caption=list("primal error","dual error","log-likelihood"),...){
op<-graphics::par(mfrow=c(2,2),mar=c(2.1,4.1,2.1,2.1),no.readonly = TRUE,...)
graphics::plot(x$PrimalError, type=type, xlab = xlab, ylab = caption[[1]],...)
graphics::plot(x$DualError, type=type, xlab = xlab, ylab = caption[[2]],...)
graphics::plot(x$loglike, type=type, xlab = xlab, ylab = caption[[3]],...)
on.exit(par(op))
}
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.