pmlr:

Documented in pmlr

#' Penalized maximum likelihood estimation for multinomial logistic regression using the Jeffreys prior
#' @aliases pmlr
#' @aliases print.pmlr
#'
#' @description Extends the approach proposed by Firth (1993) for bias reduction of MLEs
#' in exponential family models to the multinomial logistic regression model with general
#' covariate types.  Modification of the logistic regression score function to remove
#' first-order bias is equivalent to penalizing the likelihood by the Jeffreys prior,
#' and yields penalized maximum likelihood estimates (PLEs) that always exist.
#' Constrained hypothesis testing is conducted via likelihood ratio statistics.
#' Profile or Wald confidence intervals (CI) are constructed for the PLEs.
#'
#' @param formula an object of class \code{\link{formula}} (or one that can be coerced to that class):
#' a symbolic description of the model to be fitted.
#' Typically, \code{formula} has the form \code{response ~ terms} where \code{response} is either a factor
#' with \eqn{J+1} levels (the first is used as the baseline category) or a \eqn{J}-column indicator matrix,
#' and \code{terms} is a series of terms which specifies a linear predictor for the response.
#' @param data a data frame containing the variables in the model
#' @param weights an optional vector of weights to be used in the fitting process.
#' Should be a numeric vector of counts in the case of frequency data.
#' @param penalized a logical variable indicating whether penalized maximum
#' likelihood should be used (default)
#' @param method a character string specifying whether p-values and confidence intervals should be based
#' on the profile likelihood, score or the Wald method. Must be one of \dQuote{likelihood} (default),
#' \dQuote{wald}, \dQuote{score} or \dQuote{none}. Note that the \dQuote{none} option provides parameter
#' estimates and Wald standard errors only, without confidence intervals or hypothesis tests.
#' This option may be useful for diagnostic purposes. In addition, the \dQuote{score} option provides
#' hypothesis tests only due to the rarity of score based confidence limits in practice.
#' @param joint a logical variable indicating whether joint hypothesis tests should be performed
#' in addition to individual parameter tests.
#' If \code{TRUE}, the following three (constrained) hypotheses are tested additionally
#' for covariates \eqn{p = 1, \ldots, P}{p = 1, ..., P}:
#' \deqn{H_{0}: \beta_{p1} = \cdots = \beta_{pJ} = 0}{H[0]: \beta[p1] = ... = \beta[pJ] = 0;}
#' \deqn{H_0: \beta_{p1} = \cdots = \beta_{pJ}}{H[0]: \beta[p1] = ... = \beta[pJ];}
#' \deqn{H_0: b_{p,J} = J*b_{p,1}, b_{p,J-1} = (J-1)*b_{p,1}, ... , b_{p,2} = 2*b_{p,1}}{H[0]: b[p,J] = J b[p,1], b[p,J-1] = (J-1) b[p,1], ... , b[p,2] = 2 b[p,1],}
#' where \eqn{b_{p,j}}{b[p,j]} is equivalent to \eqn{\beta_{pj}}{\beta[pj]}.
#' This is meaningful only when the number \eqn{J+1} of response categories is greater than 2.
#' Thus, an error will occur if this is set to TRUE and binomial data is entered into the function.
#' @param CI.calculate a logical variable whether the profile (if \code{method="likelihood"}) or Wald (if \code{method="wald"})
#' confidence limits should be computed. Default is \code{FALSE}.
#' @param CI.alpha the significance level for the profile or Wald confidence intervals (default is 0.05).
#' Meaningful only if \code{CI.calcuate=TRUE}.
#' @param tol The tolerance for convergence can be specified for diagnostic purposes.
#' The default is 1e-6.  Note that this does not correspond (exactly) to the precision in each estimate,
#' as the tolerance is defined as the sum of the absolute value of the elements of the `step'
#' \eqn{=A^{-1}_{(t)}U(B^{*}_{(t)})}{=(A^-1[(t)])U(B^*[(t)])}, where \eqn{A^-1 [(t)]} is inverse of the Fisher's information matrix,
#' and \eqn{B^{*}_{(t)}}{B^* [(t)]} is the vector of the penalized coefficient estimates, both evaluated at the \eqn{t}-th iteration
#' in the modified Newton-Raphson algorithm.
#' Nonetheless, higher tolerance values may reduce computation times and facilitate convergence in some cases.
#' @param verbose Logical: TRUE displays parameter values after each iteration and states when completion of each routine has occurred.
#' Default is \code{FALSE}.
#'
#' @details
#' Logistic regression is one of the most widely used regression models in practice,
#' but alternatives to conventional maximum likelihood estimation methods may be more
#' appropriate for small or sparse samples.
#' It is well known that the usual maximum likelihood estimates (MLEs) of the log odds ratios
#' are biased in finite samples, and there is a non-zero probability that an MLE is infinite
#' (i.e., does not exist).
#' This corresponds to the problem of separation (Lesaffre and Albert, 1989).
#'
#' In this package, we extend the approach proposed by Firth (1993)
#' for bias reduction of MLEs in exponential family models to the multinomial logistic regression
#' model with general covariate types.  Modification of the logistic regression score function to
#' remove first order bias is equivalent to penalizing the likelihood by the Jeffreys prior,
#' and yields penalized likelihood estimates (PLEs) that always exist, even in samples
#' in which MLEs are infinite.  PLEs are an attractive alternative in small to moderate-sized
#' samples, and are preferred to exact conditional MLEs when there are continuous covariates.
#'
#' We consider a multicategory outcome \eqn{y} that is a multinomial variable with \eqn{J + 1}
#' categories.  For each category \eqn{j (j = 1, \ldots, J)} there is a regression function
#' in which the log odds of response in category \eqn{j}, relative to category 0, is a linear
#' function of regression parameters and a vector \eqn{\bold{x}}{x} of \eqn{P} covariates
#' (including a constant):
#' \eqn{\log\{\mathrm{prob}(y = j| \bold{x})/\mathrm{prob}(y = 0 | \bold{x})\} = \bold{\beta}_{j}^T \bold{x}}{log{prob(y = j | x)/prob(y = 0 | x)} = \beta[j]^T x}.
#' Let \eqn{\bold{y}_i}{y[i]} be a \eqn{J \times 1}{Jx1} vector of indicators for the observed response
#' category for observation \eqn{i}, with the corresponding \eqn{J \times 1}{Jx1}
#' vector of probabilities \eqn{\bold{\Theta}_{j} = (\Theta_{1j}, \ldots, \Theta_{Pj})^T}{\Theta[j] = (\Theta[1j], ..., \Theta[Pj])^T}.
#' The vector of MLEs, \eqn{\hat{B} = vec[(\bold{\hat{\beta}}_1, \ldots, \bold{\hat{\beta}}_J)^T]}{\hat{B} = vec[(\hat{\beta}[1], ..., \hat{\beta}[J])^T]},
#' is estimated from observations \eqn{(\bold{y}_i,\bold{x}_i), i = 1, \ldots, n}{(y[i],x[i]), i = 1, ..., n},
#' by solving the score equations of the log-likelihood \eqn{l(B)}.
#' We denote the score function by \eqn{U(B)} and the \eqn{PJ \times PJ}{PJxPJ}
#' Fisher information matrix by \eqn{A = A(B)}.
#'
#' The order \eqn{n^{-1}}{n^-1} bias of estimates based on the usual likelihood \eqn{L(B)}
#' is removed by applying the penalty \eqn{|A|^{1/2}}, and basing estimation on the
#' penalized likelihood \eqn{L^*(B) = L(B) |A|^{1/2}}.
#' The vector \eqn{\hat{B}^*} of penalized estimates (PLEs) is the solution to
#' the score equations of the penalized log-likelihood \eqn{l^*(B) = l(B) + \frac{1}{2} \log{|A|}}{l^*(B) = l(B) + 1/2 log{|A|}}.
#' The introduction of bias into the score function through the penalty removes the leading
#' term in the asymptotis bias of the MLEs.
#' The modified score function proposed by Firth for the binomial logistic model extends directly
#' to the multinomial model as \eqn{U^*(B) = U(B) - A b(B)}.
#' The bias term \eqn{b(B)} is the leading term in the asymptotic bias of the multinomial MLEs,
#' obtained from the Taylor series expansion of the log-likelihood (Cox and Snell, 1968)
#' and is a function of the matrix of third derivatives of \eqn{l(B)} with respect to \eqn{B}
#' (Bull et al., 2002).
#'
#' The PLEs are obtained by a modified Fisher scoring algorithm.  Using \eqn{t} to
#' denote the iteration number, the modified iterative updating equations are:
#'  \deqn{B^*_{(t+1)} = B^*_{(t)} + A^{-1}_{(t)}U^*(B^*_{(t)}) = B^*_{(t)} + b(B^*_{(t)}) + A^{-1}_{(t)}U(B^*_{(t)})}{B^* [(t+1)] = B^* [(t)] + A^-1 [(t)] U^*(B^* [(t)]) = B^*_[(t)] + b(B^* [(t)]) + A^-1 [(t)]U(B^* [(t)])}
#'  Thus, in comparison to the usual updating equation used to obtain the MLEs,
#'  the score function modification operates by applying the asymptotic bias correction at each step in the iterative process.
#'  This prevents estimates from going off to infinity and failing to converge when there is separation in the data.
#'
#'  Symmetric Wald-type CIs for \eqn{\beta_{jp}}{\beta[jp]} can be constructed using \eqn{Var^{* 1/2}(\hat{\beta}^*_{jp})}{Var^{* 1/2}(hat{\beta}^* [jp])},
#'  for \eqn{p = 1, ..., P} and \eqn{j = 1, ..., J}, obtained from the inverse of \eqn{A^*};
#'  however, performance is expected to be poor in situations, where separation is likely to occur.
#'  Asymmetric CIs for the PLEs can be constructed from the profile log-likelihood for
#'  \eqn{\beta_{jp}}{\beta[jp]}, which is the function \eqn{l^*_0(B(s))}{l^*[0](B(s))}, where \eqn{B(s)}
#'  is the argument that maximizes \eqn{l^*} under the single-parameter constraint
#'  \eqn{H_0: \beta_{jp} = s}{H[0]: \beta[jp] = s}.  The \eqn{100(1 - \alpha)\%}{100(1 - \alpha)\%} CI for \eqn{\beta_{jp}}{\beta[jp]}
#'  is given by all parameter values that are compatible with the data
#'  (i.e., all \eqn{s} such that the likelihood ratio statistic \eqn{LR_P(s) \le q}{LR[P](s) <= q},
#'  where \eqn{q} is the \eqn{1 - \alpha} percentile of the \eqn{\chi^2} distribution).
#'  This is equivalent to \eqn{l^*_0(B(s)) \ge l^*(\hat{B}^*) - \frac{1}{2} q}{l^*[0](B(s)) >= l^*(\hat{B}^*) - (1/2)q}.
#'  The endpoints of the interval are then found by numerically solving the equality for
#'  values of \eqn{s}.  Based on the algorithm employed in SAS PROC LOGISTIC for MLEs,
#'  our method for finding these roots does not require computing \eqn{l^*_0(B(s))}{l^*[0](B(s))},
#'  which in itself would involve maximizing \eqn{l^*(B)},
#'  but proceeds directly by solving the constrained optimization problem:
#'  maximize \eqn{l^*(B)} such that \eqn{l^*_(B) = l^*(\hat{B}^*) - \frac{1}{2} q}{l^*_(B) = l^*(\hat{B}^*) - (1/2)q} and \eqn{\beta_{jp} = s}{\beta_[jp] = s}.
#'  We, however, modify the starting values in the iterative scheme used by SAS to obtain a
#'  new algorithm that is slower, but simple and more robust (see Bull et al. 2007 for details).
#'
#' @return
#' \code{pmlr} returns an object of class \dQuote{\code{pmlr}}, a list with the following components:
#'   \item{call}{the matched call.}
#'   \item{method}{a character string indicating the method used to carry out hypothesis testing and/or confidence limit calcuation.}
#'   \item{separation}{an array indicating coefficients for which separation occured;
#'   \code{NA} returned in the absense of separation.}
#'   \item{converged}{a logical value indicating whehter the fitting algorithm judged to have converged.}
#'   \item{coefficients}{an array containing the coefficients of the \eqn{p} parameters for the \eqn{J} categories.}
#'   \item{var}{an array containing the variance-covariance matrices for the \eqn{J} categories. If \code{penalized=TRUE}, \code{var} is obtained based on \eqn{A^*}, the information matrix for the PLEs.}
#'   \item{CI.calculate}{a logical value whether the confidence limits were computed.}
#'   \item{CI.lower}{returned only if \code{CI.calculate=TRUE}: an array containing the lower confidence limits from the individual parameter tests.}
#'   \item{CI.upper}{returned only if \code{CI.calculate=TRUE}: an array containing the upper confidence limits from the individual parameter tests.}
#'   \item{statistic}{an array containing the test statistics from the individual parameter tests.}
#'   \item{pvalue}{an array containing the p-values from the individual parameter tests.}
#'   \item{logLik}{the value of the log-likelihood function for the fitted model (under no constraints).}
#'   \item{df}{the degrees of freedom, i.e., the number of estimated parameters in the model.}
#'   \item{converged.H0}{an array containing the logical values whether the fitting algorithm for the null model for each inividual parameter is jusdged to have converged.}
#'   \item{logLik.H0}{an array containing the value of the log-likelihood function for the fitted model under the null hypothesis for each individual parameter.}
#'   \item{joint}{a logical value indicating whether the joint hypothesis tests were performed.}
#'   \item{beta0all0}{When a joint likelihood test of all betas = 0 is called, the estimated betas are provided.  This is for information only and not displayed in the output. Not returned when \code{method="wald"}.}
#'   \item{var0all0}{When a joint likelihood test of all betas = 0 is called, the estimated variances are also provided.  This is for information only and not displayed in the output. Not returned when \code{method="wald"}.}
#'   \item{beta0allequal}{same as \code{beta0all0} except for the null hypothesis that all betas are equal.}
#'   \item{var0allequal}{same as var0all0 except for the null hypothesis that all betas are equal.}
#'   \item{beta0proportion}{same as beta0all0 except for the null hypothesis that all betas are proportional.}
#'   \item{var0proportion}{same as var0all0 except for the null hypothesis that all betas are proportional.}
#'   \item{logLik.C}{array contatining the values of log likelihood function under constraints: betas = 0; all betas are equal; and betas are proportional. This is for information only and not displayed in the output.}
#'   \item{joint.test.all0}{a list containing the following three components evaluated for the constrained hypothesis that all betas = 0.}
#'   \item{joint.test.all0$h0}{a character string describing the constrained hypothesis}
#'   \item{joint.test.all0$converged}{an array contatining whether the fitting algorithm achieved convergence under the constrained hypothesis for each term.}
#'   \item{joint.test.all0$test.h0}{an array contatining the test statistics and p-values from constrained hypothesis tests that all betas = 0.}
#'
#'   \item{joint.test.allequal}{same as \code{'joint.test.all0'} except for that all betas are equal, and for the additional component \code{'test.all0.vs.constraint'} (see below).}
#'   \item{joint.test.allequal$test.all0.vs.constraint}{a data frame containing the test statistics and p-values obtained from comparing the likelihoods for
#'   all betas are equal vs. all betas = 0.}
#'   \item{joint.test.proportion}{same as \code{'joint.test.allequal'} except for that the constrained hypothesis
#'   is that betas are proportional.}
#'
#' @note This implementation is not a true scoring or Newton-type algorithm because
#' it updates with the inverse of \eqn{A}, the Fisher information matrix for the MLEs,
#' rather than the information for the PLEs, \eqn{A^*}, which includes an additional term
#' corresponding to the second derivatives of the penalty: \eqn{\frac{1}{2} \log |A|}{(1/2)log(|A|)}.
#' As a result, convergence using the modified scoring algorithm for the PLEs is slower
#' than a scoring algorithm based on \eqn{A^*}, which converges at a quadratic rate.
#' For well behaved and larger datasets this usually means no more than 2 or 3 steps
#' beyond that required for the MLEs.  Starting values of \eqn{\beta{pj} = 0}{\beta[pj] = 0} are used
#' and are generally satisfactory.  For smaller datasets (i.e., less than 50 observations)
#' and especially datasets in which there are infinite MLEs, convergence is slower
#' and could take up to 35 or 40 iterations.  In datasets with separations, starting values
#' other than 0 can lead to divergence problems.
#'
#' The first argument to the \code{pmlr} function is a formula of the form \code{response ~ terms}
#' where \code{response} can be either a \eqn{J}-column indicator matrix or a factor with
#' \eqn{J + 1} levels.  In the case of frequency data (e.g., the \code{hepatitis} dataset)
#' the baseline category is determined by the \eqn{J}-column indicator matrix with baseline
#' category 0.  In the case of data with individual records (e.g., the \code{enzymes} dataset)
#' the baseline category is determined by the outcome, which is coded as a factor.
#' The first level (i.e., \code{levels(x)[1]} is taken to be the baseline.
#' Note that the default levels attribute of a factor is the unique set of values taken
#' \code{as.character(x)}, sorted into increasing order of \code{x}.
#' For example, the enzymes dataset has categories 1, 2, and 3 with baseline category 1.
#' The baseline category can be changed to category 2 via
#' \code{enzymes$Group <- factor(enzymes$Group, levels-c("2","1","3"))}.
#'
#' @references Bull, S. B., Greenwood, C. M. T., and Hauck, W. W. (1997)
#' Jacknife bias reduction for polychotomous logistic regression. \emph{Statistics in Medicine},
#' \bold{16}, 545--560; Bull, S. B. (1997) Correction.
#' \emph{Statistics in Medicine}, \bold{16}, 2928.
#'
#' Bull, S. B., Mak, C. and Greenwood, C. M. T. (2002) A modified score function estimator
#' for multinomial logistic regression in small samples.
#' \emph{Computational Statistics & Data Analysis}, \bold{39}, 57--74.
#'
#' Bull, S. B., Lewinger, J. P., Lee, S. S. F. (2005) Penalized maximum likelihood estimation for multinomial logistic regression using the Jeffreys prior.  \emph{Technical Report No. 0505}, Department of Statistics, University of Toronto.
#'
#' Bull, S. B., Lewinger, J. P. and Lee, S. S. F. (2007)
#' Confidence intervals for multinomial logistic regression in sparse data.
#' \emph{Statistics in Medicine}, \bold{26}, 903--918.
#'
#' Cox, D. R. and Snell, E. J. (1968)
#' A general definition of residuals.
#' \emph{Journal of the Royal Statistical Society, Series B}, \bold{30}, 248--275.
#'
#' Firth, D. (1993)
#' Bias reduction of maximum likelihood estimates.
#' \emph{Biometrika}, \bold{80}, 27--38.
#'
#' Lesaffre, E. and Albert, A. (1989)
#' Partial separation in logistic discrimination.
#' \emph{Journal of the Royal Statistical Society, Series B}, \bold{51}, 109--116.
#'
#' SAS Institute Inc. (1999)
#' \emph{SAS OnlineDoc, Version 8, The LOGISTIC Procedure},
#' Confidence Intervals for Parameters, Chapter 39, Section 26, Cary, NC.
#'
#' @examples
#' data(hepatitis)
#'
#' # As reported in Bull et al. (2007)
#' fit <- pmlr(cbind(HCV, nonABC) ~ group + time + group:time, data = hepatitis,
#' weights = counts, method="score")
#' summary(fit)
#'
#' data(enzymes)
#' # Exclude patients in Group 4 (post-necrotic cirrhosis)
#' enzymes <- enzymes[enzymes$Group != 4,]
#'
#' # Center and scale covariates
#' AST <- scale(log(enzymes$AST))
#' ALT <- scale(log(enzymes$ALT))
#' GLDH <- scale(log(enzymes$GLDH))
#' OCT <- scale(log(enzymes$OCT))
#' enzymes <- data.frame(Patient = enzymes$Patient,
#'                       Group = enzymes$Group, AST, ALT, GLDH, OCT)
#'
#' # Remove 10 observations to create separation
#' enzymes <- enzymes[-c(9, 18, 33, 58, 61, 77, 94, 97, 99, 100),]
#'
#' # Multinomial: Acute viral hepatitis and aggressive chronic hepatits
#' # vs. persistent chronic hepatitis
#' # Assign Group 2 (persistent chronic hepatitis) as baseline category
#' enzymes$Group <- factor(enzymes$Group, levels=c("2","1","3"))
#' fit <- pmlr(Group ~ AST + GLDH, data = enzymes, method="wald", CI.calculate=TRUE)
#' summary(fit)
#'
#' # Binomial: Acute viral hepatitis vs. persistent chronic hepatitis
#' # Exclude patients in Group 3 (agressive chronic hepatitis)
#' enzymes.1vs2 <- enzymes[enzymes$Group != 3,]
#' # Assign Group 2 (persistent chronic hepatitis) as baseline category
#' enzymes.1vs2$Group <- factor(enzymes.1vs2$Group, levels=c("2","1"))
#' fit <- pmlr(Group ~ AST + GLDH, data = enzymes.1vs2, method="none")
#' summary(fit)
#'
#' # Binomial: Aggressive chronic hepatitis vs. persistent chronic hepatitis
#' # Exclude patients in Group 1 (acute viral hepatitis)
#' enzymes.3vs2 <- enzymes[enzymes$Group != 1,]
#' # Assign Group 2 (persistent chronic hepatitis) as baseline category
#' enzymes.3vs2$Group <- factor(enzymes.3vs2$Group, levels=c("2","3"))
#' fit <- pmlr(Group ~ AST + GLDH, data = enzymes.3vs2, method="none")
#' summary(fit)
#'
#' @seealso \code{\link{summary.pmlr}}
#'
#' @keywords models
#' @keywords regression
#' @keywords htest
#' @export
#-------------------------------------------------

pmlr <- function(formula, data, weights = NULL, penalized = TRUE,
                 method = c("likelihood", "wald", "score", "none")[1], joint = FALSE,
                 CI.calculate = FALSE, CI.alpha = 0.05,
                 tol = 1e-6, verbose=FALSE) {

  if(!method %in% c("likelihood","score","wald","none")) stop("invalid method: please select a valid one")

  ret <- list()
  # change this!
  if (penalized) useAstar.wald <- TRUE
  else useAstar.wald <- FALSE

  useAstar.LRCI <- FALSE #check this

  # check the data and update the formula


  call <- match.call()
  mf <- match.call(expand.dots = FALSE)
  m <- match(c("formula", "data", "weights"), names(mf), 0)
  mf <- mf[c(1, m)]
  mf$drop.unused.levels <- TRUE
  mf[[1]] <- as.name("model.frame")
  mf <- eval(mf, parent.frame())
  mt <- attr(mf, "terms")

  y <- model.response(mf); if (is.factor(y)) y <- indicator.matrix(y)
  x <- model.matrix(mt, mf)
  J0 <- ncol(y)

  # check if there are no observations for
  # response matrix:
  exclude.cols.y <- exclude.cols.x <- NULL
  for(y.i in 1:ncol(y)){
    exclude.cols.y <- c(exclude.cols.y, var(y[,y.i],na.rm=T)==0)
  }
  if( any(exclude.cols.y) ) y <- y[,!exclude.cols.y,drop=FALSE]

  for(x.i in 1:ncol(x)){
    exclude.cols.x <- c(exclude.cols.x, var(x[,x.i],na.rm=T)==0)
  }
  if( any(exclude.cols.x) ) x <- x[,c(1,which(!exclude.cols.x)),drop=FALSE]


  wt <- as.vector(model.weights(mf)); if (is.null(wt)) wt <- rep(1, times = dim(y)[1])
  p <- ncol(x) # p = number of covariates
  J <- ncol(y) # J =  (response-categories - 1) = updated

  # removed
  #if( ((J0 >= J) & (J < 2)) && (joint) ) {
  if( (J < 2) && (joint) ) {
    warning("Sorry, it is not possible to perform joint hypothesis tests for binomial data.\n We will proceed without performing joint hypothesis tests (i.e., we set joint = FASLE)." )
    joint = FALSE
}

  if (penalized) {
    fit <- getPLEs(x, y, wt, tol=tol, verbose=verbose); B <- fit$B; B.inf <- NULL; Ainv <- fit$Ainv; Astarinv <- fit$Astarinv; l.max <- fit$lstar.max; fit.conv <- fit$conv
    if (verbose) { cat("GetPLEs Complete. \n") }
  } else {
    fit <- getMLEs(x, y, wt, tol=tol, verbose=verbose); B <- fit$B; B.inf <- fit$B.inf; Ainv <- fit$Ainv; l.max <- fit$l.max; fit.conv <- fit$conv; Astarinv <- NULL
    if (verbose) { cat("GetMLEs Complete. \n") }
  }

  separation <- array(data = NA, dim = c(1,p,J))
  dimnames(separation)<-list("", colnames(x), colnames(y))
  if (!penalized) {
    for (i in 1:J) {
      separation[1,,i] <- t(is.infinite(B.inf))[i,]
      separation[1,,i] <- ifelse(separation[1,,i], t(B.inf)[i,], NA)
    }
  }

  # parameter estimation
  coef <- array(data = NA, dim = c(1,p,J))
  dimnames(coef)<-list("", colnames(x), colnames(y))
  for (i in 1:J) coef[1,,i] <- t(B)[i,]
  # variance estimation
  var <- array(data = NA, dim = c(p,p,J))
  dimnames(var)<-list(colnames(x), colnames(x), colnames(y))
  for (i in 1:J) var[,,i] <- Ainv[seq(from = i, by = J, length = p), seq(from = i, by = J, length = p)]
  # profile-likleihood CI estimation
  if(CI.calculate) {
    CI.lower <- CI.upper <- array(data = NA, dim = c(1,p,J))
    dimnames(CI.lower) <- dimnames(CI.upper) <-list("", colnames(x), colnames(y))
  }
  # test
  stat <- pval <- array(data = NA, dim = c(1,p,J))
  dimnames(stat) <- dimnames(pval) <- list("", colnames(x), colnames(y))

  if (joint){
      ## JS Added 2015-12-15
      ## to create the following objects only when performing the joint test
      beta0all0 <- var0all0 <- beta0allequal <- var0allequal <- beta0proportion <- var0proportion <- array(data = NA, dim = c(1,p,J))
      dimnames(beta0all0) <- dimnames(var0all0) <- dimnames(beta0allequal) <- dimnames(var0allequal) <- dimnames(beta0proportion) <- dimnames(var0proportion) <-
          list("", colnames(x), colnames(y))
      ## stat.all0.vs.constraint and pval.all0.vs.constraint need not be evalueated
      ## under all0-constrianed hypothesis - "H_0: b_{i,1} = b_{i,2} = ... = b_{i,J} = 0"
      stat.all0.vs.constraint <- pval.all0.vs.constraint <- stat.joint <- pval.joint <- l0.joint <- array(data = NA, dim = c(1,p,3))
      dimnames(stat.all0.vs.constraint) <- dimnames(pval.all0.vs.constraint) <- dimnames(stat.joint) <- dimnames(pval.joint) <- dimnames(l0.joint) <-
          list("", colnames(x),
               c("H_0: b_{p,1} = b_{p,2} = ... = b_{p,J} = 0",
                 "H_0: b_{p,1} = b_{p,2} = ... = b_{p,J}",
                 "H_0: b_{p,J} = J*b_{p,1}, b_{p,J-1} = (J-1)*b_{p,1}, ... , b_{p,2} = 2*b_{p,1}"))

      stat.all0.vs.constraint <- pval.all0.vs.constraint <- pval.all0.vs.constraint[,,-1,drop=FALSE]
      test.h0.all0 <- test.h0.allequal <- test.h0.proportion <- list()
  }

  converged <- NA
  if (method == "likelihood") {
    ## Likelihood ratio test for H_0: b_{p,j} = 0 (p-th covariate, jth category)
    testRun <- test.LR(x, y, wt, mt, B, h0 = 1, penalized, tol = tol, verbose=verbose);
    if( !any(testRun$conv) )
      warning("test.LR: algorithm did not converge\n")
    if (verbose) { cat("test.LR Complete. \n") }
    for (i in 1:J) {
      stat[1,,i] <- t(testRun$statistic)[i,]
      pval[1,,i] <- t(testRun$pvalue)[i,]
    }
    test.unconstrained <- list()
    test.unconstrained$statistic <- stat
    test.unconstrained$pvalue <- pval
    test.unconstrained$logLik.H0 <- NA
    test.unconstrained$converged <- testRun$conv
    if(J==1){
      #if J > 2 - individual l0 is rather meaningless
      #cat("J==1 \n")
      l0 <- array(data = NA, dim = c(1,p,J))
      dimnames(l0) <- list("", colnames(x), colnames(y))
      l0[,,1] <- testRun$l0
      test.unconstrained$logLik.H0 <- l0
    }

    if(CI.calculate){
      # Profile confidence intervals - only when method = "likelihood" ??
      profileCI.lower <- profileCIs(x, y, wt, B, B.inf, side = -1, CI.alpha, l.max, step = 0.05,
                                    useAstar = useAstar.LRCI, penalized, tol = tol, verbose=verbose)
      if (verbose) { cat("profileCI.lower Complete. \n") }
      profileCI.upper <- profileCIs(x, y, wt, B, B.inf, side = 1, CI.alpha, l.max, step = 0.05,
                                    useAstar = useAstar.LRCI, penalized, tol = tol, verbose=verbose)
      if (verbose) { cat("profileCI.upper Complete. \n") }
      for (i in 1:J) {
        CI.lower[1,,i] <- t(profileCI.lower)[i,]
        CI.upper[1,,i] <- t(profileCI.upper)[i,]
      }
    }#if(profileCI.calculate) ends
    if (joint) {
      # h0=2: Likelihood ratio test for H_0: b_{p,1} = b_{p,2} = ... = b_{p,J} = 0 (p-th covariate)
      testRun <- test.LR(x, y, wt, mt, B, h0 = 2, penalized, tol = tol, verbose=verbose);

      test.h0.colnames = c("ChiSq","Pr(>ChiSq)")
      test.h0.all0$h0 <- "H_0: b_{p,1} = b_{p,2} = ... = b_{p,J} = 0 (p-th covariate)"
      test.h0.all0$converged <- testRun$conv
      test.h0.all0$test.h0 <- matrix(NA,ncol=length(test.h0.colnames),nrow=p)
      dimnames(test.h0.all0$test.h0) <- list( colnames(x), test.h0.colnames )
      #test.h0.all0$test.h0[,"logLik"] <-
      l0.joint[,,1] <- t(testRun$l0)
      test.h0.all0$test.h0[,"ChiSq"] <- t(testRun$statistic)
      test.h0.all0$test.h0[,"Pr(>ChiSq)"] <- t(testRun$pvalue)
      beta0all0 <- testRun$beta0.array
      var0all0 <- testRun$var0.array

      ## temporary -- TAKE OUT LATER -- DOES NOT WORK
      ## NEED TO RETURN "ARRAY" of p matrices
      ret$Ainv0all0 <- testRun$Ainv0.array

      # h0=3: Likelihood ratio test for H_0: b_{p,1} = b_{p,2} = ... = b_{p,J} (p-th covariate)
      testRun <- test.LR(x, y, wt, mt, B, h0 = 3, penalized, tol = tol, verbose=verbose);
      beta0allequal <- testRun$beta0.array
      var0allequal <- testRun$var0.array
      test.h0.allequal$h0 <- "H_0: b_{p,1} = b_{p,2} = ... = b_{p,J} (p-th covariate)"
      test.h0.allequal$converged <- testRun$conv
      test.h0.colnames <- c("ChiSq","Pr(>ChiSq)")
      test.h0.allequal$test.all0.vs.constraint <-
        test.h0.allequal$test.h0 <- matrix(NA,ncol=length(test.h0.colnames),nrow=p)
      dimnames(test.h0.allequal$test.all0.vs.constraint) <- dimnames(test.h0.allequal$test.h0) <- list(colnames(x), test.h0.colnames)
      #test.h0.allequal$test.h0[,"logLik"] <-
      l0.joint[,,2] <- t(testRun$l0)
      test.h0.allequal$test.h0[,"ChiSq"] <- t(testRun$statistic)
      test.h0.allequal$test.h0[,"Pr(>ChiSq)"] <- t(testRun$pvalue)
      chisq <- -2*(l0.joint[,,1] - t(testRun$l0))
      test.h0.allequal$test.all0.vs.constraint[,"ChiSq"] <- chisq
      test.h0.allequal$test.all0.vs.constraint[,"Pr(>ChiSq)"] <- pchisq(chisq, df=1, lower.tail=F)

      ## temporary -- TAKE OUT LATER
      ret$Ainv0allequal <- testRun$Ainv0.array

      # h0=4: Likelihood ratio test for proportionality: H_0: b_{p,J} = J*b_{p,1}, b_{p,J-1} = (J-1)*b_{p,1}, ... , b_{p,2} = 2*b_{p,1} (p-th covariate)
      if (J >= 2) {
        testRun <- test.LR(x, y, wt, mt, B, h0 = 4, penalized, tol = tol, verbose=verbose);
        l0.joint[,,3] <- t(testRun$l0)
        test.h0.proportion$h0 <- "H_0: b_{p,J} = J*b_{p,1}, b_{p,J-1} = (J-1)*b_{p,1}, ... , b_{p,2} = 2*b_{p,1} (p-th covariate)"
        test.h0.proportion$converged <- testRun$conv
        test.h0.colnames <- c("ChiSq","Pr(>ChiSq)")
        test.h0.proportion$test.all0.vs.constraint <-
          test.h0.proportion$test.h0 <- matrix(NA,ncol=length(test.h0.colnames),nrow=p)
        dimnames(test.h0.proportion$test.all0.vs.constraint) <-
          dimnames(test.h0.proportion$test.h0) <-
          list(colnames(x), test.h0.colnames)

        test.h0.proportion$test.h0[,"ChiSq"] <- t(testRun$statistic)
        test.h0.proportion$test.h0[,"Pr(>ChiSq)"] <- t(testRun$pvalue)
        chisq <- -2*(l0.joint[,,1] - l0.joint[,,3])
        test.h0.proportion$test.all0.vs.constraint[,"ChiSq"] <- chisq
        test.h0.proportion$test.all0.vs.constraint[,"Pr(>ChiSq)"] <- pchisq(chisq, df=1, lower.tail=F)
        beta0proportion <- testRun$beta0.array
        var0proportion <- testRun$var0.array

        ## temporary -- TAKE OUT LATER
        ret$Ainv0proportion <- testRun$Ainv0.array

      }
      else {#if (J < 2) ?? no test ??
        #test.h0.proportion$test.h0[,"logLik"] <- NA
        test.h0.proportion$test.h0[,"ChiSq"] <- NA
        test.h0.proportion$test.h0[,"Pr(>ChiSq)"] <- NA
        test.h0.proportion$test.all0.vs.constraint[,"ChiSq"] <- NA
        test.h0.proportion$test.all0.vs.constraint[,"Pr(>ChiSq)"] <- NA
      }
      if (verbose) { cat("test.LR (joint) Complete. \n") }
    }
  }

  ## --------------- REMOVE THIS LATER!!! --------------- ##
  #
  # for investigating the dimension of the problem
  ret$Ainv <- Ainv
  ret$Astarinv <- Astarinv
  ## --------------- REMOVE THIS LATER!!! --------------- ##

  #*** NEED TO THINK ABOUT THIS
  if (useAstar.wald) Ainv <- Astarinv# ** when is it being used?
  # Wald test
  if (method == "wald") {
    ## Wald test for H_0: b_{p,j} = 0 (p-th covariate, jth category)
    for (i in 1:J) {
      stat[,,i] <- (coef[,,i]/sqrt(diag(var[,,i])))^2
      pval[,,i] <- pchisq(stat[,,i], df = 1, lower.tail = FALSE)
    }
    if(CI.calculate){
      CI.lower <-  CI.upper <- array(data = NA, dim = c(1,p,J))
      dimnames(CI.lower) <- dimnames(CI.upper) <-list("", colnames(x), colnames(y))
      for (i in 1:J) {
        CI.lower[,,i] <- coef[,,i] - qnorm(p = 1 - CI.alpha/2) * sqrt(diag(var[,,i]))
        CI.upper[,,i] <- coef[,,i] + qnorm(p = 1 - CI.alpha/2) * sqrt(diag(var[,,i]))
      }
    }
    test.unconstrained <- list()
    test.unconstrained$statistic <- stat
    test.unconstrained$pvalue <- pval

    if (joint) {
      # Wald test for H_0: b_{p,1} = b_{p,2} = ... = b_{p,J} = 0 (p-th covariate)
      for (i in 1:p) {
        C <- matrix(data = 0, nrow = J, ncol = p * J); C[1:J,(((i - 1) * J) + 1):(i * J)] <- diag(J)
        stat.joint[1,i,1] <- test.wald(vec(t(B)), Ainv, C)
        pval.joint[1,i,1] <- pchisq(stat.joint[1,i,1], df = J, lower.tail = FALSE)
      }
      test.h0.all0$h0 <- "H_0: b_{p,1} = b_{p,2} = ... = b_{p,J} = 0 (p-th covariate)"
      test.h0.colnames <- c("ChiSq","Pr(>ChiSq)")
      test.h0.all0$test.h0 <- matrix(NA,ncol=length(test.h0.colnames),nrow=p)
      dimnames(test.h0.all0$test.h0) <- list(colnames(x), test.h0.colnames)
      test.h0.all0$test.h0[,"ChiSq"] <- stat.joint[,,1]
      test.h0.all0$test.h0[,"Pr(>ChiSq)"] <- pval.joint[,,1]

      # Wald test for H_0: b_{p,1} = b_{p,2} = ... = b_{p,J} for the p-th covariate
      for (i in 1:p) {
        C <- matrix(data = 0, nrow = J-1, ncol = p *J); C[1:(J - 1), ((i - 1) * J + 1):((i * J) - 1)] <- diag(J - 1); for (j in (1:J - 1)) C[j,((i - 1) * J) + 1 + j] <- (-1)
        stat.joint[1,i,2] <- test.wald(vec(t(B)), Ainv, C)
        pval.joint[1,i,2] <- pchisq(stat.joint[1,i,2], df = J - 1, lower.tail = FALSE)
      }
      test.h0.allequal$h0 <- "H_0: b_{p,1} = b_{p,2} = ... = b_{p,J} (p-th covariate)"
      test.h0.allequal$test.h0 <- matrix(NA,ncol=length(test.h0.colnames),nrow=p)
      dimnames(test.h0.allequal$test.h0) <- list(colnames(x), test.h0.colnames)
      test.h0.allequal$test.h0[,"ChiSq"] <- stat.joint[,,2]
      test.h0.allequal$test.h0[,"Pr(>ChiSq)"] <- pval.joint[,,2]

      # Wald test for proportionality, H_0: b_{p,J} = J*b_{p,1}, b_{p,J-1} = (J-1)*b_{p,1}, ... , b_{p,2} = 2*b_{p,1} (p-th covariate)
      if (J >= 2) {
        for (i in 1:p) {
          C <- matrix(data = 0, nrow = J-1, ncol = p *J)
          C[1:(J - 1), ((i - 1) * J + 1):((i * J) - 1)] <- diag(J - 1);
          for (j in (1:J - 1)) C[j,((i - 1) * J) + 1 + j] <- (-j/(j+1))
          stat.joint[1,i,3] <- test.wald(vec(t(B)), Ainv, C)
          pval.joint[1,i,3] <- pchisq(stat.joint[1,i,3], df = J - 1, lower.tail = FALSE)
        } }
      else {
        stat.joint[1,,3] <- NA
        pval.joint[1,,3] <- NA
      }
      test.h0.proportion$h0 <- "H_0: b_{p,J} = J*b_{p,1}, b_{p,J-1} = (J-1)*b_{p,1}, ... , b_{p,2} = 2*b_{p,1} (p-th covariate)"
      test.h0.proportion$test.h0 <- matrix(NA,ncol=length(test.h0.colnames),nrow=p)
      dimnames(test.h0.proportion$test.h0) <- list(colnames(x), test.h0.colnames)
      test.h0.proportion$test.h0[,"ChiSq"] <- stat.joint[,,3]
      test.h0.proportion$test.h0[,"Pr(>ChiSq)"] <- pval.joint[,,3]

      if (verbose) { cat("Wald Tests (joint) Complete. \n") }
    }#if(joint) ends
  }

  ####------------------------------ NEED TO FIX!!! ------------------------------####
  if (method == "score") {
    ## Score test for H_0: b_{p,j} = 0 (p-th covariate, jth category)
    ## carry out estimation and score test undr the unconstrained hypothesis when the user intends to
    testRun <- test.score(x, y, wt, mt, B, h0 = 1, penalized, verbose=verbose);
    for (i in 1:J) {
      stat[1,,i] <- t(testRun$statistic)[i,];
      pval[1,,i] <- t(testRun$pvalue)[i,];
    }
    test.unconstrained <- list()
    test.unconstrained$statistic <- stat
    test.unconstrained$pvalue <- pval
    test.unconstrained$converged <- testRun$conv

    if(J==1){
      #if J > 2 - individual l0 is rather meaningless
      #cat("J==1 \n")
      l0 <- array(data = NA, dim = c(1,p,J))
      dimnames(l0) <- list("", colnames(x), colnames(y))
      l0[,,1] <- testRun$l0
      test.unconstrained$logLik.H0 <- l0
    }

    ##***NEW - added by JS (Oct.28, 2015)
    if( !any(testRun$conv) )
      warning("test.score: algorithm did not converge\n")
    if (verbose) { cat("Score Tests Complete. \n") }

    if (joint) {
      ## Score test for H_0: b_{p,1} = b_{p,2} = ... = b_{p,J} = 0 (p-th covariate)
      testRun <- test.score(x, y, wt, mt, B, h0 = 2, penalized, verbose = verbose)
      test.h0.all0$h0 <- "H_0: b_{p,1} = b_{p,2} = ... = b_{p,J} = 0 (p-th covariate)"
      test.h0.all0$converged <- testRun$conv
      test.h0.colnames <- c("ChiSq","Pr(>ChiSq)")
      test.h0.all0$test.h0 <- matrix(NA,ncol=length(test.h0.colnames),nrow=p)
      dimnames(test.h0.all0$test.h0) <- list(colnames(x), test.h0.colnames)
      #test.h0.all0$test.h0[,"logLik"] <-
      l0.joint[,,1] <- t(testRun$l0)
      test.h0.all0$test.h0[,"ChiSq"] <- t(testRun$statistic)
      test.h0.all0$test.h0[,"Pr(>ChiSq)"] <- t(testRun$pvalue)
      beta0all0 <- testRun$beta0.array
      var0all0 <- testRun$var0.array
      if(!any(testRun$conv))
        warning("test.score: algorithm did not converge - joint test for all beta=0 \n")

      ## Score test for H_0: b_{p,1} = b_{p,2} = ... = b_{p,J} (p-th covariate)
      testRun <- test.score(x, y, wt, mt, B, h0 = 3, penalized, verbose = verbose);
      test.h0.allequal$h0 <- "H_0: b_{p,1} = b_{p,2} = ... = b_{p,J} (p-th covariate)"
      test.h0.allequal$converged <- testRun$conv
      test.h0.colnames <- c("ChiSq","Pr(>ChiSq)")
      test.h0.allequal$test.all0.vs.constraint <-
        test.h0.allequal$test.h0 <- matrix(NA,ncol=length(test.h0.colnames),nrow=p)
      dimnames(test.h0.allequal$test.all0.vs.constraint) <-
        dimnames(test.h0.allequal$test.h0) <- list(colnames(x), test.h0.colnames)
      l0.joint[,,2] <- t(testRun$l0)
      test.h0.allequal$test.h0[,"ChiSq"] <- t(testRun$statistic)
      test.h0.allequal$test.h0[,"Pr(>ChiSq)"] <- t(testRun$pvalue)
      chisq <- -2*(l0.joint[,,1] - l0.joint[,,2])
      test.h0.allequal$test.all0.vs.constraint[,"ChiSq"] <- chisq
      test.h0.allequal$test.all0.vs.constraint[,"Pr(>ChiSq)"] <- pchisq(chisq, df=1, lower.tail=F)
      beta0allequal <- testRun$beta0.array
      var0allequal <- testRun$var0.array

      if(!any(testRun$conv))
        warning("test.score: algorithm did not converge - joint test for all beta equal \n")

      ## Score test for proportionality, H_0: b_{p,J} = J*b_{p,1}, b_{p,J-1} = (J-1)*b_{p,1}, ... , b_{p,2} = 2*b_{p,1} (p-th covariate)
      if (J >= 2) {
        testRun <- test.score(x, y, wt, mt, B, h0 = 4, penalized, verbose = verbose);
        test.h0.proportion$h0 <- "H_0: b_{p,J} = J*b_{p,1}, b_{p,J-1} = (J-1)*b_{p,1}, ... , b_{p,2} = 2*b_{p,1} (p-th covariate)"
        test.h0.proportion$converged <- testRun$conv
        test.h0.colnames <- c("ChiSq","Pr(>ChiSq)")
        test.h0.proportion$test.all0.vs.constraint <- test.h0.proportion$test.h0 <- matrix(NA,ncol=length(test.h0.colnames),nrow=p)
        dimnames(test.h0.proportion$test.all0.vs.constraint) <-
          dimnames(test.h0.proportion$test.h0) <- list(colnames(x), test.h0.colnames)
        l0.joint[,,3] <- t(testRun$l0)
        test.h0.proportion$test.h0[,"ChiSq"] <- t(testRun$statistic)
        test.h0.proportion$test.h0[,"Pr(>ChiSq)"] <- t(testRun$pvalue)
        chisq <- -2*(l0.joint[,,1]- l0.joint[,,3])
        test.h0.proportion$test.all0.vs.constraint[,"ChiSq"] <- chisq
        test.h0.proportion$test.all0.vs.constraint[,"Pr(>ChiSq)"] <- pchisq(chisq, df=1, lower.tail=F)
        beta0proportion <- testRun$beta0.array
        var0proportion <- testRun$var0.array
        if( !any(testRun$conv) )
          warning("test.score: algorithm did not converge - joint test for proportion beta \n")

      }
      else {
        #test.h0.proportion$test.h0[,"logLik"] <- NA
        test.h0.proportion$test.h0[,"ChiSq"] <- NA
        test.h0.proportion$test.h0[,"Pr(>ChiSq)"] <- NA
        test.h0.proportion$test.all0.vs.constraint[,"ChiSq"] <- NA
        test.h0.proportion$test.all0.vs.constraint[,"Pr(>ChiSq)"] <- NA
      }

      if (verbose) { cat("Score Tests (joint) Complete. \n") }
    }
  }#if(method == "score") ends


  if (method == "none")  { #No CIs or Hypothesis tests are performed.
    ret <- list(coefficients = coef, var = var, separation = separation, call = call, method = method, joint = joint)
  }
  else{ #(method != "none")
    ret$call = call
    ret$method = method
    ret$separation <- separation
    ret$converged = fit.conv
    ret$coefficients = coef
    ret$var = var
    ret$CI.calculate = CI.calculate
    if(CI.calculate){
      ret$CI.lower <- CI.lower
      ret$CI.upper <- CI.upper
    }
    ##test
    ret$statistic = test.unconstrained$statistic
    ret$pvalue = test.unconstrained$pvalue
    ret$logLik = l.max
    ret$df=length(colnames(x))*J ## log-likelihood under no constraints ** NEW
    ret$converged.H0 = test.unconstrained$converged
    ret$logLik.H0 = test.unconstrained$logLik.H0

    ret$joint = joint
    if(joint){
        if(method!="wald"){
            #not available for the Wald test
            ret$beta0all0 = beta0all0
            ret$var0all0 = var0all0

            ret$beta0allequal = beta0allequal
            ret$var0allequal = var0allequal

            ret$beta0proportion = beta0proportion
            ret$var0proportion = var0proportion

            ## Joint test results
            ret$logLik.joint = l0.joint
        }
        ## joint test more info
        ret$joint.test.all0 = test.h0.all0
        ret$joint.test.allequal = test.h0.allequal
        ret$joint.test.proportion = test.h0.proportion
    } #if(joint) ends

  }


  #attr(ret, "class") <- c("pmlr")
  class(ret) <- "pmlr"
  return(ret)
}