R/DiSSMod.R

#' Fitting Sample Selection Models for Discrete Response Variables
#'
#' Function \code{DiSSMod} fits sample selection models for discrete random
#' variables, by suitably extending the formulation of the classical
#' Heckman model to the case of a discrete response, but retaining the
#' original conceptual framework. Maximum likelihood estimates are obtained
#' by Newton-Raphson iteration combined with use of profile likelihood.
#'
#' @details
#' The specification of the two linear models regulating the response variable and
#' the selection mechanism, as indicated in the \sQuote{Background} section,
#' is accomplished by two arguments of \code{formula} type,
#' denoted \code{response} and \code{selection}, respectively.
#' Each \code{formula} is specified with the same syntax of similar arguments in
#' standard functions such as \code{lm} and \code{glm}, with the restriction that
#' the intercept term (which is automatically included) must not be removed.
#'
#' The distributional assumptions associated to the \code{response} and \code{selection} components
#' are specified by the arguments \code{resp.dist} and \code{select.dist}, respectively.
#' Argument \code{select.dist} refers to the unobservable continuous variable of which we
#' observe only the dichotomous outcome Yes-No.
#'
#' In this respect, a remark is appropriate about the option \code{"Gumbel"} for \code{select.dist}.
#' This choice is equivalent to the adoption of an Exponential distribution of the selection variables
#' combined  an exponential transformation of the linear predictor of the
#' \code{selection} argument, as it is presented in Section 3.2 of Azzalini et al. (2019).
#' Also, it corresponds to work with the log-transformation of an Exponential variable,
#' which is essentially a Gumbel type of variable, up to a linear transformation with
#' respect to its more commonly employed parameterization.
#'
#' When \code{resp.dist} is \code{"Poisson"} or \code{"NegBinomial"} and \code{trunc.num} is missing,
#' a default choice is made; this equals \code{1.5*m} or \code{2*m} in the two respective cases,
#' where \code{m} denotes the maximum observed value of the response variable.
#'
#' Function \code{DiSSMOd} calls lower level functions, \code{nr.bin, nr.nbinom, nr.pois} and the others
#' for the actual numerical maximization of the log-likelihood via a Newton-Raphson iteration.
#'
#' Notice that the automatic initialization of the \code{alpha} search interval, when this argument is
#' missing, may change in future versions of the package.
#'
#' @concept generalized linear models
#' @concept selection models
#' @concept sample selection
#' @concept Heckman model
#' @concept discrete response
#' @concept maximum likelihood
#' @param response a formula for the response equation.
#' @param selection a formula for the selection equation.
#' @param data a data frame and data has to be included with the form of \code{data.frame}.
#' @param resp.dist a character for the distribution choice of the response variable,
#' \code{"bernoulli"} for Bernoulli distribution, \code{"poisson"} for Poisson distribution,
#' and \code{"negbinomial"} for Negative binomial distribution. Also, the character strings
#' can be abbreviated and can be upper or lower case as preferred.
#' @param select.dist a character for the distribution choice of the selection variable,
#' \code{"gumbel"} for Gumbel distribution, \code{"normal"} for Normal distribution,
#' and \code{"logistic"} for Logistic distribution. Also, the character strings
#' can be abbreviated and can be upper or lower case as preferred.
#' @param alpha a vector of \eqn{alpha} values on which the profile log-likelihood function is evaluated;
#' if the argument is missing, a set of values in the interval \code{(-10, 10)} is used for the initial search,
#' followed by a second search on a revised interval which depends on the outcome from the first search.
#' @param trunc.num an integer numeric constant used as the truncation point of an infine summation of probabilities
#' involved when \code{resp.dist} equals \code{"Poisson"} or \code{"NegBinomial"};
#' if the argument is missing, a default choice is made, as described in the \sQuote{Details} section. Notice: this
#' default choice of \code{trunc.num} may be subject to revision in some future version of the package,
#' and the argument \code{trunc.num} itselt may possibly be replaced by some other ingredient.
#' @param standard a logical value for the standardizing explanatory variables, if \code{TRUE} two types of values
#' (standardized and not) will be returned.
#' @param verbose an integer value for the level of printed details (values: 0|1|2); the default value is 1
#' which stands for shortly printed details. If the value is 2, more details are viewed such as
#' values of the log likelihood functions and iteration numbers. If the value is 0, there is no printed
#' detail.
#' @param eps a numeric value for the estimating parameters, which is needed for the step of the optimization.
#' If the sum of absolute differences between present step estimated parameters and former step
#' estimated parameters is smaller than \code{eps}, we assume that estimated parameters are
#' optimized.
#' @param itmax an integer stands for maximum number for the iteration of optimizing the parameters.
#' @return \code{DiSSMod} returns an object of class \code{"DiSSMod"},
#' which is a list containing following components:
#' @return
#' \item{call}{a matched call.}
#' \item{standard}{a logical value, stands for standardization or not.}
#' \item{st_loglik}{a vector containing the differences between log likelihoods and maximized log likelihood.}
#' \item{max_loglik}{a maximized log likelihood value.}
#' \item{mle_alpha}{a maximized likelihood estimator of alpha.}
#' \item{alpha}{a vector containing grids of the alpha}
#' \item{Nalpha}{a vector containing proper alpha, which does not have
#' \code{NA} value for corresponding log likelihood.}
#' \item{num_NA}{a number of \code{NA} values of log likelihoods.}
#' \item{n_select}{a number of selected response variables.}
#' \item{n_all}{a number of all response variables.}
#' \item{estimate_response}{estimated values for the response model.}
#' \item{std_error_response}{estimated standard errors for the response model.}
#' \item{estimate_selection}{estimated values for the selection model.}
#' \item{std_error_selection}{estimated standard errors for the selection model.}
#' @importFrom MASS glm.nb
#' @importFrom sfsmisc posdefify
#' @importFrom matrixcalc is.positive.definite
#' @importFrom grDevices extendrange
#' @importFrom graphics abline plot points rug
#' @importFrom stats approxfun binomial contrasts dbinom
#' dexp dlogis dnbinom dnorm dpois glm model.matrix
#' model.response na.pass optim pexp plogis pnorm
#' poisson qchisq qnorm sd symnum uniroot
#' @import psych
#' @examples
#' set.seed(45)
#' data(DoctorRWM, package = "DiSSMod")
#' n0 <- 600
#' set.n0 <- sample(1:nrow(DoctorRWM), n0)
#' reduce_DoctorRWM <- DoctorRWM[set.n0,]
#' result0 <- DiSSMod(response = as.numeric(DOCVIS > 0) ~ AGE + INCOME_SCALE + HHKIDS + EDUC + MARRIED,
#'                    selection = PUBLIC ~ AGE + EDUC + FEMALE,
#'                    data = reduce_DoctorRWM, resp.dist="bernoulli", select.dist = "normal",
#'                    alpha = seq(-5.5, -0.5, length.out = 21), standard = TRUE)
#'
#' print(result0)
#'
#' data(CreditMDR, package = "DiSSMod")
#' n1 <- 600
#' set.n1 <- sample(1:nrow(CreditMDR), n1)
#' reduce_CreditMDR <- CreditMDR[set.n1,]
#' result1 <- DiSSMod(response = MAJORDRG ~ AGE + INCOME + EXP_INC,
#'                    selection = CARDHLDR ~ AGE + INCOME + OWNRENT + ADEPCNT + SELFEMPL,
#'                    data = reduce_CreditMDR, resp.dist="poi", select.dist = "logis",
#'                    alpha = seq(-0.3, 0.3,length.out = 21), standard = FALSE, verbose = 1)
#'
#' print(result1)
#'
#' @references Azzalini, A., Kim, H.-M. and Kim, H.-J. (2019) Sample selection
#' models for discrete and other non-Gaussian response variables.
#'  \emph{Statistical Methods & Applications}, \strong{28}, 27--56. First online 30 March 2018.
#' \url{https://doi.org/10.1007/s10260-018-0427-1}
#' @seealso The functions \code{\link[DiSSMod]{summary.DiSSMod}}, \code{\link[DiSSMod]{coef.DiSSMod}},
#' \code{\link[DiSSMod]{confint.DiSSMod}}, \code{\link[DiSSMod]{plot.DiSSMod}}
#'  are used to obtain and print a summary, coefficients, confidence interval and
#'  plot of the results.
#' @seealso The generic function \code{\link[stats]{logLik}} is used to obtain maximum log likelihood of the
#' result.
#' @seealso See also \code{\link[stats]{lm}}, \code{\link[stats]{glm}} and
#' \code{\link[stats]{formula}}.
#' @section Background:
#' Function \code{DiSSMod} fits sample selection models for discrete random variables,
#' by suitably extending the formulation of the classical Heckman model to the case of a discrete response,
#' but retaining the original conceptual framework.
#' This logic involves the following key ingredients: (1) a linear model indicating which explanatory variables
#' influence the response variable; (2) a linear model indicating  which (possibly different) explanatory variables,
#' besides the response variable itself, influence  a `selection variable', which is intrinsically continuous but
#' we only observe a dichotomous outcome from it, of type Yes-No, which selects which are the observed response cases;
#' (3) distributional assumptions on the response and the selection variable.
#'
#' The data fitting method is maximum likelihood estimation (MLE), which operates in two steps:
#' (i) for each given value of parameter \eqn{alpha} which regulates the level of selection,
#'  MLE is performed for all the remaining parameters, using a Newton-Raphson iteration;
#' (ii) a scan of the \eqn{alpha} axis builds the  profile log-likelihood function and
#'  its maximum point represents the overall MLE.
#'
#' A detailed account of the underlying theory and the operational methodology is provided by Azzalini et al. (2019).
#'
#' @export
DiSSMod <- function(response, selection, data, resp.dist, select.dist, alpha,
                    trunc.num, standard = FALSE, verbose = 1, eps = 1e-7,
                    itmax = 1e+3)
{
  cl <- match.call()
  mf <- match.call(expand.dots = FALSE)

  m1 <- match(c("response", "data", "subset", "offset"), names(mf), 0)
  mf1 <- mf[c(1, m1)]
  mf1$drop.unused.levels <- TRUE
  mf1[[1]] <- quote(model.frame)
  names(mf1)[2] <- "formula"
  mf1 <- eval(mf1, parent.frame())
  mt1 <- attr(mf1, "terms")

  m2 <- match(c("selection", "data", "subset"), names(mf), 0)
  mf2 <- mf[c(1, m2)]
  mf2$drop.unused.levels <- TRUE
  mf2$na.action <- na.pass
  names(mf2)[2] <- "formula"
  mf2[[1]] <- quote(model.frame)
  mf2 <- eval(mf2, parent.frame())
  mt2 <- attr(mf2, "terms")

  resp.dist <- match.arg(tolower(resp.dist), c("bernoulli", "poisson", "negbinomial"))
  select.dist <- match.arg(tolower(select.dist), c("gumbel", "normal", "logistic"))

  pre.search <- missing(alpha)

  y.out <- model.response(mf1, "numeric")
  u.sel <- model.response(mf2, "numeric")

  m <- max(y.out, na.rm = TRUE)
  if(missing(trunc.num)) {
    if(resp.dist=="poisson") {
      trunc.num <- 1.5*m
    } else if(resp.dist=="negbinomial") {
      trunc.num <- 2*m
    } else {
      trunc.num <- NULL }
  }

  n <- length(y.out)
  n1 <- sum(u.sel==1)

  yi <- y.out[u.sel==1]
  yi <- c(yi,rep(NaN,n-n1))

  x.mat <- model.matrix(mt1, mf1, contrasts)[,-1]
  w.mat <- model.matrix(mt2, mf2, contrasts)[,-1]

  x.name <- names(as.data.frame(x.mat))
  w.name <- names(as.data.frame(w.mat))

  x.mat <- apply(as.matrix(x.mat), 2, function(x, a, b){c(x[a],x[b])}, a = c(u.sel==1), b = c(u.sel==0) )
  w.mat <- apply(as.matrix(w.mat), 2, function(x, a, b){c(x[a],x[b])}, a = c(u.sel==1), b = c(u.sel==0) )

  if(standard){
    mean.x.vec <- apply(x.mat, 2, mean)
    mean.w.vec <- apply(w.mat, 2, mean)
    sd.x.vec <- apply(x.mat, 2, sd)
    sd.w.vec <- apply(w.mat, 2, sd)

    x.mat <- apply(as.matrix(x.mat), 2, function(x){(x-mean(x))/sd(x)} )
    w.mat <- apply(as.matrix(w.mat), 2, function(x){(x-mean(x))/sd(x)} )

    cm <- diag(1/c(1, sd.x.vec, 1, sd.w.vec))
    cm[1, c(2:(ncol(x.mat)+1))] <- c(-mean.x.vec/sd.x.vec)
    cm[(ncol(x.mat)+2), c((ncol(x.mat)+3):(ncol(x.mat)+ncol(w.mat)+2))] <- c(-mean.w.vec/sd.w.vec)
  }

  one <- rep(1,n)

  xij <- cbind(one,x.mat)
  wij <- cbind(one,w.mat)

  if(pre.search) {
    alpha <- seq(-10, 10, length=21)
    if(verbose > 0) cat("Search preliminary alpha interval (", format(range(alpha)), ")\n")
  }

  leng <- length(alpha)

  loglik <- rep(0,leng)
  theta_Finf <- NULL
  mi <- rep(1,n)

  if(verbose == 1) cat("Running up to", leng, ": ")
  for (i in 1:leng) {
    alpha0 <- alpha[i]
    theta0 <- NULL
    temp <- switch(resp.dist,
                   "bernoulli" = nr.bin(mi, xij, wij, theta0, yi, alpha0, select.dist, eps, itmax),
                   "poisson" = nr.pois(xij, wij, theta0, yi, alpha0, select.dist, eps, itmax, trunc.num),
                   "negbinomial" = nr.nbinom(xij, wij, theta0, yi, alpha0, select.dist, eps, itmax, trunc.num)
    )
    loglik[i] <- temp[[3]]
    theta_Finf <- c(theta_Finf,temp)
    if(verbose == 1) cat(i, "")
    if(verbose > 1) cat("i, alpha[i], logLik[i]:", i, format(alpha0), format(loglik[i]), "\n")
  }
  if(verbose == 1) cat("\n")
  numind <- which.max(loglik)
  if(pre.search) {
    alpha. <- alpha[numind]
    bad <- is.na(loglik) | (loglik[numind] - loglik > 15)
    shift <- loglik[numind] - qchisq(0.9, 1)
    if(sum(!bad) < 3) {
      message("This log-likelihood function takes on peculiar values.")
      if(verbose < 2) print(rbind(alpha, logLik=loglik))
      stop("I give up. Sorry.")
    }
    af <- approxfun(alpha[!bad], loglik[!bad] - shift, method="linear")
    vaf <- Vectorize(af)
    alpha1 <- try(uniroot(vaf, c(min(alpha[!bad]), alpha.)), silent = TRUE)
    alpha1 <- if(class(alpha1)=="list") alpha1$root else min(alpha)
    alpha2 <- try(uniroot(vaf, c(alpha., max(alpha[!bad]))), silent = TRUE)
    alpha2 <- if(class(alpha2)=="list") alpha2$root else max(alpha)
    r <- extendrange(c(alpha1, alpha2), f=0.2)
    leng <- 26
    alpha <- seq(r[1], r[2], length=leng)
    loglik <- rep(0,leng)
    theta_Finf <- NULL
    mi <- rep(1,n)
    if(verbose > 0) cat("Search in revised alpha interval (", format(range(alpha)), ")\n")
    if(verbose == 1) cat("Running up to", leng, ": ")
    for (i in 1:leng) {
      alpha0 <- alpha[i]
      theta0 <- NULL
      temp <- switch(resp.dist,
                     "bernoulli" = nr.bin(mi, xij, wij, theta0, yi, alpha0, select.dist, eps, itmax),
                     "poisson" = nr.pois(xij, wij, theta0, yi, alpha0, select.dist, eps, itmax, trunc.num),
                     "negbinomial" = nr.nbinom(xij, wij, theta0, yi, alpha0, select.dist, eps, itmax, trunc.num)
      )
      loglik[i] <- temp[[3]]
      theta_Finf <- c(theta_Finf,temp)
      if(verbose == 1) cat(i, "")
      if(verbose > 1) cat("i, alpha[i], logLik[i]:", i, format(alpha0), format(loglik[i]), "\n")
    }
    if(verbose == 1) cat("\n")
    numind <- which.max(loglik)
  }

  estPar <- theta_Finf[[(numind-1)*3+1]]
  FisherInf <- theta_Finf[[(numind-1)*3+2]]
  if(resp.dist=="negbinomial"){
    FisherInf_b <- FisherInf
    stderr_b <- sqrt(diag(solve(  FisherInf_b  )))[ncol(x.mat)+ncol(w.mat)+3]
    psi <- estPar[ncol(x.mat)+ncol(w.mat)+3] <- exp(estPar[ncol(x.mat)+ncol(w.mat)+3])
    inv.b.mat <- diag(c(rep(1, ncol(x.mat)+ncol(w.mat)+2), 1/psi))
    FisherInf <- inv.b.mat %*% FisherInf %*% inv.b.mat
  }
  stError <- sqrt(diag(solve(  FisherInf  )))

  if(standard) {
    ori_estPar <- as.vector(cm %*% estPar)
    cov_mle <- solve(FisherInf)
    cov_MLE <- cm %*% cov_mle %*% t(cm)
    ori_stError <- sqrt(diag(cov_MLE))

    if(resp.dist == "negbinomial"){
      cov_mle_b <- solve(FisherInf_b)
      cov_MLE_b <- cm %*% cov_mle_b %*% t(cm)
      ori_stderr_b <- sqrt(diag(cov_MLE_b))[ncol(x.mat)+ncol(w.mat)+3]
    }

    ori_estimate_response <- ori_estPar[1:(ncol(x.mat)+1)]
    ori_std_error_response <- ori_stError[1:(ncol(x.mat)+1)]
    ori_estimate_selection <- ori_estPar[(ncol(x.mat)+2):(ncol(x.mat)+ncol(w.mat)+2)]
    ori_std_error_selection <- ori_stError[(ncol(x.mat)+2):(ncol(x.mat)+ncol(w.mat)+2)]
  }

  if(resp.dist=="negbinomial") {
    theta <- estPar[ncol(x.mat)+ncol(w.mat)+3]
    theta.stderr <- stError[ncol(x.mat)+ncol(w.mat)+3]
    if(standard) {
      ori_theta <- ori_estPar[ncol(x.mat)+ncol(w.mat)+3]
      ori_theta.stderr <- ori_stError[ncol(x.mat)+ncol(w.mat)+3]
    }
  }

  Nloglik <- loglik[!is.na(loglik)]
  Nalpha <- alpha[!is.na(loglik)]
  numNA <- sum(is.na(loglik))

  maxloglik <- max(Nloglik)
  mle_alpha <- Nalpha[which.max(Nloglik)]

  stloglik <- loglik - maxloglik

  result <- list( call = cl, standard = standard, st_loglik = stloglik,
                  max_loglik = maxloglik, mle_alpha = mle_alpha,
                  alpha = alpha, Nalpha = Nalpha, num_NA = numNA,
                  n_select = n1, n_all = n,
                  estimate_response = estPar[1:(ncol(x.mat)+1)],
                  std_error_response = stError[1:(ncol(x.mat)+1)],
                  estimate_selection = estPar[(ncol(x.mat)+2):(ncol(x.mat)+ncol(w.mat)+2)],
                  std_error_selection = stError[(ncol(x.mat)+2):(ncol(x.mat)+ncol(w.mat)+2)]
  )
  if(resp.dist=="negbinomial"){
    result$theta <- theta
    result$theta_std_error <- theta.stderr
    result$std_error_b <- stderr_b
  }
  if(standard){
    result$ori_estimate_response <- ori_estimate_response
    result$ori_std_error_response <- ori_std_error_response
    result$ori_estimate_selection <- ori_estimate_selection
    result$ori_std_error_selection <- ori_std_error_selection
    if(resp.dist=="negbinomial"){
      result$ori_theta <- ori_theta
      result$ori_theta_std_error <- ori_theta.stderr
      result$ori_b_std_error <- ori_stderr_b
    }
  }
  result$names_response <- c( "(Intercept)", x.name )
  result$names_selection <- c( "(Intercept)", w.name )

  class(result) <- "DiSSMod"
  result
}

#' @method print DiSSMod
#' @export
print.DiSSMod <- function(x, digits = max(3, getOption("digits") - 3), ...)
{
  cat("\nCall:\n")
  print(x$call)
  cat("\nMLE of alpha:", x$mle_alpha, "\n")
  cat("Maximum log likelihood:", x$max_loglik, "\n")
  if(x$standard) cat("\n--------------- Standardized Result ---------------\n")
  cat("\nResponse coefficients:\n")
  print(t(data.frame(Estimate = x$estimate_response,
                     row.names = x$names_response, check.names=FALSE)), digits = digits )
  cat("\nSelection coefficients:\n")
  print(t(data.frame(Estimate = x$estimate_selection,
                     row.names = x$names_selection, check.names=FALSE)), digits = digits )
  cat("\n")
  if(x$standard) {
    cat("--------------- Original Scale Result ---------------\n")
    cat("\nResponse coefficients:\n")
    print(t(data.frame(Estimate = x$ori_estimate_response,
                       row.names = x$names_response, check.names=FALSE)), digits = digits )
    cat("\nSelection coefficients:\n")
    print(t(data.frame(Estimate = x$ori_estimate_selection,
                       row.names = x$names_selection, check.names=FALSE)), digits = digits )
    cat("\n")
  }
}

#' Summarizing Discrete Sample Selection Model Fits
#'
#' \code{summary} method for a class \code{"DiSSMod"}.
#'
#' @param object an object of class \code{"DiSSMod"} made by the function \code{DiSSMod}.
#' @param x an object of class \code{"summary.DiSSMod"}.
#' @param digits a numeric number of significant digits.
#' @param \dots additional control argument is as follows.
#' \itemize{
#' \item \code{level}: an option for controlling the significance level of confidence interval.
#' It has to be given in probability between 0 and 1. Initial level is set to \eqn{1 - \alpha = 0.95}.
#' }
#' @details If \code{standard} equals \code{TRUE}, \code{summary} also additionally returns summary
#' statistics of standardized results. Otherwise, it just returns summary statistics as similar statistics
#' as the generic function \code{summary}.
#' @return The function \code{summary.DiSSMod} returns a list of summary statistics of the fitted
#' discrete sample selection model given in \code{object}.
#' @return The components, which are not duplicated from the \code{object}, are as follows:
#' @return
#' \item{z.value_response}{Z statistics (normal distribution) for coefficients of response equation.}
#' \item{z.value_selection}{Z statistics (normal distribution) for coefficients of selection equation.}
#' \item{CI_alpha}{confidence interval of the parameter \code{alpha}.}
#' \item{level}{a numeric value between 0 and 1 for controlling the significance level of confidence interval.
#'  Initial level is set to \eqn{1 - \alpha = 0.95}.}
#' @examples
#' # example continued from DiSSMod
#' set.seed(45)
#' data(DoctorRWM, package = "DiSSMod")
#' n0 <- 600
#' set.n0 <- sample(1:nrow(DoctorRWM), n0)
#' reduce_DoctorRWM <- DoctorRWM[set.n0,]
#' result0 <- DiSSMod(response = as.numeric(DOCVIS > 0) ~ AGE + INCOME_SCALE + HHKIDS + EDUC + MARRIED,
#'                    selection = PUBLIC ~ AGE + EDUC + FEMALE,
#'                    data = reduce_DoctorRWM, resp.dist="bernoulli", select.dist = "normal",
#'                    alpha = seq(-5.5, -0.5, length.out = 21), standard = TRUE)
#'
#' summary(result0, level = 0.90)
#'
#' data(CreditMDR, package = "DiSSMod")
#' n1 <- 600
#' set.n1 <- sample(1:nrow(CreditMDR), n1)
#' reduce_CreditMDR <- CreditMDR[set.n1,]
#' result1 <- DiSSMod(response = MAJORDRG ~ AGE + INCOME + EXP_INC,
#'                    selection = CARDHLDR ~ AGE + INCOME + OWNRENT + ADEPCNT + SELFEMPL,
#'                    data = reduce_CreditMDR, resp.dist="poi", select.dist = "logis",
#'                    alpha = seq(-0.3, 0.3,length.out = 21), standard = FALSE, verbose = 1)
#'
#' summary(result1)
#'
#' @seealso See also \code{\link{DiSSMod}} and \code{\link[base]{summary}}.
#' @method summary DiSSMod
#' @export
summary.DiSSMod <- function(object, ...)
{
  z.value_response <- object$estimate_response/object$std_error_response
  z.value_selection <- object$estimate_selection/object$std_error_selection
  if(object$standard){
    ori_z.value_response <- object$ori_estimate_response/object$ori_std_error_response
    ori_z.value_selection <- object$ori_estimate_selection/object$ori_std_error_selection
  }
  ci <- confint.DiSSMod(object, ...)
  result <- list( call = object$call, standard = object$standard, n_all = object$n_all, n_select = object$n_select,
                  max_loglik = object$max_loglik, mle_alpha = object$mle_alpha,
                  estimate_response = object$estimate_response, std_error_response = object$std_error_response,
                  estimate_selection = object$estimate_selection, std_error_selection = object$std_error_selection,
                  z.value_response = z.value_response, z.value_selection = z.value_selection
  )
  if(object$standard){
    result$ori_estimate_response <- object$ori_estimate_response
    result$ori_std_error_response <- object$ori_std_error_response
    result$ori_estimate_selection <- object$ori_estimate_selection
    result$ori_std_error_selection <- object$ori_std_error_selection
    result$ori_z.value_response <- ori_z.value_response
    result$ori_z.value_selection <- ori_z.value_selection
  }
  if(!is.null(object$theta)){
    result$theta <- object$theta
    result$theta_std_error <- object$theta_std_error
    if(object$standard){
      result$ori_theta <- object$ori_theta
      result$ori_theta_std_error <- object$ori_theta_std_error
    }
  }
  result$CI_alpha <- ci$alpha
  result$level <- ci$level
  result$names_response <- object$names_response
  result$names_selection <- object$names_selection

  class(result) <- "summary.DiSSMod"
  result
}

#' @rdname summary.DiSSMod
#' @method print summary.DiSSMod
#' @export
print.summary.DiSSMod <- function(x, digits = max(3, getOption("digits") - 3), ...)
{
  cat("\nCall:\n")
  print(x$call)
  cat("\nNumber of observations:", x$n_all ,", selected (1):",
      x$n_select,", not selected (0):", x$n_all - x$n_select,"\n")
  cat("Maximum log likelihood:", format(x$max_loglik, nsmall = 2))
  cat("\nMLE of alpha:",x$mle_alpha,"\n")
  cat("\n", x$level*100, "% Confidence interval of alpha:\n")
  cat("\n", format(x$CI_alpha, digits=digits), "\n")
  if(x$standard) cat("\n--------------- Standardized Result ---------------\n")
  cat("\nResponse equation:\n")
  out <- data.frame( estimate = x$estimate_response, std.error = x$std_error_response,
                     z.value = x$z.value_response, pval = 2*pnorm(abs(x$z.value_response), lower.tail = F) )
  Signif.out <- symnum(out$pval, corr = FALSE, na = FALSE,
                       cutpoints = c(0, 0.001, 0.01, 0.05, 0.1, 1),
                       symbols = c("***", "**", "*", ".", " "))
  out <- cbind(out, format(Signif.out))
  row.names(out) <- x$names_response ; colnames(out) <- c("Estimate", "Std. Error", "Z Value", "Pr(>|Z|)", "")
  print(out, digits = digits)
  if(!is.null(x$theta)){
    cat("\n              Theta: ", x$theta)
    cat("\n          Std. Err.: ", x$theta_std_error, "\n")
  }
  cat("\nSelection equation:\n")
  sel <- data.frame( estimate = x$estimate_selection, std.error = x$std_error_selection,
                     z.value = x$z.value_selection, pval = 2*pnorm(abs(x$z.value_selection), lower.tail = F) )
  Signif.sel <- symnum(sel$pval, corr = FALSE, na = FALSE,
                       cutpoints = c(0, 0.001, 0.01, 0.05, 0.1, 1),
                       symbols = c("***", "**", "*", ".", " "))
  sel <- cbind(sel, format(Signif.sel))
  row.names(sel) <- x$names_selection ; colnames(sel) <- c("Estimate", "Std. Error", "Z Value", "Pr(>|Z|)", "")
  print(sel, digits = digits)

  cat("---\n")
  cat("Signif. codes:", 0, "'***'", 0.001, "'**'", 0.01, "'*'", 0.05, "'.'", 0.1, "' '", 1, "\n\n")
  if(x$standard) {
    cat("--------------- Original Scale Result ---------------\n")
    cat("\nResponse equation:\n")
    ori_out <- data.frame( estimate = x$ori_estimate_response, std.error = x$ori_std_error_response,
                           z.value = x$ori_z.value_response, pval = 2*pnorm(abs(x$ori_z.value_response), lower.tail = F) )
    ori_Signif.out <- symnum(ori_out$pval, corr = FALSE, na = FALSE,
                             cutpoints = c(0, 0.001, 0.01, 0.05, 0.1, 1),
                             symbols = c("***", "**", "*", ".", " "))
    ori_out <- cbind(ori_out, format(ori_Signif.out))
    row.names(ori_out) <- x$names_response ; colnames(ori_out) <- c("Estimate", "Std. Error", "Z Value", "Pr(>|Z|)", "")
    print(ori_out, digits = digits)
    if(!is.null(x$ori_theta)){
      cat("\n              Theta: ", x$ori_theta)
      cat("\n          Std. Err.: ", x$ori_theta_std_error, "\n")
    }
    cat("\nSelection equation:\n")
    ori_sel <- data.frame( estimate = x$ori_estimate_selection, std.error = x$ori_std_error_selection,
                           z.value = x$ori_z.value_selection, pval = 2*pnorm(abs(x$ori_z.value_selection), lower.tail = F) )
    ori_Signif.sel <- symnum(ori_sel$pval, corr = FALSE, na = FALSE,
                             cutpoints = c(0, 0.001, 0.01, 0.05, 0.1, 1),
                             symbols = c("***", "**", "*", ".", " "))
    ori_sel <- cbind(ori_sel, ori_Signif.sel)
    row.names(ori_sel) <- x$names_selection ; colnames(ori_sel) <- c("Estimate", "Std. Error", "Z Value", "Pr(>|Z|)", "")
    print(ori_sel, digits = digits)

    cat("---\n")
    cat("Signif. codes:", 0, "'***'", 0.001, "'**'", 0.01, "'*'", 0.05, "'.'", 0.1, "' '", 1, "\n\n")
  }
}

#' @method logLik DiSSMod
#' @export
logLik.DiSSMod <- function(object, ...)
{
  result <- object$max_loglik
  attr(result, "df") <- 1 + length(c(object$estimate_response, object$estimate_response, object$theta))
  class(result) <- "logLik"
  result
}

#' Getting Coefficients of Discrete Sample Selection Model Fits
#'
#' \code{coef} method for a class \code{"DiSSMod"}.
#'
#' @param object an object of class "DiSSMod" made by the function \code{DiSSMod}.
#' @param only a character value for choosing specific variable's coefficients. Initial value is
#' @param \dots not used, but exists because of the compatibility.
#' \code{NULL}, which shows all variable's coefficients. If \code{"response"} is written,
#'  only coefficients for response variables will be returned, and if \code{"selection"} is written,
#'  only coefficients for selection variables will be returned.
#' @return a numeric vector or a list is given.
#' @details It looks as similar as the generic function \code{coef}, but this case there
#' are two equations. Therefore, there exist little differences.
#' @examples
#' # example continued from DiSSMod
#' set.seed(45)
#' data(DoctorRWM, package = "DiSSMod")
#' n0 <- 600
#' set.n0 <- sample(1:nrow(DoctorRWM), n0)
#' reduce_DoctorRWM <- DoctorRWM[set.n0,]
#' result0 <- DiSSMod(response = as.numeric(DOCVIS > 0) ~ AGE + INCOME_SCALE + HHKIDS + EDUC + MARRIED,
#'                    selection = PUBLIC ~ AGE + EDUC + FEMALE,
#'                    data = reduce_DoctorRWM, resp.dist="bernoulli", select.dist = "normal",
#'                    alpha = seq(-5.5, -0.5, length.out = 21), standard = TRUE)
#'
#' coef(result0)
#'
#' data(CreditMDR, package = "DiSSMod")
#' n1 <- 600
#' set.n1 <- sample(1:nrow(CreditMDR), n1)
#' reduce_CreditMDR <- CreditMDR[set.n1,]
#' result1 <- DiSSMod(response = MAJORDRG ~ AGE + INCOME + EXP_INC,
#'                    selection = CARDHLDR ~ AGE + INCOME + OWNRENT + ADEPCNT + SELFEMPL,
#'                    data = reduce_CreditMDR, resp.dist="poi", select.dist = "logis",
#'                    alpha = seq(-0.3, 0.3,length.out = 21), standard = FALSE, verbose = 1)
#'
#' coef(result1)
#'
#' @seealso See also \code{\link{DiSSMod}} and \code{\link[stats]{coef}}.
#' @method coef DiSSMod
#' @export
coef.DiSSMod <- function(object, only = NULL, ...)
{
  if(object$standard){
    if (is.null(only)){
      names_response <- object$names_response
      names_selection <- object$names_selection
      result <- list( response = object$estimate_response,
                      selection = object$estimate_selection,
                      original_response = object$ori_estimate_response,
                      original_selection = object$ori_estimate_selection
      )
      names(result$response) <- names_response
      names(result$original_response) <- names_response
      names(result$selection) <- names_selection
      names(result$original_selection) <- names_selection
    } else if (only=="response"){
      names_response <- object$names_response
      result <- list( response = object$estimate_response,
                      original_response = object$ori_estimate_response
      )
      names(result$response) <- names_response
      names(result$original_response) <- names_response
    } else if (only=="selection"){
      names_selection <- object$names_selection
      result <- list( selection <- object$estimate_selection,
                      original_selection <- object$ori_estimate_selection
      )
      names(result$selection) <- names_selection
      names(result$original_selection) <- names_selection
    } else {
      stop("'only' has to be defined properly")
    }
  } else {
    if (is.null(only)){
      names_response <- object$names_response
      names_selection <- object$names_selection
      result <- list( response = object$estimate_response,
                      selection = object$estimate_selection
      )
      names(result$response) <- names_response
      names(result$selection) <- names_selection
    } else if (only=="response"){
      names_response <- object$names_response
      result <- object$estimate_response
      names(result) <- names_response
    } else if (only=="selection"){
      names_selection <- object$names_selection
      result <- object$estimate_selection
      names(result) <- names_selection
    } else {
      stop("'only' has to be defined properly")
    }
  }
  result
}

#' Getting Confidence Intervals for Parameters of Discrete Sample Selection Model Fits
#'
#' \code{confint} method for a class \code{"DiSSMod"}.
#'
#' @param object an object of class "DiSSMod" made by the function \code{DiSSMod}.
#' @param parm not used, but it exists for compatibility reasons.
#' @param level a numeric value between 0 and 1 for controlling the significance level of confidence interval;
#' default value is 0.95.
#' @param \dots not used, but it exists for compatibility reasons.
#' @return a list, containing \code{level} and \code{confidence intervals} for parameters, is given.
#'
#' @method confint DiSSMod
#' @examples
#' # example continued from DiSSMod
#' set.seed(45)
#' data(DoctorRWM, package = "DiSSMod")
#' n0 <- 600
#' set.n0 <- sample(1:nrow(DoctorRWM), n0)
#' reduce_DoctorRWM <- DoctorRWM[set.n0,]
#' result0 <- DiSSMod(response = as.numeric(DOCVIS > 0) ~ AGE + INCOME_SCALE + HHKIDS + EDUC + MARRIED,
#'                    selection = PUBLIC ~ AGE + EDUC + FEMALE,
#'                    data = reduce_DoctorRWM, resp.dist="bernoulli", select.dist = "normal",
#'                    alpha = seq(-5.5, -0.5, length.out = 21), standard = TRUE)
#'
#' confint(result0, level = 0.90)
#'
#' data(CreditMDR, package = "DiSSMod")
#' n1 <- 600
#' set.n1 <- sample(1:nrow(CreditMDR), n1)
#' reduce_CreditMDR <- CreditMDR[set.n1,]
#' result1 <- DiSSMod(response = MAJORDRG ~ AGE + INCOME + EXP_INC,
#'                    selection = CARDHLDR ~ AGE + INCOME + OWNRENT + ADEPCNT + SELFEMPL,
#'                    data = reduce_CreditMDR, resp.dist="poi", select.dist = "logis",
#'                    alpha = seq(-0.3, 0.3,length.out = 21), standard = FALSE, verbose = 1)
#'
#' confint(result1)
#'
#' @seealso See also \code{\link[stats]{confint}}, \code{\link[DiSSMod]{DiSSMod}} and \code{\link[DiSSMod]{summary.DiSSMod}}.
#' @export
confint.DiSSMod <- function(object, parm, level = 0.95, ...)
{
  coef <- coef(object)
  cf_resp <- coef$response
  cf_select <- coef$selection
  if(object$standard){
    cf_ori_resp <- coef$original_response
    cf_ori_select <- coef$original_selection
  }
  pnames_resp <- names(cf_resp)
  pnames_select <- names(cf_select)
  a <- (1 - level)/2
  a <- c(a, 1 - a)
  fac <- qnorm(a)
  pct <- paste(as.character(a*100), "%")
  ci_resp <- array( NA, dim = c(length(pnames_resp), 2), dimnames = list(pnames_resp, pct) )
  ci_select <- array( NA, dim = c(length(pnames_select), 2), dimnames = list(pnames_select, pct) )
  if(object$standard){
    ci_ori_resp <- array( NA, dim = c(length(pnames_resp), 2), dimnames = list(pnames_resp, pct) )
    ci_ori_select <- array( NA, dim = c(length(pnames_select), 2), dimnames = list(pnames_select, pct) )
  }
  ses_resp <- object$std_error_response
  ses_select <- object$std_error_selection
  if(object$standard){
    ses_ori_resp <- object$ori_std_error_response
    ses_ori_select <- object$ori_std_error_selection
  }
  ci_resp[] <- cf_resp[pnames_resp] + ses_resp %o% fac
  ci_select[] <- cf_select[pnames_select] + ses_select %o% fac
  if(object$standard){
    ci_ori_resp[] <- cf_ori_resp[pnames_resp] + ses_ori_resp %o% fac
    ci_ori_select[] <- cf_ori_select[pnames_select] + ses_ori_select %o% fac
  }
  q <- qchisq(level, 1)
  na <- is.na(object$st_loglik)
  af <- approxfun(object$Nalpha[!na], 2*object$st_loglik[!na] + q)
  vaf <- Vectorize(af)
  alpha1 <- try(uniroot(vaf, c(min(object$Nalpha), object$mle_alpha)), silent = TRUE)
  alpha1 <- if(class(alpha1)=="list") alpha1$root else NA
  alpha2 <- try(uniroot(vaf, c(object$mle_alpha, max(object$Nalpha))), silent = TRUE)
  alpha2 <- if(class(alpha2)=="list") alpha2$root else NA
  alpha <- c(alpha1, alpha2)
  names(alpha) <- pct
  if(!is.null(object$theta)){
    theta <- c(log(object$theta) + (object$std_error_b * fac))
    theta <- exp(theta)
    names(theta) <- pct
    if(object$standard){
      ori_theta <- c(log(object$ori_theta) + (object$ori_std_error_b * fac))
      ori_theta <- exp(ori_theta)
      names(ori_theta) <- pct
    }
  }

  if(object$standard){
    result <- list(level = level, alpha = alpha, response = ci_resp, selection = ci_select,
                   original_response = ci_ori_resp, original_selection = ci_ori_select)
    if(!is.null(object$theta)){
      result$theta <- theta
      result$ori_theta <- ori_theta
    }
  } else {
    result <- list(level = level, alpha = alpha, response = ci_resp, selection = ci_select)
    if(!is.null(object$theta)){
      result$theta <- theta
    }
  }
  result
}

#' Relative Log Likelihood Plot for Discrete Sample Selection Model Fits
#'
#' \code{plot} method for a class \code{"DiSSMod"}.
#'
#' @method plot DiSSMod
#' @param x an object of class "DiSSMod" made by the function \code{DiSSMod}.
#' @param \dots additional control argument is as follows.
#' \itemize{
#' \item \code{level}: an option for controlling the significance level of confidence interval.
#' It has to be given in probability between 0 and 1. Initial level is set to \eqn{1 - \alpha = 0.95}.
#' }
#' @details Function \code{plot} draws a convex line due to the values of twice relative
#' log likelihoods by using the profile likelihood approach with
#' following the grids of \code{alpha}. If confidence interval created from
#' the function \code{confint} exists between the maximum and minimum value of the \code{alpha},
#' there will be two points drawn with the color red. Also, the Maximum Likelihood Estimator (MLE)
#' of \code{alpha} can be seen easily, if it exists between the maximum and minimum value of
#' the \code{alpha}.
#' @examples
#' # example continued from DiSSMod
#' set.seed(45)
#' data(DoctorRWM, package = "DiSSMod")
#' n0 <- 600
#' set.n0 <- sample(1:nrow(DoctorRWM), n0)
#' reduce_DoctorRWM <- DoctorRWM[set.n0,]
#' result0 <- DiSSMod(response = as.numeric(DOCVIS > 0) ~ AGE + INCOME_SCALE + HHKIDS + EDUC + MARRIED,
#'                    selection = PUBLIC ~ AGE + EDUC + FEMALE,
#'                    data = reduce_DoctorRWM, resp.dist="bernoulli", select.dist = "normal",
#'                    alpha = seq(-5.5, -0.5, length.out = 21), standard = TRUE)
#'
#' plot(result0, level = 0.90)
#'
#' data(CreditMDR, package = "DiSSMod")
#' n1 <- 600
#' set.n1 <- sample(1:nrow(CreditMDR), n1)
#' reduce_CreditMDR <- CreditMDR[set.n1,]
#' result1 <- DiSSMod(response = MAJORDRG ~ AGE + INCOME + EXP_INC,
#'                    selection = CARDHLDR ~ AGE + INCOME + OWNRENT + ADEPCNT + SELFEMPL,
#'                    data = reduce_CreditMDR, resp.dist="poi", select.dist = "logis",
#'                    alpha = seq(-0.3, 0.3,length.out = 21), standard = FALSE, verbose = 1)
#'
#' plot(result1)
#'
#' @seealso See also \code{\link{DiSSMod}} and \code{\link[graphics]{plot}}.
#' @export
plot.DiSSMod <- function(x, ...)
{
  plot(x$Nalpha, 2*x$st_loglik[!is.na(x$st_loglik)], type="l", xlab = expression(alpha), ylab=expression("Twice relative log-likelihood of" ~~~ alpha))
  points(x$mle_alpha, min(2*x$st_loglik, na.rm = TRUE), col=2, pch=19)
  ci <- confint.DiSSMod(x, ...)
  abline(h=-qchisq(ci$level, 1), lty=2, col=2)
  rug(c(x$mle_alpha, ci$alpha), col=2)
}

#' German doctor first visits data
#'
#' Data is from Riphahn, Wambach and Million (2003), used for studying longitudinal
#' analysis concerning the usage of the German health insurance system. The original
#' data contain a few years data for patients, but we have only for first year.
#'
#' @usage data(DoctorRWM)
#' @format A data frame with 7293 observations of 26 variables as below;
#' \describe{
#'  \item{ID}{identification number (numeric)}
#'  \item{FEMALE}{female or not (categorical)}
#'  \item{YEAR}{year (categorical)}
#'  \item{AGE}{age (numeric)}
#'  \item{HSAT}{health satisfaction coded 0 (low) to 10 (high) (numeric)}
#'  \item{HANDDUM}{person is handicappe or not (categorical)}
#'  \item{HANDPER}{percentage degree of handicap (numeric)}
#'  \item{HHNINC}{monthly household net income (numeric)}
#'  \item{HHKIDS}{child (ren) below age 16 in household (numeric)}
#'  \item{EDUC}{years of schooling (numeric)}
#'  \item{MARRIED}{person is married or not (categorical)}
#'  \item{HAUPTS}{level of schooling (categorical)}
#'  \item{REALS}{level of schooling (categorical)}
#'  \item{FACHHS}{level of schooling (categorical)}
#'  \item{ABITUR}{level of schooling (categorical)}
#'  \item{UNIV}{level of schooling (categorical)}
#'  \item{WORKING}{employed or not (categorical)}
#'  \item{BLUEC}{person is blue collar worker or not (categorical)}
#'  \item{WHITEC}{person is white collar worker or not (categorical)}
#'  \item{SELF}{person is self-employed or not (categorical)}
#'  \item{BEAMT}{civil servant or not (categorical)}
#'  \item{DOCVIS}{number of doctor visits in last 3 months (numeric)}
#'  \item{HOSPVIS}{number of hospital visits last year (numeric)}
#'  \item{PUBLIC}{person is insured in public health insurance or not (categorical)}
#'  \item{ADDON}{person is insured in add-on insurance or not (categorical)}
#'  \item{INCOME_SCALE}{scaled income; original income/1000 (numeric)}
#' }
#' @source Riphahn, R. T., Wambach, A. and Million, A. (2003) Incentive Effects
#' in the Demand for Health Care: A Bivariate Panel Count Data Estimation,
#'  \emph{Journal of Applied Econometrics}, \strong{18}, 4, 387--405.
#'  Published online 8 October 2002.
#' \url{https://doi.org/10.1002/jae.680}
#' @source \url{http://qed.econ.queensu.ca/jae/2003-v18.4/riphahn-wambach-million/}
#' @references Greene, W. H. (2012) \emph{Econometric Analysis, 7th Edition}. Pearson
#' education.
#' @references Azzalini, A., Kim, H.-M. and Kim, H.-J. (2019) Sample selection
#' models for discrete and other non-Gaussian response variables.
#'  \emph{Statistical Methods & Applications}, \strong{28}, 27--56. First online 30 March 2018.
#' \url{https://doi.org/10.1007/s10260-018-0427-1}
"DoctorRWM"

#' Credit cards derogatory reports data
#'
#' Data is originally from Greene (1992), used for studying statistical credit scoring methods.
#'
#' @usage data(CreditMDR)
#' @format A data frame with 13444 observations of 8 variables as below;
#' \describe{
#'  \item{MAJORDRG}{count of major derogatory reports (numeric)}
#'  \item{CARDHLDR}{1 for cardholders, 0 for denied applicants (categorical)}
#'  \item{AGE}{age in years and twelfths of a year (numeric)}
#'  \item{INCOME}{primary income, divided by 10,000 (numeric)}
#'  \item{OWNRENT}{ownRent, individual owns (1) or rents (0) home (categorical)}
#'  \item{ADEPCNT}{not classified yet (numeric)}
#'  \item{SELFEMPL}{self employed; 1=yes, 0=no (categorical)}
#'  \item{EXP_INC}{average expenditure for 12 months/average monthly income (numeric)}
#' }
#' @source Greene, W. H. (1992) \emph{A Statistical Model for Credit
#' Scoring}. Working Paper No. EC-92-29, Department of Economics, Stern School
#' of Business, New York University, 1992.
#' @source \url{http://pages.stern.nyu.edu/~wgreene/Text/Edition7/tablelist8new.htm}
#' @references Greene, W. H. (2012) \emph{Econometric Analysis, 7th Edition}. Pearson
#' education.
#' @references Azzalini, A., Kim, H.-M. and Kim, H.-J. (2019) Sample selection
#' models for discrete and other non-Gaussian response variables.
#'  \emph{Statistical Methods & Applications}, \strong{28}, 27--56. First online 30 March 2018.
#' \url{https://doi.org/10.1007/s10260-018-0427-1}
"CreditMDR"

Try the DiSSMod package in your browser

Any scripts or data that you put into this service are public.

DiSSMod documentation built on May 2, 2019, 8:33 a.m.