R/f_multilevelIV.R
In REndo: Fitting Linear Models with Endogenous Regressors using Latent Instrumental Variables

Documented in multilevelIV

#'
#' @title Fitting Multilevel GMM Estimation with Endogenous Regressors
#'
#' @template template_param_formuladataverbose
#' @param lmer.control An output from \code{lmerControl} that will be used to fit the \code{lmer} model from which the variance and
#' correlation are obtained.
#'
#' @description
#' Estimates multilevel models (max. 3 levels) employing the GMM approach presented in Kim and Frees (2007).
#' One of the important features is that, using the hierarchical structure of the data, no external instrumental
#' variables are needed, unlike traditional instrumental variable techniques. Specifically, the approach controls for
#' endogeneity at higher levels in the data hierarchy. For example, for a three-level model, endogeneity can be handled
#' either if present at level two, at level three or at both levels. Level one endogeneity, where the regressors are correlated
#' with the structural errors (errors at level one), is not addressed. Moreover, if considered, random slopes cannot be endogenous.
#' Also, the dependent variable has to have a continuous distribution.
#' The function returns the coefficient estimates obtained with fixed effects, random effects and the GMM estimator proposed
#' by Kim and Frees (2007), such that a comparison across models can be done.
#' Asymptotically, the multilevel GMM estimators share the same properties of corresponding fixed effects estimators, but they
#' allow the estimation of all the variables in the model, unlike the fixed effects counterpart.
#'
#' To facilitate the choice of the estimator to be used for the given data, the function also conducts
#' omitted variable test based on the Hausman-test for panel data (Hausman, 1978). It allows to compare
#' a robust estimator and an estimator that is efficient under the null hypothesis of no omitted variables,
#' and to compare two robust estimators at different levels. The results of these tests are returned when
#' calling \code{\link[REndo:summary.rendo.multilevel]{summary()}} on a fitted model.
#'
#' @details
#' \subsection{Method}{
#' Multilevel modeling is a generalization of regression methods that recognize the existence of such data hierarchies
#' by allowing for residual components at each level in the hierarchy. For example, a three-level multilevel model which
#' allows for grouping of students within classrooms, over time, would include time, student and classroom residuals
#' (see equation below). Thus, the residual variance is partitioned into four components:
#' between-classroom (the variance of the classroom-level residuals), within-classroom (the variance of the student-level residuals),
#' between student (the variance of the student-level residuals) and within-student (the variance of the time-level residuals).
#' The classroom residuals represent the unobserved classroom characteristics that affect student's outcomes.
#' These unobserved variables lead to correlation between outcomes for students from the same classroom.
#' Similarly, the unobserved time residuals lead to correlation between a student's outcomes over time.
#' A three-level model can be described as follows:
#'
#' \ifelse{html}{
#' \out{ <div style="text-align:center">y<sub>cst</sub> = Z<sup>1</sup><sub>cst</sub> &beta;<sup>1</sup><sub>cs</sub> + X<sup>1</sup><sub>cst</sub> &beta;<sub>1</sub> &epsilon;<sup>1</sup><sub>cst</sub> </div>
#' <br><div style="text-align:center"> &beta;<sup>1</sup><sub>cs</sub> =  Z<sup>2</sup><sub>cs</sub> &beta;<sup>2</sup><sub>c</sub> + X <sup>2</sup><sub>cs</sub> &beta;<sub>2</sub> &epsilon;<sup>2</sup><sub>cs</sub> </div>
#' <br><div style="text-align:center"> &beta;<sup>2</sup><sub>c</sub> = X<sup>3</sup><sub>c</sub>&beta;<sub>3</sub> &epsilon;<sup>3</sup><sub>c</sub> </div>}}{
#'  \deqn{
#'  y_{cst} = Z^{1}_{cst} \beta^{1}_{cs} + X^{1}_{cst} \beta_{1} +\epsilon^{1}_{cst}}
#'  \deqn{
#'  \beta^{1}_{cs} = Z^{2}_{cs} \beta^{2}_{c} + X^{2}_{cs} \beta_{2} +\epsilon^{2}_{cs}}
#'  \deqn{
#'  \beta^{2}_{c} = X^{3}_{c} \beta_{3} +\epsilon^{3}_{c}}.}
#'
#' Like in single-level regression, in multilevel models endogeneity is also a concern. The additional problem is that in multilevel models
#' there are multiple independent assumptions involving various random components at different levels. Any moderate correlation between some
#' predictors and a random component or error term, can result in a significant bias of the coefficients and of the variance components.
#' The multilevel GMM approach for addressing endogeneity uses both the between and within variations of the exogenous variables, but only the within
#' variation of the variables assumed endogenous. The assumptions in the multilevel generalized moment of moments model is that the errors at each level
#' are normally distributed and independent of each other. Moreover, the slope variables are assumed exogenous. Since the model does not handle
#' "level 1 dependencies", an additional assumption is that the level 1 structural error is uncorrelated with any of the regressors.
#' If this assumption is not met, additional, external instruments are necessary.
#' The coefficients of the explanatory variables appear in the vectors \ifelse{html}{\out{&beta;<sub>1</sub>}}{\eqn{\beta_{1}}},
#' \ifelse{html}{\out{&beta;<sub>2</sub>}}{\eqn{\beta_{2}}} and \ifelse{html}{\out{&beta;<sub>3</sub>}}{\eqn{\beta_{3}}}.
#' The term \ifelse{html}{\out{&beta;<sup>1</sup><sub>cs</sub>}}{\eqn{\beta^{1}_{cs}}} captures latent, unobserved characteristics that are classroom and student specific
#' while \ifelse{html}{\out{&beta;<sup>2</sup><sub>c</sub>}}{\eqn{\beta^{2}_{c}}} captures latent,
#' unobserved characteristics that are classroom specific. For identification, the disturbance term
#' \ifelse{html}{\out{&epsilon;<sub>cst</sub>}}{\eqn{\epsilon_{cst}}} is assumed independent of
#' the other variables, \ifelse{html}{\out{Z<sup>1</sup><sub>cst</sub>}}{\eqn{Z^{1}_{cst}}} and \ifelse{html}{\out{X<sup>1</sup><sub>cst</sub>}}{\eqn{X^{1}_{cst}}}.
#' When all model variables are assumed exogenous, the GMM estimator is the usual GLS estimator, denoted as REF. When all variables (except the variables used as slope)
#' are assumed endogenous, the fixed-effects estimator is used, FE. While REF assumes all explanatory variables are uncorrelated with the
#' random intercepts and slopes in the model, FE allows for endogeneity of all effects but sweeps out the random components as well as the
#' explanatory variables at the same levels. The more general estimator GMM proposed by Kim and Frees (2007) allows for some of the explanatory
#' variables to be endogenous and uses this information to build instrumental variables. The multilevel GMM estimator uses both the between and
#' within variations of the exogenous variables, but only the within variation of the variables assumed endogenous. When all variables are assumed
#' exogenous, GMM estimator equals REF. When all covariates are assume endogenous, GMM equals FE.
#' }
#'
#'\subsection{Formula parameter}{
#'
#' The \code{formula} argument follows a two part notation:
#'
#' In the first part, the model is specified while in the second part, the endogenous regressors are indicated.
#' These two parts are separated by a single vertical bar (\code{|}).
#'
#' The first RHS follows the exact same model specification as required by the \code{\link[lme4]{lmer}}
#' function of package \code{lme4} and internally will be used to fit a \code{lmer} model. In the second part,
#' one or multiple endogenous regressors are indicated by passing them to the special function \code{endo}
#' (e.g. \code{endo(X1, X2)}). Note that no argument to \code{endo()} is to be supplied as character
#' but as symbols without quotation marks.
#'
#' See the example section for illustrations on how to specify the \code{formula} parameter.
#' }
#'
#' @seealso \code{\link[lme4]{lmer}} for more details on how to specify the \code{formula} parameter
#' @seealso \code{\link[lme4]{lmerControl}} for more details on how to provide the \code{lmer.control} parameter
#' @seealso \code{\link[REndo:summary.rendo.multilevel]{summary}} for how fitted models are summarized
#'
#' @return
#' \code{multilevelIV} returns an object of class "\code{rendo.multilevel}".
#'
#' The generic accessor functions \code{coef}, \code{fitted}, \code{residuals}, \code{vcov}, \code{confint}, and \code{nobs}, are available.
#' Note that an additional argument \code{model} with possible values \code{"REF", "FE_L2", "FE_L3", "GMM_L2"}, or \code{"GMM_L3"} is
#' available for \code{\link[REndo:summary.rendo.multilevel]{summary}}, \code{fitted}, \code{residuals}, \code{confint}, and \code{vcov}
#' to extract the features for the specified model.
#'
#' Note that the obtained coefficients are rounded with \code{round(x, digits=getOption("digits"))}.
#'
#'
#' An object of class \code{rendo.multilevel} is returned that is a list and contains the following components:
#'
#' \item{formula}{the formula given to specify the model to be fitted.}
#' \item{num.levels}{the number of levels detected from the model.}
#' \item{dt.model.data}{a data.table of model data including data for slopes and level group ids}
#' \item{coefficients}{a matrix of rounded coefficients, one column per model.}
#' \item{coefficients.se}{a matrix of coefficients' SE, one column per model.}
#' \item{l.fitted}{a named list which contains the fitted values per model sorted as the input data}
#' \item{l.residuals}{a named list which contains the residuals per model sorted as the input data}
#' \item{l.vcov}{a list of variance-covariance matrix, named per model.}
#' \item{V}{the variance–covariance matrix V of the disturbance term.}
#' \item{W}{the weight matrix W, such that W=V^(-1/2) per highest level group.}
#' \item{l.ovt}{a list of results of the Hausman OVT, named per model.}
#'
#' @examples
#' \donttest{
#' data("dataMultilevelIV")
#'
#' # Two levels
#' res.ml.L2 <- multilevelIV(y ~ X11 + X12 + X13 + X14 + X15 + X21 + X22 + X23 + X24 + X31 +
#'                               X32 + X33 + (1|SID) | endo(X15),
#'                           data = dataMultilevelIV, verbose = FALSE)
#'
#' # Three levels
#' res.ml.L3 <- multilevelIV(y ~ X11 + X12 + X13 + X14 + X15 + X21 + X22 + X23 + X24 + X31 +
#'                               X32 + X33 + (1| CID) + (1|SID) | endo(X15),
#'                           data = dataMultilevelIV, verbose = FALSE)
#'
#'
#' # L2 with multiple endogenous regressors
#' res.ml.L2 <- multilevelIV(y ~ X11 + X12 + X13 + X14 + X15 + X21 + X22 + X23 + X24 + X31 +
#'                               X32 + X33 + (1|SID) | endo(X15, X21, X22),
#'                           data = dataMultilevelIV, verbose = FALSE)
#'
#' # same as above
#' res.ml.L2 <- multilevelIV(y ~ X11 + X12 + X13 + X14 + X15 + X21 + X22 + X23 + X24 + X31 +
#'                               X32 + X33 + (1|SID) | endo(X15, X21) + endo(X22),
#'                           data = dataMultilevelIV, verbose = FALSE)
#'
#' # Fit above model with different settings for lmer()
#' lmer.control <- lme4::lmerControl(optimizer="nloptwrap",
#'                                   optCtrl=list(algorithm="NLOPT_LN_COBYLA",
#'                                                xtol_rel=1e-6))
#' res.ml.L2.cob <- multilevelIV(y ~ X11 + X12 + X13 + X14 + X15 + X21 + X22 + X23 + X24 +
#'                                   X31 + X32 + X33 + (1|SID) | endo(X15, X21) + endo(X22),
#'                               data = dataMultilevelIV, verbose = FALSE,
#'                               lmer.control = lmer.control) # use different controls for lmer
#'
#'
#' # specify argument "model" in the S3 methods to obtain results for the respective model
#' # default is "REF" for all methods
#'
#' summary(res.ml.L3)
#' # same as above
#' summary(res.ml.L3, model = "REF")
#'
#' # complete pval table for L3 fixed effects
#' L3.FE.p <- coef(summary(res.ml.L3, model = "FE_L3"))
#'
#' # variance covariance matrix
#' L2.FE.var  <- vcov(res.ml.L2, model = "FE_L2")
#' L2.GMM.var <- vcov(res.ml.L2, model = "GMM_L2")
#' # residuals
#' L3.REF.resid <- resid(res.ml.L3, model = "REF")
#' }
#'
#' @references
#' Hausman J (1978). “Specification Tests in Econometrics.” Econometrica, 46(6), 1251–1271.
#'
#' Kim, Jee-Seon and Frees, Edward W. (2007). "Multilevel Modeling with Correlated Effects". Psychometrika, 72(4), 505-533.
#'
#' @importFrom lme4 lmer VarCorr lFormula nobars lmerControl
#' @importFrom Formula as.Formula
#' @importFrom data.table as.data.table
#' @export
multilevelIV <- function(formula, data, lmer.control=lmerControl(optimizer = "Nelder_Mead", optCtrl=list(maxfun=100000)), verbose=TRUE){
  # As default optimizer for consistent estimates use NelderMead (100k evals to be sure)

  .SD <- NULL # cran silence

  cl <- match.call()

  # Model Definitions -----------------------------------------------------------
  # (1) = level-1 model
  #   (1)_sc = variability at child level
  #
  # (2) = level-2 model
  #   Z(2)_sc, X(2)_sc=child or school and do not vary over time
  #
  # (3) = level-3 model = variability at school level
  #   X(3)_s = depend on school
  #
  # Define:
  #   Z_2sct = Z(1)_sct
  #   Z_3sct = Z(1)_sctZ(2)_sc
  #   Z_3s = stacked Z_3sct
  #   X_sct = (X(1)_sct:Z_2sct)
  #
  #   X(1)=stacked X(1)_sc

  # Check input -----------------------------------------------------------------
  check_err_msg(checkinput_multilevel_formula(formula=formula))
  check_err_msg(checkinput_multilevel_data(data=data))
  check_err_msg(checkinput_multilevel_dataVSformula(formula=formula, data=data))
  check_err_msg(checkinput_multilevel_lmercontrol(lmer.control=lmer.control))
  check_err_msg(checkinput_multilevel_verbose(verbose = verbose))


  # Extract information ---------------------------------------------------------
  F.formula <- Formula::as.Formula(formula)
  f.lmer    <- formula(F.formula, lhs = 1, rhs = 1)
  names.endo <- formula_readout_special(F.formula = F.formula, name.special = "endo",
                                       from.rhs = 2, params.as.chars.only = TRUE)

  # Let lme4 do the formula processing
  l4.form    <- lme4::lFormula(formula = f.lmer, data=data)
  num.levels <- lme4formula_get_numberoflevels(l4.form)

  # Build model data ------------------------------------------------------------
  #
  # data.table to split into groupwise matrices
  #
  # Consists of:
  #   Response
  #   Intercept, if needed
  #   Model variables
  #   slope variables
  #   L2 / L3 Group Ids
  #   rownames, always
  #
  #   model.matrix: intercept + model vars + slope vars
  #   model.frame:  response + group ids
  #
  #   Rownames: There always are at least the standard rownames in a data.frame and it cannot
  #               be distinguished between real and standard rownames. Therefore, and to mimic
  #               stats::lm, rownames are always saved here and then added back to the results

  #
  # dependent variable
  dt.response <- as.data.table(l4.form$fr[, 1, drop = FALSE], keep.rownames = "rownames")
  name.y      <- colnames(l4.form$fr)[1]  # always at first position, same as model.response reads out

  #
  # FE / X
  #   X1 is everything but endogenous
  dt.FE    <- as.data.table(l4.form$X)
  names.X  <- colnames(dt.FE)
  names.X1 <- setdiff(names.X, names.endo)

  #
  # Slopes / Z
  #   Has to be taken from X (model matrix) in case of factors in slope
  #   Z names cannot be read as depends on number of levels
  dt.slp <- as.data.table(l4.form$X[, unique(unlist(l4.form$reTrms$cnms)), drop=FALSE])


  #
  # Group ids
  #   Do not copy from input data but from model.frame because the user could base it on
  #     transformation (ie y~X+(1|I(id == "abc"))) and because row sorting could be different
  dt.groudids <- as.data.table(l4.form$fr[, unique(unlist(names(l4.form$reTrms$cnms))), drop=FALSE])

  #
  # Make single data.table with minimum required cols only
  names.min.req.cols <- unique(c(colnames(dt.response), colnames(dt.FE),
                                 colnames(dt.slp), colnames(dt.groudids)))
  dt.model.data <- cbind(dt.response, dt.FE, dt.slp, dt.groudids)[, .SD, .SDcols = names.min.req.cols]
  rm(dt.response, dt.FE, dt.slp, dt.groudids)


  # Fit REML -----------------------------------------------------------------------------------
  # p515: "With the transformed data, we may apply any of the usual procedures for
  #         estimating variance components, including maximum likelihood, restricted maximum likelihood (REML)"
  #
  # Same for L2 and L3 cases and doing it here avoids passing and processing the formula and original data
  #   in the level functions
  # Has to use the original data because dt.model.data already contains the data
  #   with applied transformations

  if(verbose)
    message("Fitting linear mixed-effects model ",format(f.lmer),".")

  res.lmer <- tryCatch(lme4::lmer(formula = f.lmer, data=data, REML = TRUE,
                                  control = lmer.control),
                       error = function(e)return(e))
  if(is(res.lmer, "error"))
    check_err_msg(paste0("lme4::lmer() could not be fitted with error: ",sQuote(res.lmer$message), "\nPlease revise your data and formula."))

  res.VC <- tryCatch(lme4::VarCorr(res.lmer),
                     error = function(e)return(e))
  if(is(res.VC, "error"))
    check_err_msg(paste0("lme4::VarCorr() could not be fitted with error: ", sQuote(res.VC$message), "\nPlease revise your data and formula."))


  # Fit multilevel model -----------------------------------------------------------------------

  if(num.levels == 2){
    # Read Z names here as depends on num.levels
    name.groupid.L2  <- names(l4.form$reTrms$cnms)[[1]]
    names.Z2         <- l4.form$reTrms$cnms[[name.groupid.L2]]

    res <- multilevel_2levels(cl = cl, f.orig = formula, dt.model.data = dt.model.data, res.VC = res.VC,
                              name.group.L2 = name.groupid.L2, name.y = name.y, names.X = names.X,
                              names.X1 = names.X1, names.Z2 = names.Z2,
                              verbose = verbose)
  } else{

    # Read out the level names
    #   cnms is list sorted by cluster size with smalles cluster (=L2) first
    name.groupid.L2  <- names(l4.form$reTrms$cnms)[[1]]
    name.groupid.L3  <- names(l4.form$reTrms$cnms)[[2]]
    names.Z2         <- l4.form$reTrms$cnms[[name.groupid.L2]]
    names.Z3         <- l4.form$reTrms$cnms[[name.groupid.L3]]

    res <- multilevel_3levels(cl = cl, f.orig = formula, dt.model.data = dt.model.data, res.VC = res.VC,
                                name.group.L2 = name.groupid.L2, name.group.L3 = name.groupid.L3,
                                name.y = name.y, names.X = names.X, names.X1 = names.X1,
                                names.Z2 = names.Z2, names.Z3 = names.Z3,
                                verbose = verbose)
  }

  # Sort fitted and residuals by original input data (rownames)
  res$l.fitted    <- lapply(res$l.fitted,    function(fit){fit[rownames(data)]})
  res$l.residuals <- lapply(res$l.residuals, function(resid){resid[rownames(data)]})

  return(res)
}
Any scripts or data that you put into this service are public.
REndo documentation built on July 3, 2024, 1:06 a.m.
rdrr.io home R language documentation Run R code online
CRAN packages Bioconductor packages R-Forge packages GitHub packages
Note that we can't provide technical support on individual packages. You should contact the package authors for that.
REndo
Fitting Linear Models with Endogenous Regressors using Latent Instrumental Variables

R/f_multilevelIV.R
In REndo: Fitting Linear Models with Endogenous Regressors using Latent Instrumental Variables

Defines functions multilevelIV

Documented in multilevelIV

Try the REndo package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

REndo Fitting Linear Models with Endogenous Regressors using Latent Instrumental Variables

R/f_multilevelIV.R In REndo: Fitting Linear Models with Endogenous Regressors using Latent Instrumental Variables

Defines functions multilevelIV

Documented in multilevelIV

Try the REndo package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

REndo
Fitting Linear Models with Endogenous Regressors using Latent Instrumental Variables

R/f_multilevelIV.R
In REndo: Fitting Linear Models with Endogenous Regressors using Latent Instrumental Variables