spflow: Spatial Econometric Interaction Models

Documented in spflow

#' @title
#' Estimate spatial interaction models that incorporate spatial dependence
#'
#' @description
#' We implement three different estimators of spatial econometric interaction
#' models \insertCite{Dargel2021}{spflow} that allow the user to estimate
#' origin-destination flows with spatial autocorrelation.
#'
#' By default the estimation will include spatial dependence in the dependent
#' variable and the explanatory variables, which leads to the spatial Durbin
#' model (SDM) \insertCite{Anselin1988}{spflow}.
#' Moreover, the model includes an additional set of parameters for intra
#' regional flows that start and end in the same geographic site (as proposed
#' by \insertCite{LeSage2009;textual}{spflow}).
#' Both default options can be deactivated via the `estimation_control` argument,
#' which gives fine grained control over the estimation.
#'
#'
#' @param spflow_formula
#'   A formula specifying the spatial interaction model (for details see
#'   section Formula interface)
#' @param spflow_networks
#'   A [spflow_network_multi()] object that contains information on the
#'   origins, the destinations and their neighborhood structure
#' @param id_net_pair
#'   A character indicating the id of a [spflow_network_pair()] (only relevant if
#'   the [spflow_network_multi()] contains multiple `spflow_network_pair`-objects:
#'   defaults to the of them)
#' @param estimation_control
#'   A list generated by [spflow_control()] that provides fine grained control
#'   over the estimation procedure
#' @return An S4 class of type [spflow_model-class()]
#'
#'
#' @section Details:
#' Our estimation procedures makes use of the matrix formulation introduced by
#' \insertCite{LeSage2008;textual}{spflow} and further developed by
#' \insertCite{Dargel2021;textual}{spflow} to reduce the computational
#' effort and memory requirements.
#' Further generalizations to deal with non-cartesian and rectangular flows
#' are developed by \insertCite{Dargel2023;textual}{spflow}.
#'
#' The estimation procedure can be adjusted through the `estimation_method`
#' argument in [spflow_control()].
#'
#' ## Maximum likelihood estimation (MLE)
#' Maximum likelihood estimation is the default estimation procedure.
#' The matrix form estimation in the framework of this model was first
#' developed by \insertCite{LeSage2008;textual}{spflow} and then improved by
#' \insertCite{Dargel2021;textual}{spflow}.
#'
#' ## Spatial two-stage least squares (S2SLS)
#' The S2SLS estimator is an adaptation of the one proposed by
#' \insertCite{Kelejian1998;textual}{spflow}, to the case of origin-destination
#' flows, with up to three neighborhood matrices
#' \insertCite{Dargel2021;textual}{spflow}.
#' A similar estimation is done by \insertCite{Tamesue2016;textual}{spflow}.
#' The user can activate the S2SLS estimation via the `estimation_control` argument
#' using the input `spflow_control(estimation_method = "s2sls")`.
#'
#' ## Bayesian Markov Chain Monte Carlo (MCMC)
#' The MCMC estimator is based on the ideas of
#' \insertCite{LeSage2009;textual}{spflow} and incorporates the improvements
#' proposed in \insertCite{Dargel2021;textual}{spflow}.
#' The estimation is based on a tuned Metropolis-Hastings sampler for the
#' auto-regressive parameters, and for the remaining parameters it uses Gibbs
#' sampling.
#' The routine uses 5500 iterations of the sampling procedure and considers the
#' first 2500 as burn-in period.
#' The user can activate the S2SLS estimation via the `estimation_control` argument
#' using the input `spflow_control(estimation_method = "mcmc")`.
#'
#' ## Formula interface
#' The function offers a formula interface adapted to spatial interaction
#' models, which has the following structure:
#' `Y ~ O_(X1) + D_(X2) + I_(X3) + P_(X4)`
#' This structure reflects the different data sources involved in such a model.
#' On the left hand side there is the independent variable `Y` which
#' corresponds to the vector of flows.
#' On the right hand side we have all the explanatory variables.
#' The functions `O_(...)` and `D_(...)` indicate which variables are used as
#' characteristics of the origins and destinations respectively.
#' Similarly, `I_(...)` indicates variables that should be used for the
#' intra-regional parameters.
#' Finally, `P_(...)` declares which variables describe origin-destination
#' pairs, which most frequently will include a measure of distance.
#'
#' All the declared variables must be available in the provided
#' [spflow_network_multi()] object, which gathers information on the origins and
#' destinations (inside [spflow_network()] objects), as well as the
#' information on the origin-destination pairs (inside a [spflow_network_pair()]
#' object).
#'
#' Using the short notation `Y ~ .` is possible and will be interpreted as
#' usual, in the sense that we use all variables that are available for each
#' data source.
#' Also mixed formulas, such as `Y ~ . + P_(log(X4) + 1)`, are possible.
#' When the dot shortcut is combined with explicit declaration, it will only be
#' used for the non declared data sources.
#' The following examples illustrate this behavior.
#'
#' ## Formula interface (examples)
#'
#' Consider the case where we have the flow vector `Y` and the distance vector
#' `DIST` available as information on origin-destination pairs.
#' In addition we have  the explanatory variables `X1, X2` and `X3` which
#' describe the regions that are at the same time origins and destinations of
#' the flows.
#'
#' For this example the four formulas below are equivalent and make use of all
#' explanatory variables `X1, X2` and `X3` for origins, destinations and
#' intra-regional observations.
#'
#' - `Y ~ .`
#' - `Y ~ . + P_(DIST)`
#' - `Y ~ X1 + X2 + X3 + P_(DIST)`
#' - `Y ~ D_(X1 + X2 + X3) + O_(X1 + X2 + X3) + I_(X1 + X2 + X3)  + P_(DIST)`
#'
#' Now if we only want to use X1 for the intra-regional model we can do the
#' following (again all four options below are equivalent).
#'
#' - `Y ~ . + I_(X1)`
#' - `Y ~ . + I_(X1) + P_(DIST)`
#' - `Y ~  X1 + X2 + X3 + I_(X1) + P_(DIST)`
#' - `Y ~ D_(X1 + X2 + X3) + O_(X1 + X2 + X3) + I_(X1)  + P_(DIST)`
#'
#' This behavior is easily combined with transformation of variables as the
#' two equivalent options below illustrate.
#'
#' - `log(Y + 1) ~ sqrt(X1) +  X2 + P_(log(DIST + 1))`
#' @examples
#'
#' # Estimate flows between the states of Germany
#' spflow(spflow_formula = y9 ~ . + P_(DISTANCE),
#'        spflow_networks = multi_net_usa_ge,
#'        id_net_pair = "ge_ge")
#'
#' # Same as above with explicit declaration of variables...
#' # ... X is the only variable available
#' # ... it is used for origins, destination and intra-state flows
#' spflow(spflow_formula = y9 ~ X + P_(DISTANCE),
#'        spflow_networks = multi_net_usa_ge,
#'        id_net_pair = "ge_ge")
#'
#' # Same as above
#' spflow(spflow_formula = y9 ~ O_(.) + D_(.) + I_(.) + P_(DISTANCE),
#'        spflow_networks = multi_net_usa_ge,
#'        id_net_pair = "ge_ge")
#'
#' # Same as above
#' spflow(spflow_formula = y9 ~ O_(X) + D_(X) + I_(X) + P_(DISTANCE),
#'        spflow_networks = multi_net_usa_ge,
#'        id_net_pair = "ge_ge")
#'
#'
#' @references \insertAllCited{}
#' @seealso [spflow_control()] [spflow_network_classes()]
#' @author Lukas Dargel
#' @export
spflow <- function(
    spflow_formula,
    spflow_networks,
    id_net_pair = id(spflow_networks)[["pairs"]][[1]],
    estimation_control = spflow_control()) {

  ## check and correct input arguments
  assert_is(spflow_formula, "formula")
  assert_is_single_x(id_net_pair, "character")

  od_ids <- id(pull_member(spflow_networks, id_net_pair))
  estimation_control <- enhance_spflow_control(
    estimation_control = estimation_control,
    is_within = od_ids["orig"] == od_ids["dest"])

  spflow_networks <- validate_networks(
    spflow_networks = spflow_networks,
    id_net_pair =  id_net_pair,
    model = estimation_control[["model"]])


  spflow_matrices <- derive_spflow_matrices(
    id_net_pair = id_net_pair,
    spflow_networks = spflow_networks,
    spflow_formula = spflow_formula,
    spflow_control = estimation_control,
    na_rm = estimation_control[["na_rm"]])
  spflow_indicators <- spflow_matrices[["spflow_indicators"]]
  spflow_matrices[["spflow_indicators"]] <- NULL

  spflow_nbfunctions <- derive_spflow_nbfunctions(
    OW = neighborhood(spflow_networks, od_ids["orig"]),
    DW = neighborhood(spflow_networks, od_ids["dest"]),
    estimation_control = estimation_control,
    spflow_indicators = spflow_indicators)

  spflow_obs <- spflow_indicators2obs(spflow_indicators)
  spflow_moments <- derive_spflow_moments(
    spflow_matrices = spflow_matrices,
    n_o = spflow_obs[["N_orig"]],
    n_d = spflow_obs[["N_dest"]],
    N = spflow_obs[["N_sample"]],
    wt = spflow_indicators2wtmat(spflow_indicators))

  estimation_results <- spflow_estimation(
    spflow_moments = spflow_moments,
    spflow_nbfunctions = spflow_nbfunctions,
    estimation_control = estimation_control)

  estimation_results <- spflow_post_estimation(
    estimation_results,
    spflow_networks = spflow_networks,
    spflow_matrices = spflow_matrices,
    spflow_moments = spflow_moments,
    spflow_indicators = spflow_indicators,
    spflow_formula = spflow_formula)

  return(estimation_results)
}


#' @keywords internal
parameter_names <- function(
    model_matrices,
    model) {

  names_rho <- define_spatial_lag_params(model)
  names_const <- c("(Intercept)", "(Intra)")
  use_const <- c(model_matrices$constants$global == 1,
                 !is.null(model_matrices$constants$intra$In))
  names_const <- names_const[use_const]

  x_prefs <- list("D_" = "DEST_","O_" = "ORIG_","I_" = "INTRA_")
  names_X <- lapply(compact(model_matrices[names(x_prefs)]), "colnames")
  names_X <- Map("paste0", x_prefs[names(names_X)], names_X)
  names_X <- unlist(names_X, use.names = FALSE)

  names_G <- names(model_matrices$G)
  export_names <- c(names_rho,names_const,names_X,names_G)

  return(export_names)

}

#' @keywords internal
drop_instruments <- function(model_matrices) {

  filter_inst_list <-
    function(l) { Filter(Negate(attr_inst_status), model_matrices[[mm]]) }
  mm <- "CONST"
  model_matrices[[mm]] <- filter_inst_list(model_matrices[[mm]])
  mm <- "P_"
  model_matrices[[mm]] <- filter_inst_list(model_matrices[[mm]])

  filter_inst_mat <-
    function(mat) {
      if (is.null(mat))
        return(NULL)
      is_inst <- attr_inst_status(mat)
      mat <- mat[,!is_inst, drop = FALSE]
      attr_inst_status(mat) <- is_inst[!is_inst]
      mat
    }
  mm <- "D_"
  model_matrices[[mm]] <- filter_inst_mat(model_matrices[[mm]])
  mm <- "O_"
  model_matrices[[mm]] <- filter_inst_mat(model_matrices[[mm]])
  mm <- "I_"
  model_matrices[[mm]] <- filter_inst_mat(model_matrices[[mm]])

  return(model_matrices)
}

#' @keywords internal
sp_model_type <- function(cntrl) {
  has_lagged_x <- cntrl$sdm_variables != "none"

  if (cntrl$model == "model_1")
    model_type <- ifelse(has_lagged_x,"SLX","OLM")

  if (cntrl$model != "model_1")
    model_type <- ifelse(has_lagged_x,"SDM","LAG")

  return(model_type)
}


#' @keywords internal
derive_spflow_nbfunctions <- function(
    OW,
    DW,
    estimation_control,
    spflow_indicators) {


  if (estimation_control[["model"]] == "model_1")
    return(NULL)

  pspace_validator <- derive_pspace_validator(
    OW_character = attr_spectral_character(OW),
    DW_character = attr_spectral_character(DW),
    model = estimation_control[["model"]],
    estimation_method = estimation_control[["estimation_method"]])

  if (estimation_control[["estimation_method"]] == "s2sls")
    return(list("pspace_validator" = pspace_validator))

  logdet_calculator <- derive_logdet_calculator(
    OW = OW,
    DW = DW,
    model = estimation_control[["model"]],
    n_o = nlevels(spflow_indicators[[2]]),
    n_d = nlevels(spflow_indicators[[1]]),
    approx_order = estimation_control[["logdet_approx_order"]],
    M_indicator = spflow_indicators2mat(spflow_indicators, "HAS_ZZ"))

  return(list("logdet_calculator" = logdet_calculator,
              "pspace_validator" = pspace_validator))
  }


#' @keywords internal
derive_pspace_validator <- function(
  OW_character,
  DW_character,
  model,
  estimation_method) {

  model_num <- as.numeric(substr(model,7,7))
  if (model_num == 1)
    return(NULL)

  req_OW <- model_num %in% c(3:9)
  req_DW <- model_num %in% c(2,4:9)

  check <- "
  The validity of the the parameter space cannot be validated because the
  neighborhood matrix has complex eigen values."
  has_complex_evs <- function(x) any(Im(x[c("LR","SR")]) != 0)
  if ((req_OW & has_complex_evs(OW_character)) ||
      (req_DW & has_complex_evs(DW_character))) {
    assert(FALSE, check, warn = TRUE)
    return(function(...) FALSE)
  }

  DWmax <- as.numeric(Re(DW_character["LR"])) %T% req_DW
  DWmin <- as.numeric(Re(DW_character["SR"])) %T% req_DW
  OWmax <- as.numeric(Re(OW_character["LR"])) %T% req_OW
  OWmin <- as.numeric(Re(OW_character["SR"])) %T% req_OW
  mod_params <- switch(model, "model_5" = "model_7", "model_6" = "model_9", model)
  rho_names <- define_spatial_lag_params(mod_params)
  rho_scale <- switch(model, "model_5" = 1/2, "model_6" = 1/3, 1)

  WF_eigen_part <- rbind(
    "max_max" = c("rho_d" = DWmax, "rho_o" = OWmax, "rho_w" =  OWmax*DWmax),
    "max_min" = c("rho_d" = DWmax, "rho_o" = OWmin, "rho_w" =  OWmin*DWmax),
    "min_max" = c("rho_d" = DWmin, "rho_o" = OWmax, "rho_w" =  OWmax*DWmin),
    "min_min" = c("rho_d" = DWmin, "rho_o" = OWmin, "rho_w" =  OWmin*DWmin)
    )[,rho_names, drop = FALSE]




  validate_fun <- function(rho){
    rho <- lookup(values = rho * rho_scale, names = rho_names) # trick for model 5 and 6
    WF_eigen_range <- range(rowSums(WF_eigen_part %*% diag(rho, length(rho))))
    valid_upper_bound <- WF_eigen_range[2] < 1
    valid_lower_bound <- WF_eigen_range[1] > -1 | estimation_method == "s2sls"
    return(valid_upper_bound & valid_lower_bound)
  }
  return(validate_fun)
}

#' @keywords internal
validate_networks <- function(
    spflow_networks,
    id_net_pair,
    model,
    do_normalisation) {

  assert_is_single_x(id_net_pair, "character")
  assert_is(spflow_networks, "spflow_network_multi")
  assert(id_net_pair %in% id(spflow_networks)[["pairs"]],
         'The the spflow_network_pair with id "%s" are not available!',
         id_net_pair)

  od_ids <- id(spflow_networks@pairs[[id_net_pair]])

  remove_pairs <- id(spflow_networks)[["pairs"]] != id_net_pair
  remove_nodes <- !id(spflow_networks)[["nodes"]] %in% od_ids[c("orig", "dest")]
  spflow_networks@pairs[remove_pairs] <- NULL
  spflow_networks@nodes[remove_nodes] <- NULL

  check_msg <- "For model_%s the neighborhood of %s must be available!"
  model_num <- as.numeric(substr(model,7,7))
  miss_OW <- model_num %in% c(3:9) & is.null(neighborhood(spflow_networks, od_ids["orig"]))
  miss_DW <- model_num %in% c(2,4:9) & is.null(neighborhood(spflow_networks, od_ids["dest"]))
  assert(!miss_OW, check_msg, model_num, od_ids["orig"])
  assert(!miss_DW, check_msg, model_num, od_ids["dest"])
  return(spflow_networks)
}