R/spflow.R
In spflow: Spatial Econometric Interaction Models

Documented in spflow

#' @title
#' Estimate spatial interaction models that incorporate spatial dependence
#'
#' @description
#' We implement three different estimators of spatial econometric interaction
#' models \insertCite{Dargel2021}{spflow} that allow the user to estimate
#' origin-destination flows with spatial autocorrelation.
#'
#' By default the estimation will include spatial dependence in the dependent
#' variable and the explanatory variables which leads to the spatial Durbin
#' model (SDM) \insertCite{Anselin1988}{spflow}.
#' Moreover, the model includes an additional set of parameters for intra
#' regional flows that start and end in the same geographic site (as proposed
#' by \insertCite{LeSage2009;textual}{spflow}).
#' Both default options can be deactivated via the `flow_control` argument,
#' which gives fine grained control over the estimation.
#'
#'
#' @param flow_formula
#'   A formula specifying the spatial interaction model (for details see
#'   section Formula interface)
#' @param sp_multi_network
#'   A [sp_multi_network()] object that contains information on the origins,
#'   and the destinations and their neighborhood structure
#' @param network_pair_id
#'   A character indicating the id of a [sp_network_pair()] (only relevant if
#'   the [sp_multi_network()] contains multiple `sp_network_pair`-objects:
#'   defaults to the of them)
#' @param flow_control
#'   A list generated by [spflow_control()] that provides fine grained control
#'   over the estimation procedure
#' @return An S4 class of type [spflow_model-class()]
#'
#'
#' @section Details:
#' Our estimation procedures makes use of the matrix formulation introduced by
#' \insertCite{LeSage2008;textual}{spflow} and further developed by
#' \insertCite{Dargel2021;textual}{spflow} to reduce the computational
#' effort and memory requirements.
#' The estimation procedure can be adjusted through the `estimation_method`
#' argument in [spflow_control()].
#'
#' ## Maximum likelihood estimation (MLE)
#' Maximum likelihood estimation is the default estimation procedure.
#' The matrix form estimation in the framework of this model was first
#' developed by \insertCite{LeSage2008;textual}{spflow} and then improved by
#' \insertCite{Dargel2021;textual}{spflow}.
#'
#' ## Spatial two-stage least squares (S2SLS)
#' The S2SLS estimator is an adaptation of the one proposed by
#' \insertCite{Kelejian1998;textual}{spflow}, to the case of origin-destination
#' flows, with up to three neighborhood matrices
#' \insertCite{Dargel2021;textual}{spflow}.
#' A similar estimation is done by \insertCite{Tamesue2016;textual}{spflow}.
#' The user can activate the S2SLS estimation via the `flow_control` argument
#' using the input `spflow_control(estimation_method = "s2sls")`.
#'
#' ## Bayesian Markov Chain Monte Carlo (MCMC)
#' The MCMC estimator is based on the ideas of
#' \insertCite{LeSage2009;textual}{spflow} and incorporates the improvements
#' proposed in \insertCite{Dargel2021;textual}{spflow}.
#' The estimation is based on a tuned Metropolis-Hastings sampler for the
#' auto-regressive parameters, and for the remaining parameters it uses Gibbs
#' sampling.
#' The routine uses 5500 iterations of the sampling procedure and considers the
#' first 2500 as burn-in period.
#' The user can activate the S2SLS estimation via the `flow_control` argument
#' using the input `spflow_control(estimation_method = "mcmc")`.
#'
#' ## Formula interface
#' The function offers a formula interface adapted to spatial interaction
#' models, which has the following structure:
#' `Y ~ O_(X1) + D_(X2) + I_(X3) + G_(X4)`
#' This structure reflects the different data sources involved in such a model.
#' On the left hand side there is the independent variable `Y` which
#' corresponds to the vector of flows.
#' On the right hand side we have all the explanatory variables.
#' The functions `O_(...)` and `D_(...)` indicate which variables are used as
#' characteristics of the origins and destinations respectively.
#' Similarly, `I_(...)` indicates variables that should be used for the
#' intra-regional parameters.
#' Finally, `G_(...)` declares which variables describe origin-destination
#' pairs, which most frequently will include a measure of distance.
#'
#' All the declared variables must be available in the provided
#' [sp_multi_network()] object, which gathers information on the origins and
#' destinations (inside [sp_network_nodes()] objects), as well as the
#' information on the origin-destination pairs (inside a [sp_network_pair()]
#' object).
#'
#' Using the short notation `Y ~ .` is possible and will be interpreted as
#' usual, in the sense that we use all variables that are available for each
#' data source.
#' Also mixed formulas, such as `Y ~ . + G_(log(X4) + 1)`, are possible.
#' When the dot shortcut is combined with explicit declaration, it will only be
#' used for the non declared data sources.
#' The following examples illustrate this behaviour.
#'
#' ## Formula interface (examples)
#'
#' Consider the case where we have the flow vector `Y` and the distance vector
#' `DIST` available as information on origin-destination pairs.
#' In addition we have  the explanatory variables `X1, X2` and `X3` which
#' describe the regions that are at the same time origins and destinations of
#' the flows.
#'
#' For this example the four formulas below are equivalent and make use of all
#' explanatory variables `X1, X2` and `X3` for origins, destinations and
#' intra-regional observations.
#'
#' - `Y ~ .`
#' - `Y ~ . + G_(DIST)`
#' - `Y ~ X1 + X2 + X3 + G_(DIST)`
#' - `Y ~ D_(X1 + X2 + X3) + O_(X1 + X2 + X3) + I_(X1 + X2 + X3)  + G_(DIST)`
#'
#' Now if we only want to use X1 for the intra-regional model we can do the
#' following (again all four options below are equivalent).
#'
#' - `Y ~ . + I_(X1)`
#' - `Y ~ . + I_(X1) + G_(DIST)`
#' - `Y ~  X1 + X2 + X3 + I_(X1) + G_(DIST)`
#' - `Y ~ D_(X1 + X2 + X3) + O_(X1 + X2 + X3) + I_(X1)  + G_(DIST)`
#'
#' This behaviour is easily combined with transformation of variables as the
#' two equivalent options below illustrate.
#'
#' - `log(Y + 1) ~ sqrt(X1) +  X2 + G_(log(DIST + 1))`
#' @examples
#'
#' # Estimate flows between the states of Germany
#' spflow(flow_formula = y9 ~ . + G_(DISTANCE),
#'        sp_multi_network = multi_net_usa_ge,
#'        network_pair_id = "ge_ge")
#'
#' # Same as above with explicit declaration of variables...
#' # ... X is the only variable available
#' # ... it is used for origins, destination and intra-state flows
#' spflow(flow_formula = y9 ~ X + G_(DISTANCE),
#'        sp_multi_network = multi_net_usa_ge,
#'        network_pair_id = "ge_ge")
#'
#' # Same as above
#' spflow(flow_formula = y9 ~ O_(.) + D_(.) + I_(.) + G_(DISTANCE),
#'        sp_multi_network = multi_net_usa_ge,
#'        network_pair_id = "ge_ge")
#'
#' # Same as above
#' spflow(flow_formula = y9 ~ O_(X) + D_(X) + I_(X) + G_(DISTANCE),
#'        sp_multi_network = multi_net_usa_ge,
#'        network_pair_id = "ge_ge")
#'
#'
#' @references \insertAllCited{}
#' @seealso [spflow_control()] [spflow_network_classes()]
#' @export
spflow <- function(
  flow_formula,
  sp_multi_network,
  network_pair_id = id(sp_multi_network)[["network_pairs"]][[1]],
  flow_control = spflow_control()
) {

  ## check for abusive inputs and correct ids
  assert_is(flow_formula,"formula")
  assert_is(sp_multi_network,"sp_multi_network")

  pair_ids <- id(sp_multi_network)[["network_pairs"]]
  assert(valid_network_pair_id(network_pair_id),
         "The network_pair_id must be a character of length 1!")
  assert(network_pair_id %in% pair_ids,
         'The the network pair id "%s" is not available!',
         network_pair_id)

  ## validate (by calling again) then enrich control parameters
  flow_control <- do.call("spflow_control", flow_control)
  estim_control <- c(
    flow_control,
    list("model_type" = sp_model_type(flow_control)),
    matrix_form_control(pull_member(sp_multi_network, network_pair_id)))

  # below cases should become available in future versions of the package
  assert(is.null(estim_control$weight_variable), warn = TRUE,
         "Weighting of observations is not yet possible and will be ignored.")
  assert(estim_control$mat_complet == 1,
         "Estimation is (for now) only possible if the number of pairs is " %p%
         "excatly the number of origins multiplied by the number of " %p%
         "destinations!")
  assert(estim_control$mat_within,
         "Estimation of flows between two diffrent networks are " %p%
         "not yet available!")

  formula_parts <- interpret_flow_formula(
    flow_formula,
    flow_control)

  model_matrices <- spflow_model_matrix(
    sp_multi_network,
    network_pair_id,
    formula_parts,
    estim_control)

  model_moments <- spflow_model_moments(
    model_matrices = model_matrices,
    estim_control = estim_control)

  estimation_results <- spflow_model_estimation(
    model_moments,
    estim_control)

  estimation_results <- add_details(
    estimation_results,
    model_matrices = model_matrices,
    flow_control = flow_control,
    model_moments = model_moments)

  return(estimation_results)
}


#' @keywords internal
parameter_names <- function(
  model_matrices,
  model) {

  names_rho <- define_spatial_lag_params(model)
  names_const <- c("(Intercept)", "(Intra)")
  use_const <- c(model_matrices$constants$global == 1,
                 !is.null(model_matrices$constants$intra$In))
  names_const <- names_const[use_const]

  x_prefs <- list("D_" = "DEST_","O_" = "ORIG_","I_" = "INTRA_")
  names_X <- lapply(compact(model_matrices[names(x_prefs)]), "colnames")
  names_X <- Map("%p%", x_prefs[names(names_X)], names_X)
  names_X <- unlist(names_X, use.names = FALSE)

  names_G <- names(model_matrices$G)
  export_names <- c(names_rho,names_const,names_X,names_G)

  return(export_names)

}

#' @keywords internal
drop_instruments <- function(model_matrices) {

  # ... from constants
  filter_inst <- function(x) Filter(function(x) !attr_inst_status(x), x)
  constants <- list(
    "global" = model_matrices$constants$global,
    "intra" = filter_inst(model_matrices$constants$intra))

  # ... from site attributes
  filter_inst_col <- function(mat) mat[,!attr_inst_status(mat), drop = FALSE]
  vector_treatment <- c("D","O","I") %p% "_"
  matrices_X <- lapply(compact(model_matrices[vector_treatment]),
                       "filter_inst_col")

  # ... from pair attributes
  matrices_G <- lapply(model_matrices["G_"], "filter_inst")

  # ... combine cleaned versions
  matrices_and_spatial_weights <- c(
    model_matrices["Y_"],
    list("constants" = constants),
    matrices_X,
    matrices_G,
    compact(model_matrices[c("DW","OW")])
  )

  return(matrices_and_spatial_weights)
}

#' @keywords internal
sp_model_type <- function(cntrl) {
  has_lagged_x <- cntrl$sdm_variables != "none"

  if (cntrl$model == "model_1")
    model_type <- ifelse(has_lagged_x,"SLX","OLM")

  if (cntrl$model != "model_1")
    model_type <- ifelse(has_lagged_x,"SDM","LAG")

  return(model_type)
}