R/gergm.R

Defines functions gergm

Documented in gergm

#' @title A Function to estimate a GERGM.
#' @description The main function provided by the package.
#'
#' @param formula A formula object that specifies the relationship between
#' statistics and the observed network. Currently, the user may specify a model
#' using any combination of the following statistics: `out2stars(alpha = 1)`,
#' `in2stars(alpha = 1)`, `ctriads(alpha = 1)`, `mutual(alpha = 1)`,
#' `ttriads(alpha = 1)`, `absdiff(covariate = "MyCov")`,
#'  `sender(covariate = "MyCov")`,
#' `reciever(covariate = "MyCov")`, `nodematch(covariate)`,
#' `nodemix(covariate, base = "MyBase")`, `netcov(network)` and
#' `edges(alpha = 1, method = c("regression","endogenous"))`. If the user
#' specifies `nodemix(covariate, base = NULL)`, then all levels of the
#' covariate will be matched on. Note that the `edges` term must be specified if
#' the user wishes to include an intercept
#' (strongly recommended). The user may select the "regression" method
#' (default) to include an intercept in the lambda transformation of the
#' network, or "endogenous" to include the intercept as in a traditional ERGM
#' model.  To use exponential down-weighting for any of the network level terms,
#' simply specify a value for alpha less than 1. The `(alpha = 1)` term may be
#' omitted from the structural terms if no exponential down weighting is
#' required. In this case, the terms may be provided as: `out2star`, `in2star`,
#' `ctriads`, `mutual`, `ttriads`. If the network is undirected the user may only
#' specify the following terms: `twostars(alpha = 1)`,  `ttriads(alpha = 1)`,
#' `absdiff(covariate = "MyCov")`,
#' `sender(covariate = "MyCov")`, `nodematch(covariate)`,
#' `nodemix(covariate, base = "MyBase")`, `netcov(network)` and
#' `edges(alpha = 1, method = c("regression","endogenous"))`. In some cases,
#' the user may only wish to calculate endogenous statistics for edges between
#' some subset of the nodes in the network. For each of the endogenous
#' statistics, the user may optionally specify a `covariate` and `base` field
#' such as in `in2stars(covariate = "Type", base = "C")`. This will add an
#' in-2star statistic for the subnetwork defined by actors who match each level
#' of the categorical variable "Type" (in this example), and exclude the
#' subnetwork for type "C", if the `base` argument is provided. If the `base`
#' argument is excluded, then terms will be added to the model for all levels of
#' the statistic. This can be a useful option if the user believes that a
#' network property varies with some property of nodes.
#' @param covariate_data A data frame containing node level covariates the user
#' wished to transform into sender or receiver effects. It must have row names
#' that match every entry in colnames(raw_network), should have descriptive
#' column names.  If left NULL, then no sender or receiver effects will be
#' added.
#' @param beta_correlation_model Defaults to FALSE. If TRUE, then the beta
#' correlation model is estimated. A correlation network must be provided, but
#' all covariates and undirected statistics may be supplied as normal.
#' @param distribution_estimator Provides an option to estimate the structure of
#' row-wise marginal and joint distribtuions using a uniform-dirichlet proposal
#' distribution. THIS FEATURE IS EXPERIMENTAL. Defaults to "none", in which case
#' a normal GERGM is estimated, but can be set to one of "rowwise-marginal" and
#' "joint" to propose either row-wsie marginal distribtuions or joint
#' distributions. If an option other than "none" is selected, the beta correlation
#' model will be turned off, estimation will automatically be set to Metropolis,
#' no covariate data will be allowed, and the network will be set to
#' directed. Furthermore, a "diagonal" statistic will be added to the model which
#' simply records the sum of the diagonal of the network. The "mutual" statistic
#' will also be adapted to include the diagonal elements. In the future, more
#' statistics which take account of the network diagonal will be included.
#' @param network_is_directed Logical specifying whether or not the observed
#' network is directed. Default is TRUE.
#' @param include_diagonal Logical indicating whether the diagonal should be
#' included in the estimation proceedure. If TRUE, then a "diagonal" statistic
#' is added to the model. Defaults to FALSE.
#' @param number_of_networks_to_simulate Number of simulations generated for
#' estimation via MCMC. Default is 500.
#' @param MCMC_burnin Number of samples from the MCMC simulation procedure that
#' will be discarded before drawing the samples used for estimation.
#' Default is 100.
#' @param thin The proportion of samples that are kept from each simulation. For
#' example, thin = 1/200 will keep every 200th network in the overall simulated
#' sample. Default is 1.
#' @param proposal_variance The variance specified for the Metropolis Hastings
#' simulation method. This parameter is inversely proportional to the average
#' acceptance rate of the M-H sampler and should be adjusted so that the average
#' acceptance rate is approximately 0.25. Default is 0.1.
#' @param target_accept_rate The target Metropolis Hastings acceptance rate.
#' Defaults to 0.25
#' @param seed Seed used for reproducibility. Default is 123.
#' @param hyperparameter_optimization Logical indicating whether automatic
#' hyperparameter optimization should be used. Defaults to FALSE. If TRUE, then
#' the algorithm will automatically seek to find an optimal burnin and number of
#' networks to simulate, and if using Metropolis Hasings, will attempt to select
#' a proposal variance that leads to a acceptance rate within +-0.05 of
#' target_accept_rate. Furthermore, if degeneracy is detected, the algorithm
#' will attempt to adress the issue automatically. WARNING: This feature is
#' experimental, and may greatly increase runtime. Please monitor console
#' output!
#' @param convex_hull_proportion Defaults to 0.9. Must be a number
#' between 0 and 1, with values between 0.5 and 0.9 preferred. Setting a value
#' for this parameter makes use of the initialization method introduced by
#' Hummel, Hunter and Handcock (2012), which can provide a much more stable
#' initialization method than MPLE. Selecting this method will not supercede
#' other initialization methods, as it will be run after MPLE or grid search.
#' However, in practice, it should often preclude the need for other
#' initialization methods. If NULL, then standard MPLE an grid search methods
#' may be used. Not recommended
#' @param convex_hull_convergence_proportion Defaults to 0.9. This must be a
#' number between 0 and 1, with values between 0.7 and 0.95 preferred. This
#' parameter controls the stopping behavior of the convex hull initialization
#' proceedure, with a higher value requiring a better fit before moving to
#' standard estimation.
#' @param sample_edges_at_a_time Defaults to 0. If greater than zero, then this
#' is the number of edges to be updated at once during MCMCMLE. The lower this
#' number is set, the higher the Metropolis Hastings acceptance rate should be.
#' This option will primarily be relevant for large networks.
#' @param parallel Logical indicating whether the weighted MPLE objective and any
#' other operations that can be easily parallelized should be calculated in
#' parallel. Defaults to FALSE. If TRUE, a significant speedup in computation
#' may be possible.
#' @param parallel_statistic_calculation Logical indicating whether network
#' statistics should be calculated in parallel. This will tend to be slower for
#' networks with les than ~30 nodes but may provide a substantial speedup for
#' larger networks.
#' @param cores Numeric value defaulting to 1. Can be set to any number up to the
#' number of threads/cores available on your machine. Will be used to speed up
#' computations if parallel = TRUE.
#' @param use_stochastic_MH A logical indicating whether a stochastic approximation
#' to the h statistics should be used under Metropolis Hastings in-between
#' thinned samples. This may dramatically speed up estimation. Defaults to FALSE.
#' HIGHLY EXPERIMENTAL!
#' @param stochastic_MH_proportion Percentage of dyads/triads to use for
#' approximation, defaults to 0.25.
#' @param slackr_integration_list An optional list object that contains
#' information necessary to provide updates about model fitting progress to a
#' Slack channel (https://slack.com/). This can be useful if models take a long
#' time to run, and you wish to receive updates on their progress (or if they
#' become degenerate). The list object must be of the following form:
#' list(model_name = "descriptive model name", channel = "#yourchannelname",
#'  incoming_webhook_url = "https://hooks.slack.com/services/XX/YY/ZZ"). You
#'  will need to set up incoming webhook integration for your slack channel and
#'  then paste in the URL you get from slack into the incoming_webhook_url field.
#'  If all goes well, and the computer you are running the GERGM estimation on
#'  has internet access, your slack channel will receive updates when you start
#'  estimation, after each lambda/theta parameter update, if the model becomes
#'  degenerate, and when it completes running.
#' @param convergence_tolerance Threshold designated for stopping criterion. If
#' the difference of parameter estimates from one iteration to the next all have
#' a p -value (under a paired t-test) greater than this value, the parameter
#' estimates are declared to have converged. Default is 0.5, which is quite
#' conservative.
#' @param MPLE_gain_factor Multiplicative constant between 0 and 1 that controls
#' how far away the initial theta estimates will be from the standard MPLEs via
#' a one step Fisher update. In the case of strongly dependent data, it is
#' suggested to use a value of 0.10. Default is 0.
#' @param acceptable_fit_p_value_threshold A p-value threshold for how closely
#' statistics of observed network conform to statistics of networks simulated
#' from GERGM parameterized by converged final parameter estimates. Default value
#' is 0.05.
#' @param normalization_type If only a raw_network is provided the function
#' will automatically check to determine if all edges fall in the [0,1] interval.
#' If edges are determined to fall outside of this interval, then a trasformation
#' onto the interval may be specified. If "division" is selected, then the data
#' will have a value added to them such that the minimum value is at least zero
#' (if necessary) and then all edge values will be divided by the maximum to
#' ensure that the maximum value is in [0,1]. If "log" is selected, then the data
#' will have a value added to them such that the minimum value is at least zero
#' (if necessary), then 1 will be added to all edge values before they are logged
#' and then divided by the largest value, again ensuring that the resulting
#' network is on [0,1]. Defaults to "log" and need not be set to NULL if
#' providing covariates as it will be ignored.
#' @param transformation_type Specifies how covariates are transformed onto the
#' raw network. When working with heavy tailed data that are not strictly
#' positive, select "Cauchy" to transform the data using a Cauchy distribution.
#' If data are strictly positive and heavy tailed (such as financial data) it is
#' suggested the user select "LogCauchy" to perform a Log-Cauchy transformation
#' of the data. For a tranformation of the data using a Gaussian distribution,
#' select "Gaussian" and for strictly positive raw networks, select "LogNormal".
#' The Default value is "Cauchy".
#' @param estimation_method Simulation method for MCMC estimation. Default is
#' "Metropolis", which allows for the most flexible model specifications, but
#' may also be set to "Gibbs", if the user wishes to use Gibbs sampling.
#' @param downweight_statistics_together Logical specifying whether or not the
#' weights should be applied inside or outside the sum. Default is TRUE and user
#' should not select FALSE under normal circumstances.
#' @param force_x_theta_updates Defaults to 1 where theta estimation is not
#' allowed to converge until thetas have updated for x iterations . Useful when
#' model is not degenerate but simulated statistics do not match observed network
#' well when algorithm stops after first y updates.
#' @param force_x_lambda_updates Defaults to 1 where lambda estimation is not
#' allowed to converge until lambdas have updated for x iterations . Useful when
#' model is not degenerate but simulated statistics do not match observed network
#' well when algorithm stops after first y updates.
#' @param output_directory The directory where you would like output generated
#' by the GERGM estimation procedure to be saved (if output_name is specified).
#' This includes, GOF, trace, and parameter estimate plots, as well as a summary
#' of the estimation procedure and an .Rdata file containing the GERGM object
#' returned by this function. May be left as NULL if the user would prefer all
#' plots be printed to the graphics device.
#' @param output_name The common name stem you would like to assign to all
#' objects output by the gergm() function. Default value of NULL will not save any
#' output directly to .pdf files, it will be printed to the console instead. Must
#' be a character string or NULL. For example, if "Test" is supplied as the
#' output_name, then 4 files will be output: "Test_GOF.pdf", "Test_Parameter_Estim
#' ates.pdf", "Test_GERGM_Object.Rdata", "Test_Estimation_Log.txt", and
#' "Test_Trace_Plot.pdf"
#' @param generate_plots Defaults to TRUE, if FALSE, then no diagnostic or
#' parameter plots are generated.
#' @param verbose Defaults to TRUE (providing lots of output while model is
#' running). Can be set to FALSE if the user wishes to see less output.
#' @param use_previous_thetas Logical, defaults to TRUE. IF TRUE, then at each
#' iteration of covariate parameter updates beyond the first, the MPLE
#' initialization for theta parameters will be skipped, and the previous
#' iteration thetas will be used as a starting point. This option will only work
#' with convex hull initialization. The intuition here is that if MPLE is poor,
#' the previous theta estimates may be a better starting point for convex hull
#' initialization. In some cases this can lead to a very large speedup in
#' estimation. However, this can lead to arbitrarily poor performance in
#' some situations.
#' @param fine_grained_pv_optimization Logical indicating whether fine grained
#' proposal variance optimization should be used. This will often slow down
#' proposal variance optimization, but may provide better results. Highly
#' recommended if running a correlation model.
#' @param use_MPLE_only Logical specifying whether or not only the maximum pseudo
#' likelihood estimates should be obtained. In this case, no simulations will be
#' performed. Default is FALSE.
#' @param start_with_zeros Defaults to FALSE. IF TRUE, then no MPLE is used and
#' allendogenous parameters are initialized to zero. This method will still work
#' with convex_hull_proportion.
#' @param stop_for_degeneracy When TRUE, automatically stops estimation when
#' degeneracy is detected, even when hyperparameter_optimization is set to TRUE.
#' Defaults to FALSE.
#' @param maximum_number_of_lambda_updates Maximum number of iterations of outer
#' MCMC loop which alternately estimates transform parameters and ERGM
#' parameters. In the case that data_transformation = NULL, this argument is
#' ignored. Default is 10.
#' @param maximum_number_of_theta_updates Maximum number of iterations within the
#' MCMC inner loop which estimates the ERGM parameters. Default is 100.
#' @param user_specified_initial_thetas Optional numeric vector of user specified
#' theta values to be used for initialization (instead of MPLE). This option
#' should not be used exept when MPLE performance is poor. Defaults to NULL.
#' @param integration_intervals The number of intervals to be used for numerical
#' integration in weighted MPLE and in MPLE for the distribution estimator.
#' Defaults to 150 but may be increased if more accuracy is required.
#' @param distribution_mple_regularization_weight L2 regularization of the MPLE
#' theta estimates may be necessary to provide an adequate starting position
#' when using the distribtuion estimator. We suggest a value of 0.05, but the
#' optimal L2 penalty will be application specific. Setting to zero removes all
#' L2 regularization.
#' @param theta_grid_optimization_list Defaults to NULL. This highly
#' experimental feature may allow the user to address model degeneracy arising
#' from a suboptimal theta initialization. It performs a grid search around the
#' theta values calculated via MPLE to select a potentially improved
#' initialization. The runtime complexity of this feature grows exponentially in
#' the size of the grid and number of parameters -- use with great care. This
#' feature may only be used if hyperparameter_optimization = TRUE, and if a list
#' object of the following form is provided: list(grid_steps = 2,
#' step_size = 0.5, cores = 2, iteration_fraction = 0.5). grid_steps indicates
#' the number of steps out the grid search will perform, step_size indicates the
#' fraction of the MPLE theta estimate that each grid search step will change by,
#' cores indicates the number of cores to be used for parallel optimization, and
#' iteration_fraction indicates the fraction of the number of MCMC iterations
#' that will be used for each grid point (should be set less than 1 to speed up
#' optimization). In general grid_steps should be smaller the more structural
#' parameters the user wishes to specify. For example, with 5 structural
#' parameters (mutual, ttriads, etc.), grid_steps = 3 will result in a (2*3+1)^5
#' = 16807 parameter grid search. Again this feature is highly experimental and
#' should only be used as a last resort (after playing with exponential
#' down weighting and the MPLE_gain_factor).
#' @param weighted_MPLE Defaults to FALSE. Should be used whenever the user is
#' specifying statistics with alpha down weighting. Tends to provide better
#' initialization when downweight_statistics_together = FALSE.
#' @param estimate_model Logical indicating whether a model should be estimated.
#' Defaults to TRUE, but can be set to FALSE if the user simply wishes to return
#' a GERGM object containing the model specification. Useful for debugging.
#' @param optimization_method The optimization method used by the 'optim'
#' function in estimating theta and lambda parameter estimates. Defualts to
#' "BFGS", but can also be any one of "L-BFGS-B", "Nelder-Mead", "CG", "SANN",
#' or "Brent". "L-BFGS-B" is preferred for fitting beta correlation models.
#' @param ... Optional arguments, currently unsupported.
#' @return A gergm object containing parameter estimates.
#' @examples
#' \dontrun{
#' set.seed(12345)
#' net <- matrix(rnorm(100,0,20),10,10)
#' colnames(net) <- rownames(net) <- letters[1:10]
#' formula <- net ~  mutual + ttriads
#'
#' test <- gergm(formula,
#'               normalization_type = "division",
#'               network_is_directed = TRUE,
#'               use_MPLE_only = FALSE,
#'               estimation_method = "Metropolis",
#'               number_of_networks_to_simulate = 40000,
#'               thin = 1/10,
#'               proposal_variance = 0.5,
#'               downweight_statistics_together = TRUE,
#'               MCMC_burnin = 10000,
#'               seed = 456,
#'               convergence_tolerance = 0.01,
#'               MPLE_gain_factor = 0,
#'               force_x_theta_updates = 4)
#' }
#' @export
gergm <- function(formula,
                  covariate_data = NULL,
                  beta_correlation_model = FALSE,
                  distribution_estimator = c("none","rowwise-marginal","joint"),
                  network_is_directed = TRUE,
                  include_diagonal = FALSE,
                  number_of_networks_to_simulate = 500,
                  MCMC_burnin = 100,
                  thin = 1,
                  proposal_variance = 0.1,
                  target_accept_rate = 0.25,
                  seed = 123,
                  hyperparameter_optimization = FALSE,
                  convex_hull_proportion = 0.9,
                  convex_hull_convergence_proportion = 0.9,
                  sample_edges_at_a_time = 0,
                  parallel = FALSE,
                  parallel_statistic_calculation = FALSE,
                  cores = 1,
                  use_stochastic_MH = FALSE,
                  stochastic_MH_proportion = 0.25,
                  slackr_integration_list = NULL,
                  convergence_tolerance = 0.5,
                  MPLE_gain_factor = 0,
                  acceptable_fit_p_value_threshold = 0.05,
                  normalization_type = c("log","division"),
                  transformation_type = c("Cauchy","LogCauchy","Gaussian","LogNormal"),
                  estimation_method = c("Metropolis","Gibbs"),
                  downweight_statistics_together = TRUE,
                  force_x_theta_updates = 1,
                  force_x_lambda_updates = 1,
                  output_directory = NULL,
                  output_name = NULL,
                  generate_plots = TRUE,
                  verbose = TRUE,
                  use_previous_thetas = TRUE,
                  fine_grained_pv_optimization = FALSE,
                  use_MPLE_only = c(FALSE, TRUE),
                  start_with_zeros = FALSE,
                  stop_for_degeneracy = FALSE,
                  maximum_number_of_lambda_updates = 10,
                  maximum_number_of_theta_updates = 10,
                  user_specified_initial_thetas = NULL,
                  integration_intervals = 150,
                  distribution_mple_regularization_weight = 0.05,
                  theta_grid_optimization_list = NULL,
                  weighted_MPLE = FALSE,
                  estimate_model = TRUE,
                  optimization_method = c("BFGS", "L-BFGS-B", "Nelder-Mead", "CG", "SANN", "Brent"),
                  ...
                  ){

  # pass in experimental features through elipsis
  object <- as.list(substitute(list(...)))[-1L]

  # record start time for estimation
  start_time <- Sys.time()

  # hard coded possible stats
  possible_structural_terms <- c("out2stars",
                                 "in2stars",
                                 "ctriads",
                                 "mutual",
                                 "ttriads",
                                 "edges",
                                 "diagonal",
                                 "wtdegree",
                                 "bcent",
                                 "stbcent")
  possible_structural_terms_undirected <- c("twostars",
                                            "ttriads",
                                            "edges",
                                            "diagonal",
                                            "wtdegree",
                                            "bcent",
                                            "stbcent")
  possible_covariate_terms <- c("absdiff",
                                "nodecov",
                                "nodematch",
                                "sender",
                                "receiver",
                                "intercept",
                                "nodemix")
  possible_network_terms <- "netcov"
  possible_transformations <- c("cauchy",
                                "logcauchy",
                                "gaussian",
                                "lognormal")


  # set logical values for whether we are using MPLE only, whether the network
  # is directed, and which estimation method we are using as well as the
  # transformation type
  use_MPLE_only <- use_MPLE_only[1] #default is FALSE
  network_is_directed <- network_is_directed[1] #default is TRUE
  estimation_method <- estimation_method[1] #default is MH
  transformation_type <- transformation_type[1] #default is "Cauchy"
  transformation_type <- tolower(transformation_type)
  normalization_type <- normalization_type[1]
  distribution_estimator <- distribution_estimator[1]
  using_distribution_estimator <- FALSE
  optimization_method <- optimization_method[1]

  # deal with the case where we are using a distribution estimator
  if (distribution_estimator %in%  c("none","rowwise-marginal","joint")) {
    # if we are actually using the distribution estimator
    if (distribution_estimator != "none") {
      # perform checks and set variables so that they are at their correct values.
      cat("Making sure all options are set properly for use with the",
          "distribution estimator... \n")
      use_MPLE_only <- FALSE
      estimation_method <- "Metropolis"
      covariate_data <- NULL
      network_is_directed <- TRUE
      normalization_type <- "division"
      beta_correlation_model <- FALSE
      using_distribution_estimator <- TRUE
      include_diagonal <- TRUE
    }
  } else {
    stop("distribution_estimator must be one of 'none','rowwise-marginal', or 'joint'")
  }

  # add a diagonal term if requested.
  if (include_diagonal) {
    if (estimation_method == "Gibbs") {
      cat("Gibbs sampling is currently not supported when include_diagonal == TRUE, changing to 'Metropolis'")
      estimation_method <- "Metropolis"
    }
    formula <- add_diagonal_term(formula)
  }

  # set the number of threads to use with parallel
  if (parallel) {
    RcppParallel::setThreadOptions(numThreads = cores)
  }

  # if we are using a correlation network, then the network must be undirected.
  if (beta_correlation_model) {
    if (network_is_directed) {
      cat("Setting network_is_directed to FALSE for correlation network...\n")
    }
    network_is_directed <- FALSE
  }

  if (network_is_directed) {
    possible_structural_term_indices <- 1:10
  } else {
    possible_structural_term_indices <- c(2,5,6,7,8,9,10)
  }

  # check terms for undirected network
  if (!network_is_directed) {
    formula <- parse_undirected_structural_terms(
      formula,
      possible_structural_terms,
      possible_structural_terms_undirected)
  }

  # check to see if we are using a regression intercept term, and if we are,
  # then add the "intercept" field to the formula.
  check_for_edges <- Parse_Formula_Object(formula,
                               possible_structural_terms,
                               possible_covariate_terms,
                               possible_network_terms,
                               using_distribution_estimator,
                               raw_network = NULL,
                               theta = NULL,
                               terms_to_parse = "structural",
                               covariate_data = covariate_data)

  # automatically add an intercept term if necessary
  if (check_for_edges$include_intercept) {
    if (check_for_edges$regression_intercept) {
      formula <- add_intercept_term(formula)
    }
  }

  if (is.null(output_directory) & !is.null(output_name)) {
    stop("You have specified an output file name but no output directory. Please
         specify both or neither.")
  }

  if (length(which(possible_transformations %in% transformation_type  == T)) != 1) {
    stop("You have specified a transformation that is not recognized. Please
         specify one of: Cauchy, LogCauchy, Gaussian, or LogNormal")
  }

  #make sure proposal variance is greater than zero
  if (proposal_variance <= 0) {
    stop("You supplied a proposal variance that was less than or equal to zero.")
  }

  # make sure than thin is always less than one, if not, just take its reciprocal.
  if (thin > 1) {
    thin <- 1/thin
  }

  formula <- as.formula(formula)

  #0. Prepare the data
  Transformed_Data <- Prepare_Network_and_Covariates(
     formula,
     possible_structural_terms,
     possible_covariate_terms,
     possible_network_terms,
     covariate_data = covariate_data,
     normalization_type = normalization_type,
     is_correlation_network = FALSE,
     is_directed = network_is_directed,
     beta_correlation_model = beta_correlation_model,
     include_diagonal = include_diagonal)

  data_transformation <- NULL
  if (!is.null(Transformed_Data$transformed_covariates)) {
    data_transformation <- Transformed_Data$transformed_covariates
  }
  gpar.names <- c(Transformed_Data$gpar.names, "dispersion")

  #1. Create GERGM object from network

  GERGM_Object <- Create_GERGM_Object_From_Formula(
     formula,
     theta.coef = NULL,
     possible_structural_terms,
     possible_covariate_terms,
     possible_network_terms,
     using_distribution_estimator,
     raw_network = Transformed_Data$network,
     together = 1,
     transform.data = data_transformation,
     lambda.coef = NULL,
     transformation_type = transformation_type,
     is_correlation_network = FALSE,
     is_directed = network_is_directed,
     beta_correlation_model = beta_correlation_model,
     covariate_data = covariate_data,
     possible_structural_terms_undirected = possible_structural_terms_undirected)

  GERGM_Object@theta_estimation_converged <- FALSE
  GERGM_Object@lambda_estimation_converged <- FALSE
  GERGM_Object@observed_network  <- GERGM_Object@network
  GERGM_Object@observed_bounded_network <- GERGM_Object@bounded.network
  GERGM_Object@simulation_only <- FALSE
  GERGM_Object@transformation_type <- transformation_type
  GERGM_Object@downweight_statistics_together <- downweight_statistics_together
  GERGM_Object@directed_network <- network_is_directed
  # only add in list if not NULL
  GERGM_Object@using_grid_optimization <- FALSE
  if (class(theta_grid_optimization_list) == "list") {
    GERGM_Object@using_grid_optimization <- TRUE
    GERGM_Object@theta_grid_optimization_list <- theta_grid_optimization_list
  }
  
  if (!is.null(data_transformation)) {
    GERGM_Object@data_transformation <- data_transformation
  }

  if (is.null(output_name)) {
    GERGM_Object@print_output <- FALSE
  }else{
    GERGM_Object@print_output <- TRUE
  }

  # if we are using a correlation network then set field to TRUE.
  GERGM_Object@is_correlation_network <- FALSE # deprecated
  GERGM_Object@beta_correlation_model <- beta_correlation_model
  GERGM_Object@distribution_estimator <- distribution_estimator
  GERGM_Object@include_diagonal <- include_diagonal
  GERGM_Object@user_specified_initial_thetas
  GERGM_Object@use_user_specified_initial_thetas <- FALSE
  if (!is.null(user_specified_initial_thetas)) {
    GERGM_Object@user_specified_initial_thetas <- user_specified_initial_thetas
    GERGM_Object@use_user_specified_initial_thetas <- TRUE
  }

  # record the various optimizations we are using so that they can be used in
  # the main algorithm
  GERGM_Object@weighted_MPLE <- weighted_MPLE
  GERGM_Object@integration_intervals <- integration_intervals
  GERGM_Object@regularization_weight <- distribution_mple_regularization_weight
  GERGM_Object@fine_grained_pv_optimization <- fine_grained_pv_optimization
  GERGM_Object@parallel <- parallel
  GERGM_Object@parallel_statistic_calculation <- parallel_statistic_calculation
  GERGM_Object@cores <- cores
  GERGM_Object@use_stochastic_MH <- use_stochastic_MH
  GERGM_Object@stochastic_MH_proportion <- stochastic_MH_proportion
  GERGM_Object@possible_endogenous_statistic_indices <- possible_structural_term_indices

  # set adaptive metropolis parameters
  GERGM_Object@hyperparameter_optimization <- hyperparameter_optimization
  GERGM_Object@target_accept_rate <- target_accept_rate
  GERGM_Object@proposal_variance <- proposal_variance
  GERGM_Object@estimation_method <- estimation_method
  GERGM_Object@number_of_simulations <- number_of_networks_to_simulate
  GERGM_Object@thin <- thin
  GERGM_Object@burnin <- MCMC_burnin
  GERGM_Object@MPLE_gain_factor <- MPLE_gain_factor
  GERGM_Object@start_time <- toString(start_time)
  GERGM_Object@start_with_zeros <- start_with_zeros
  GERGM_Object@sample_edges_at_a_time <- sample_edges_at_a_time

  if (is.null(convex_hull_proportion)) {
    GERGM_Object@convex_hull_proportion <- -1
  } else {
    GERGM_Object@convex_hull_proportion <- convex_hull_proportion
  }

  GERGM_Object@convex_hull_convergence_proportion <-  convex_hull_convergence_proportion

  GERGM_Object@optimization_method <- optimization_method
  GERGM_Object@use_previous_thetas <- use_previous_thetas



  GERGM_Object@using_slackr_integration <- FALSE
  if (!is.null(slackr_integration_list)) {
    if (length(slackr_integration_list) != 3) {
      stop("slackr_integration_list must contain three fields, see documentation...")
    }
    GERGM_Object@slackr_integration_list <- slackr_integration_list
    GERGM_Object@using_slackr_integration <- TRUE

    start_message <- paste('Starting GERGM Estimation at',toString(start_time))
    specification <- paste('Specification:',toString(formula))
    specification <- gsub("\"", "", specification, fixed=TRUE)

    slackr::slackr_bot(
      start_message,
      specification,
      channel = GERGM_Object@slackr_integration_list$channel,
      username = GERGM_Object@slackr_integration_list$model_name,
      incoming_webhook_url = GERGM_Object@slackr_integration_list$incoming_webhook_url)
  }

  # prepare auxiliary data
  GERGM_Object@statistic_auxiliary_data <- prepare_statistic_auxiliary_data(
    GERGM_Object)

  #GERGM_Object@statistic_auxiliary_data$triples
  #GERGM_Object@statistic_auxiliary_data$pairs
  # record statistics on observed and bounded network
  h.statistics1 <- calculate_h_statistics(
    GERGM_Object,
    GERGM_Object@statistic_auxiliary_data,
    all_weights_are_one = FALSE,
    calculate_all_statistics = TRUE,
    use_constrained_network = FALSE)
  h.statistics2 <- calculate_h_statistics(
    GERGM_Object,
    GERGM_Object@statistic_auxiliary_data,
    all_weights_are_one = FALSE,
    calculate_all_statistics = TRUE,
    use_constrained_network = TRUE)
  statistic_values <- rbind(h.statistics1, h.statistics2)
  colnames(statistic_values) <- GERGM_Object@full_theta_names
  rownames(statistic_values) <- c("network", "bounded_network")
  GERGM_Object@stats <- statistic_values
  
  #2. Estimate GERGM
  if (estimate_model) {
    GERGM_Object <- Estimate_GERGM(
      formula,
      MPLE.only = use_MPLE_only,
      max.num.iterations = maximum_number_of_lambda_updates,
      mc.num.iterations = maximum_number_of_theta_updates,
      seed = seed,
      tolerance = convergence_tolerance,
      possible.stats = possible_structural_terms,
      GERGM_Object = GERGM_Object,
      force_x_theta_updates = force_x_theta_updates,
      transformation_type =  transformation_type,
      verbose = verbose,
      force_x_lambda_updates = force_x_lambda_updates,
      stop_for_degeneracy = stop_for_degeneracy)

  #3. Perform degeneracy diagnostics and create GOF plots
    if (!GERGM_Object@theta_estimation_converged) {
      warning("Estimation procedure did not detect convergence in Theta estimates. Estimation halted when maximum number of updates was reached. Be careful to assure good model fit or select a more relaxed convergence criterion.")
      GERGM_Object <- store_console_output(GERGM_Object,"Estimation procedure did not detect convergence in Theta estimates. Estimation halted when maximum number of updates was reached. Be careful to assure good model fit or select a more relaxed convergence criterion.")
    }
    if (!GERGM_Object@lambda_estimation_converged) {
      warning("Estimation procedure did not detect convergence in Lambda estimates. Estimation halted when maximum number of updates was reached. Be careful to assure good model fit or select a more relaxed convergence criterion.")
      GERGM_Object <- store_console_output(GERGM_Object,"Estimation procedure did not detect convergence in Lambda estimates. Estimation halted when maximum number of updates was reached. Be careful to assure good model fit or select a more relaxed convergence criterion.")
    }

    #now simulate from last update of theta parameters
    GERGM_Object <- Simulate_GERGM(
      GERGM_Object,
      seed1 = seed,
      possible.stats = possible_structural_terms,
      parallel = GERGM_Object@parallel_statistic_calculation)

    colnames(GERGM_Object@lambda.coef) = gpar.names

    # change back column names if we are dealing with an undirected network
    if (!network_is_directed) {
      change <- which(colnames(GERGM_Object@theta.coef) == "in2stars")
      if (length(change) > 0) {
        colnames(GERGM_Object@theta.coef)[change] <- "twostars"
      }
    }

    init.statistics <- NULL
    if (GERGM_Object@is_correlation_network) {
      init.statistics <- calculate_h_statistics(
        GERGM_Object,
        GERGM_Object@statistic_auxiliary_data,
        all_weights_are_one = FALSE,
        calculate_all_statistics = TRUE,
        use_constrained_network = FALSE)
    }else{
      init.statistics <- calculate_h_statistics(
        GERGM_Object,
        GERGM_Object@statistic_auxiliary_data,
        all_weights_are_one = FALSE,
        calculate_all_statistics = TRUE,
        use_constrained_network = TRUE)
    }
    # fix issue with the wrong stats being saved
    GERGM_Object@stats[2,] <- init.statistics
    hsn.tot <- GERGM_Object@MCMC_output$Statistics

    # save these statistics so we can make GOF plots in the future, otherwise
    # they would be the transformed statistics which would produce poor GOF plots.
    GERGM_Object@simulated_statistics_for_GOF <- hsn.tot
    #thin statsitics
    hsn.tot <- Thin_Statistic_Samples(hsn.tot)

    #calculate t.test p-values for calculating the difference in the means of
    # the newly simulated data with the original network
    statistic_test_p_values <- rep(NA,ncol(hsn.tot))
    for (i in 1:ncol(hsn.tot)) {
      if (length(unique(hsn.tot[,i])) > 1) {
        statistic_test_p_values[i] <- round(t.test(hsn.tot[, i],
                                                   mu = init.statistics[i])$p.value,3)
      }
    }


    stats.data <- data.frame(Observed = init.statistics,
                             Simulated = colMeans(hsn.tot))
    rownames(stats.data) <- GERGM_Object@full_theta_names
    cat("Statistics of observed network and networks simulated from final theta parameter estimates:\n")
    GERGM_Object <- store_console_output(GERGM_Object,"Statistics of observed network and networks simulated from final theta parameter estimates:\n")

    GERGM_Object <- store_console_output(GERGM_Object, toString(stats.data))

    statistic_test_p_values <- data.frame(p_values = statistic_test_p_values)
    rownames(statistic_test_p_values) <- GERGM_Object@full_theta_names
    cat("\nt-test p-values for statistics of observed network and networks simulated from final theta parameter estimates:\n \n")
    GERGM_Object <- store_console_output(GERGM_Object,"\nt-test p-values for statistics of observed network and networks simulated from final theta parameter estimates:\n \n")
    print(statistic_test_p_values)
    GERGM_Object <- store_console_output(GERGM_Object, toString(statistic_test_p_values))


    colnames(statistic_test_p_values) <- "p_values"
    GERGM_Object@observed_simulated_t_test <- statistic_test_p_values

    #test to see if we have an acceptable fit
    check_stats <- GERGM_Object@statistic_auxiliary_data$specified_statistic_indexes_in_full_statistics
    acceptable_fit <- statistic_test_p_values[check_stats, 1]

    if (min(acceptable_fit) > acceptable_fit_p_value_threshold) { # if good fit, print out "indistinguishable"
      GERGM_Object@acceptable_fit <- TRUE
      message("Parameter estimates simulate networks that are statistically indistinguishable from observed network on the statistics specified by the user. ")
      GERGM_Object <- store_console_output(GERGM_Object,"Parameter estimates simulate networks that are statistically indistinguishable from observed network on the statistics specified by the user. ")
    }else{
      GERGM_Object@acceptable_fit <- FALSE
      message("Parameter estimates simulate networks that are statistically distinguishable from observed network. Check GOF plots to determine if the model provides a reasonable fit . This is a very stringent test for goodness of fit, so results may still be acceptable even if this criterion is not met.")
      GERGM_Object <- store_console_output(GERGM_Object, "Parameter estimates simulate networks that are statistically distinguishable from observed network. Check GOF plots to determine if the model provides a reasonable fit . This is a very stringent test for goodness of fit, so results may still be acceptable even if this criterion is not met.")
    }

    GERGM_Object@simulated_bounded_networks_for_GOF <- GERGM_Object@MCMC_output$Networks

    # precalculate intensities and degree distributions so we can return them.
    GERGM_Object <- calculate_additional_GOF_statistics(GERGM_Object)

    #4. output everything to the appropriate files and return GERGM object.
    if (generate_plots) {
      # only generate output if output_name is not NULL
      if (!is.null(output_name)) {
        if (is.null(output_directory)) {
          output_directory <- getwd()
        }
        current_directory <- getwd()
        setwd(output_directory)

        try({
          pdf(file = paste(output_name,"_GOF.pdf",sep = ""),
              height = 4,
              width = 10)
          GOF(GERGM_Object)
          dev.off()

          pdf(file = paste(output_name,"_Parameter_Estimates.pdf",sep = ""),
              height = 4,
              width = 5)
          Estimate_Plot(GERGM_Object)
          dev.off()

          pdf(file = paste(output_name,"_Trace_Plot.pdf",sep = ""),
              height = 4,
              width = 6)
          Trace_Plot(GERGM_Object)
          dev.off()

          save(GERGM_Object,
               file = paste(output_name,"_GERGM_Object.RData",sep = ""))

          write.table(GERGM_Object@console_output,
                      file = paste(output_name,"_Estimation_Log.txt",sep = ""),
                      row.names = F,
                      col.names = F,
                      fileEncoding = "utf8",
                      quote = F)
        })
        setwd(current_directory)
      } else{
        # if we are not saving everything to a directory then just print stuff to
        # the graphics device
        try({
          GOF(GERGM_Object)
          Sys.sleep(2)
          Estimate_Plot(GERGM_Object)
          Sys.sleep(2)
          Trace_Plot(GERGM_Object)
        })
      }
    }

    # transform networks back to observed scale
    # cat("Transforming networks simulated via MCMC as part of the fit diagnostics back on to the scale of observed network. You can access these networks through the '@MCMC_output$Networks' field returned by this function...\n")
    # GERGM_Object <- convert_simulated_networks_to_observed_scale(GERGM_Object)
  } else {
    cat("Returning GERGM object without estimating model...\n")
  }

  # now report the total elapsed time and end time to the user
  end_time <- Sys.time()
  GERGM_Object@end_time <- toString(end_time)

  elapsed_time <- end_time - start_time
  GERGM_Object@elapsed_time <- toString(elapsed_time)

  cat("Estimation Complete at:",toString(end_time),
      "\nElapsed time (seconds):",elapsed_time,"\n")

  # push to slack if desired
  if (GERGM_Object@using_slackr_integration) {
    completion_time <- paste("Estimation Complete at:",toString(end_time))
    elapsed_time <- paste("Elapsed time (seconds):",elapsed_time)
    slackr::slackr_bot(
      completion_time,
      elapsed_time,
      channel = GERGM_Object@slackr_integration_list$channel,
      username = GERGM_Object@slackr_integration_list$model_name,
      incoming_webhook_url = GERGM_Object@slackr_integration_list$incoming_webhook_url)
  }

  return(GERGM_Object)
}
cduron1/STBTWN_GERGM documentation built on Aug. 19, 2019, 4:16 p.m.