JANE | R Documentation |
Fit a latent space cluster model, with or without noise edges, using an EM algorithm.
JANE(
A,
D = 2,
K = 2,
family = "bernoulli",
noise_weights = FALSE,
guess_noise_weights = NULL,
model,
initialization = "GNN",
case_control = FALSE,
DA_type = "none",
seed = NULL,
control = list()
)
A |
A square matrix or sparse matrix of class 'dgCMatrix' representing the adjacency matrix of the network of interest. |
D |
Integer (scalar or vector) specifying the dimension of the latent space (default is 2). |
K |
Integer (scalar or vector) specifying the number of clusters to consider (default is 2). |
family |
A character string specifying the distribution of the edge weights.
|
noise_weights |
A logical; if |
guess_noise_weights |
Only applicable if |
model |
A character string specifying the model to fit:
|
initialization |
A character string or a list to specify the initial values for the EM algorithm:
|
case_control |
A logical; if |
DA_type |
(Experimental) A character string to specify the type of deterministic annealing approach to use
|
seed |
(optional) An integer value to specify the seed for reproducibility. |
control |
A list of control parameters. See 'Details'. |
Isolates are removed from the adjacency matrix A. If an unsymmetric adjacency matrix A is supplied for model %in% c('NDH', 'RS')
the user will be asked if they would like to proceed with converting A to a symmetric matrix (i.e., A <- 1.0 * ( (A + t(A)) > 0.0 )
); only able to do so if family = 'bernoulli'
. Additionally, if a weighted network is supplied and noise_weights = FALSE
, then the network will be converted to an unweighted binary network (i.e., (A > 0.0)*1.0) and a latent space cluster model is fit.
control
:
The control
argument is a named list that the user can supply containing the following components:
verbose
A logical; if TRUE
causes additional information to be printed out about the progress of the EM algorithm (default is FALSE
).
max_its
An integer specifying the maximum number of iterations for the EM algorithm (default is 1e3
).
min_its
An integer specifying the minimum number of iterations for the EM algorithm (default is 10
).
priors
A list of S3 class
"JANE.priors
" representing prior hyperparameter specifications (default is NULL
). See specify_priors
for details on how to specify the hyperparameters.
n_interior_knots
(only relevant for model %in% c('RS', 'RSR')
) An integer specifying the number of interior knots used in fitting a natural cubic spline for degree heterogeneity (and connection strength heterogeneity if working with weighted network) models (default is 5
).
termination_rule
A character string to specify the termination rule to determine the convergence of the EM algorithm:
'prob_mat'
: uses change in the absolute difference in \hat{Z}^{U}
(i.e., the N \times K
cluster membership probability matrix) between subsequent iterations (default)
'Q'
: uses change in the absolute difference in the objective function of the E-step evaluated using parameters from subsequent iterations
'ARI'
: comparing the classifications between subsequent iterations using adjusted Rand index
'NMI'
: comparing the classifications between subsequent iterations using normalized mutual information
'CER'
: comparing the classifications between subsequent iterations using classification error rate
tolerance
A numeric specifying the tolerance used for termination_rule %in% c('Q', 'prob_mat')
(default is 1e-3
).
tolerance_ARI
A numeric specifying the tolerance used for termination_rule = 'ARI'
(default is 0.999
).
tolerance_NMI
A numeric specifying the tolerance used for termination_rule = 'NMI'
(default is 0.999
).
tolerance_CER
A numeric specifying the tolerance used for termination_rule = 'CER'
(default is 0.01
).
n_its_start_CA
An integer specifying what iteration to start computing the change in cumulative averages (note: the change in the cumulative average of \hat{U}
, the latent position matrix, is not tracked when termination_rule = 'Q'
) (default is 20
).
tolerance_diff_CA
A numeric specifying the tolerance used for the change in cumulative average of termination_rule
metric and \hat{U}
(note: the change in the cumulative average of \hat{U}
is not tracked when termination_rule = 'Q'
) (default is 1e-3
).
consecutive_diff_CA
An integer specifying the tolerance for the number of consecutive instances where the change in cumulative average is less than tolerance_diff_CA
(default is 5
).
quantile_diff
A numeric in [0,1]
specifying the quantile used in computing the change in the absolute difference of \hat{Z}^{U}
and \hat{U}
between subsequent iterations (default is 1
, i.e., max).
beta_temp_schedule
(Experimental) A numeric vector specifying the temperature schedule for deterministic annealing (default is 1
, i.e., deterministic annealing not utilized).
n_control
An integer specifying the fixed number of controls (i.e., non-links) sampled for each actor; only relevant when case_control = TRUE
(default is 100
when case_control = TRUE
and NULL
when case_control = FALSE
).
n_start
An integer specifying the maximum number of starts for the EM algorithm (default is 5
).
max_retry
An integer specifying the maximum number of re-attempts if starting values cause issues with EM algorithm (default is 5
).
IC_selection
A character string to specify the information criteria used to select the optimal fit based on the combinations of K
, D
, and n_start
considered:
'BIC_model'
: BIC computed from logistic regression or Hurdle model component
'BIC_mbc'
: BIC computed from model based clustering component
'ICL_mbc'
: ICL computed from model based clustering component
'Total_BIC'
: sum of 'BIC_model'
and 'BIC_mbc'
'Total_ICL'
: sum of 'BIC_model'
and 'ICL_mbc'
(default)
sd_random_U_GNN
(only relevant when initialization = 'GNN'
) A positive numeric value specifying the standard deviation for the random draws from a normal distribution to initialize U
(default is 1
).
max_retry_GNN
(only relevant when initialization = 'GNN'
) An integer specifying the maximum number of re-attempts for the GNN
approach before switching to random starting values (default is 10
).
n_its_GNN
(only relevant when initialization = 'GNN'
) An integer specifying the maximum number of iterations for the GNN
approach (default is 10
).
downsampling_GNN
(only relevant when initialization = 'GNN'
) A logical; if TRUE
employs downsampling s.t. the number of links and non-links are balanced for the GNN
approach (default is TRUE
).
Running JANE
in parallel:
JANE
integrates the future and future.apply packages to fit the various combinations of K
, D
, and n_start
in parallel. The 'Examples' section below provides an example of how to run JANE
in parallel. See plan
and future.apply
for more details.
Choosing the number of clusters:
JANE
allows for the following model selection criteria to choose the number of clusters (smaller values are favored):
'BIC_model'
: BIC computed from logistic regression or Hurdle model component
'BIC_mbc'
: BIC computed from model based clustering component
'ICL_mbc'
: ICL (Biernacki et al. (2000)) computed from model based clustering component
'Total_BIC'
: Sum of 'BIC_model'
and 'BIC_mbc'
, this is the model selection criterion proposed by Handcock et al. (2007)
'Total_ICL'
: (default) sum of 'BIC_model'
and 'ICL_mbc'
, this criterion is similar to 'Total_BIC'
, but uses ICL for the model based clustering component, which tends to favor more well-separated clusters.
Based on simulation studies, Biernacki et al. (2000) recommends that when the interest in mixture modeling is cluster analysis, instead of density estimation, the ICL_{mbc}
criterion should be favored over the BIC_{mbc}
criterion, as the BIC_{mbc}
criterion tends to overestimate the number of clusters. The BIC_{mbc}
criterion is designed to choose the number of components in a mixture model rather than the number of clusters.
Warning: It is not certain whether it is appropriate to use the model selection criterion above to select D
.
A list of S3 class
"JANE
" containing the following components:
input_params |
A list containing the input parameters for |
A |
The square sparse adjacency matrix of class 'dgCMatrix' used in fitting the latent space cluster model. This matrix can be different than the input A matrix as isolates are removed. |
IC_out |
A matrix containing the relevant information criteria for all combinations of |
all_convergence_ind |
A matrix containing the convergence information (i.e., 1 = converged, 0 = did not converge) and number of iterations for all combinations of |
optimal_res |
A list containing the estimated parameters of interest based on the optimal fit selected. It is recommended to use |
optimal_starting |
A list of S3 |
Biernacki, C., Celeux, G., Govaert, G., 2000. Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Transactions on Pattern Analysis and Machine Intelligence 22, 719–725.
Handcock, M.S., Raftery, A.E., Tantrum, J.M., 2007. Model-based clustering for social networks. Journal of the Royal Statistical Society Series A: Statistics in Society 170, 301–354.
# Simulate network
mus <- matrix(c(-1,-1,1,-1,1,1),
nrow = 3,
ncol = 2,
byrow = TRUE)
omegas <- array(c(diag(rep(7,2)),
diag(rep(7,2)),
diag(rep(7,2))),
dim = c(2,2,3))
p <- rep(1/3, 3)
beta0 <- 1.0
sim_data <- JANE::sim_A(N = 100L,
model = "NDH",
mus = mus,
omegas = omegas,
p = p,
params_LR = list(beta0 = beta0),
remove_isolates = TRUE)
# Run JANE on simulated data
res <- JANE::JANE(A = sim_data$A,
D = 2L,
K = 3L,
initialization = "GNN",
model = "NDH",
case_control = FALSE,
DA_type = "none")
# Run JANE on simulated data - consider multiple D and K
res <- JANE::JANE(A = sim_data$A,
D = 2:5,
K = 2:10,
initialization = "GNN",
model = "NDH",
case_control = FALSE,
DA_type = "none")
# Run JANE on simulated data - parallel with 5 cores (NOT RUN)
# future::plan(future::multisession, workers = 5)
# res <- JANE::JANE(A = sim_data$A,
# D = 2L,
# K = 3L,
# initialization = "GNN",
# model = "NDH",
# case_control = FALSE,
# DA_type = "none")
# future::plan(future::sequential)
# Run JANE on simulated data - case/control approach with 20 controls sampled for each actor
res <- JANE::JANE(A = sim_data$A,
D = 2L,
K = 3L,
initialization = "GNN",
model = "NDH",
case_control = TRUE,
DA_type = "none",
control = list(n_control = 20))
# Reproducibility
res1 <- JANE::JANE(A = sim_data$A,
D = 2L,
K = 3L,
initialization = "GNN",
seed = 1234,
model = "NDH",
case_control = FALSE,
DA_type = "none")
res2 <- JANE::JANE(A = sim_data$A,
D = 2L,
K = 3L,
initialization = "GNN",
seed = 1234,
model = "NDH",
case_control = FALSE,
DA_type = "none")
## Check if results match
all.equal(res1, res2)
# Another reproducibility example where the seed was not set.
# It is possible to replicate the results using the starting values due to
# the nature of EM algorithms
res3 <- JANE::JANE(A = sim_data$A,
D = 2L,
K = 3L,
initialization = "GNN",
model = "NDH",
case_control = FALSE,
DA_type = "none")
## Extract starting values
start_vals <- res3$optimal_start
## Run JANE using extracted starting values, no need to specify D and K
## below as function will determine those values from start_vals
res4 <- JANE::JANE(A = sim_data$A,
initialization = start_vals,
model = "NDH",
case_control = FALSE,
DA_type = "none")
## Check if optimal_res are identical
all.equal(res3$optimal_res, res4$optimal_res)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.