#' MBA
#'
#' MBA performs matrix-based analysis (MBA) as theoretically elaborated in the
#' \href{https://www.biorxiv.org/content/10.1101/459545v1}{manuscript}. The input data
#' should be formulated in a pure-text table that codes the regions and
#' variables. The response variable is usually correlation values (with or
#' without Fisher-transformation) or white-matter properties (e.g., fractional
#' anisotropy, mean diffusivity, radial diffusivity, axial diffusivity, etc.),
#' but it can also be any values from a symmetric matrix (e.g., coherence, mutual
#' information, entropy). In other words, the effects are assumed to be
#' non-directional or non-causal. Diagonals can be included in the input if
#' sensible.
#'
#' \enumerate{ \item You are performing Bayesian analysis!!! So, you will
#' directly obtain the probability of an effect being positive or negative with
#' your data, instead of witch hunt - hunting the straw man of p-value (weirdness
#' of your data when pretending that absolutely nothing exists).
#'
#' \item Avoid using pure numbers to code the labels for categorical variables.
#' The column order does not matter. . You can specify those column names as you
#' prefer, but it saves a little bit scripting if you adopt the default naming
#' for subjects ('Subj'), regions ('ROI1' and 'ROI2') and response variable
#' ('Y'). The column labels ROI1 and ROI2 are meant to indicate the two regions
#' associated with each response value, and they do not mean any sequence or
#' directionality.
#'
#' \item Only provide half of the off-diagonals in the table (no duplicates
#' allowed). Missing data are fine (e.g., white-matter property deemed
#' nonexistent).
#'
#' \item Simple analysis can be done in a few minutes, but computational cost can
#' be very high (e.g., weeks or even months) when the number of regions or
#' subjects is large or when a few explanatory variables are involved. Be
#' patient: there is hope in the near future that further parallelization can be
#' implemented.
#'
#' \item Add more columns if explanatory variables are considered in the model.
#' Currently only between-subjects variables (e.g., sex, patients vs. controls,
#' age) are allowed. Each label in a between-subjects factor (categorical
#' variable) should be coded with at least 1 character (labeling with pure
#' numbers is fine but not recommended). If preferred, you can quantitatively
#' code the levels of a factor yourself by creating k-1 columns for a factor with
#' k levels. However, be careful with your coding strategy because it would
#' impact how to interpret the results. Here is a good reference about factor
#' coding strategies:
#' https://stats.idre.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/
#'
#' \item It is strongly suggested that a quantitative explanatory variable be
#' standardized with option -stdz; that is, remove the mean and scale by the
#' standard deviation. This will improve the chance of convergence with each
#' Markov chain. If a between-subjects factor (e.g., sex) is involved, it may be
#' better to standardize a quantitative variable within each group in terms of
#' interpretability if the mean value differs substantially. However, do not
#' standardize a between-subjects factor if you quantitatively code it. And do
#' not standardize the response variable if the intercept is of interest!
#'
#' \item With within-subject variables, try to formulate the data as a contrast
#' between two factor levels or as a linear combination of multiple levels.
#'
#' \item The results from MBA are effect estimates for each region pair and at
#' each region. They can be slightly different across different runs or different
#' computers and R package versions due to the nature of randomness involved in
#' Monte Carlo simulations.
#'
#' \item The range in matrix plot may vary across different effects within an
#' analysis. It is possible to force the same range for all plots through
#' fine-tuning within R using the output of .RData. The criteria of color coding
#' for the strength of evidence in matrix plots in the output is as follows:
#' \itemize{ \item Green - two-tailed 95% compatible/uncertainty interval (or
#' probability of effect being positive >= 0.975 or <= 0.025) \item Yellow -
#' one-tailed 95% compatible/uncertainty interval (or probability of effect \item
#' Gray - one-tailed 90% compatible/uncertainty interval (or probability of
#' effect being positive >= 0.90 or <= 0.10) \item white - anything else} }
#'
#' @param dataTable Path to the file on disk that contains the data structure.
#' The data structure should be a table in long format (cf. wide format) in R
#' with a header as the first line, the names Subj, ROI1, ROI2, Y, and Age.
#' Also see the note section of the documentation.
#' @param prefix Prefix is used to specify output file names. The main output is
#' a text with prefix appended with .txt and stores inference information for
#' effects of interest in a tabulated format depending on selected options. The
#' prefix will also be used for other output files such as visualization plots
#' such as matrix plot, and saved R data in binary mode. The .RData can be used
#' for post hoc processing such as customized processing and plotting. Remove
#' the .RData file to save disk space once you deem such a file is no longer
#' useful.
#' @param chains Specify the number of Markov chains. Make sure there are enough
#' processors available on the computer. Most of the time 4 cores are good
#' enough. However, a larger number of chains (e.g., 8, 12) may help achieve
#' higher accuracy for posterior distribution. Choose 1 for a single-processor
#' computer, which is only practical only for simple models.
#' @param iterations Specify the number of iterations per Markov chain. Choose
#' 1000 (default) for simple models (e.g., one or no explanatory variables). If
#' convergence problem occurs as indicated by Rhat being great than 1.1,
#' increase the number of iterations (e.g., 2000) for complex models, which
#' will lengthen the runtime. Unfortunately there is no way to predict the
#' optimum iterations ahead of time.
#' @param model This option specifies the effects associated with explanatory
#' variables. By default (without user input) the model is specified as 1
#' (Intercept). Currently only between-subjects factors (e.g., sex, patients
#' vs. controls) and quantitative variables (e.g., age) are allowed. When no
#' between-subject factors are present, simply put 1 (default) for FORMULA. The
#' expression FORMULA with more than one variable has to be surrounded within
#' (single or double) quotes (e.g., '1+sex', '1+sex+age'. Variable names in the
#' formula should be consistent with the ones used in the header of data table.
#' A+B represents the additive effects of A and B, A:B is the interaction
#' between A and B, and A*B = A+B+A:B. Subject as a variable should not occur
#' in the model specification here.
#' @param MD This option indicates that there are missing data in the input.
#' With n regions, at least n(n-1)/2 values are assumed from each subject in
#' the input with no missing data (default). When missing data are present,
#' invoke this option so that the program will handle it properly.
#' @param r2z This option performs Fisher transformation on the response
#' variable (column Y) if it is correlation coefficient. Do not invoke the
#' option if the transformation has already been applied or the variable is not
#' correlation coefficient.
#' @param cVars Identify categorical (qualitive) variables (or factors) with this
#' option. The list with more than one variable has to be separated with comma
#' (,) without any other characters such as spaces and should be surrounded
#' within (single or double) quotes. For example, -cVars "sex,site"
#' @param qVars Identify quantitative variables (or covariates) with this option.
#' The list with more than one variable has to be separated with comma (,)
#' without any other characters such as spaces and should be surrounded within
#' (single or double) quotes. For example, -qVars "Age,IQ"
#' @param stdz Identify quantitative variables (or covariates) to be
#' standardized. To obtain meaningful and interpretable results and to achieve
#' better convergence of Markov chains with reasonable iterations, it is
#' recommended that all quantitative variables be standardized except for the
#' response variable and indicator variables that code for factors. For
#' example, -stdz "Age,IQ". If the mean of a quantitative variables varies
#' substantially between groups, it may make sense to standardize the variable
#' within each group before plugging the values into the data table. Currently
#' MBA does not offer the option to perform within-group standardization.
#' @param EOI Identify effects of interest in the output by specifying the
#' variable names separated with comma (,). For example, -EOI "sex,age". By
#' default the Intercept is considered to be an effect of interest. Currently
#' only variables, not their interactions, can be directly requested for
#' output. However, most interaction effects can be obtained by either properly
#' coding the variables (see example 3) or post processing.
#' @param qContr Identify comparisons of interest between quantitative variables
#' in the output separated with comma (,). It only allows for pair-wise
#' comparisons between two quantitative variables. For example, -qContr "age
#' vs IQ, age vs weight, IQ vs weight", where V1, V2, and V3 are three
#' quantitative variables and three comparisons, V1 - V2, V1 - V3 and V2 - V3
#' will be provided in the output. Make sure that such comparisons are
#' meaningful (e.g., with the same scale and unit. This can be used to
#' formulate comparisons among factor levels if the user quantitatively codes
#' the factor levels.
#' @param Y This is used to specify the column name that is designated as as the
#' response/outcome variable. The default (when this option is not invoked) is
#' 'Y'.
#' @param Subj This is used to specify the column name that is designated as as
#' the measuring unit variable (usually subject). The default (when this option
#' is not invoked) is 'Subj'.
#' @param ROI1 This var_name is used to specify the column name that is
#' designated as as the region variable for the first set of each region pair.
#' The default (when this option is not invoked) is 'ROI1'.
#' @param ROI2 This var_name is used to specify the column name that is
#' designated as as the region variable for the second set of each region pair.
#' The default (when this option is not invoked) is 'ROI2'.
#' @param do_not_fit_model Do not fit the model
#' @param verb Specify verbosity
#' @param dbgArgs No longer used
#' @note Regarding the input data: \enumerate{ \item There should have at least
#' four columns in the table. These minimum four columns can be in any order but
#' with fixed and reserved with labels: 'Subj', 'ROI1', 'ROI2', and 'Y'. The two
#' columns 'ROI1' and 'ROI2' are meant to code the two regions that are
#' associated with each value under the column Y, and they do not connotate any
#' indication of directionality other than you may want to keep track of a
#' consistent order, for example, in the correlation matrix. More columns can be
#' added in the table for explanatory variables (e.g., groups, age, site) if
#' applicable. Only subject-level (or between-subjects) explanatory variables are
#' allowed at the moment. The columns of 'Subj', 'ROI1' and 'ROI2' code each
#' subject and the two regions associated with each region pair, and these labels
#' that can be any identifiable characters including numbers. The column 'Y' can
#' be correlation value, Fisher-transformed correlation value, or white-matter
#' property between grey-matter regions (e.g., mean diffusivity, fractional
#' anisotropy, radial diffusivity and axial diffusivity). \item Each row is
#' associated with one and only one 'Y' value, which is the response variable in
#' the table of long format (cf. wide format) as defined in R. In the case of
#' correlation matrix or white-matter property matrix, provide only half of the
#' off-diagonals. With n regions, there should have at least n(n-1)/2 rows per
#' subject, assuming no missing data. \item It is fine to have variables (or
#' columns) in the table that are not used in the current analysis. }
#' @references If you want to cite the approach for MBA, consider the following:
#' Chen, G., Bürkner, P.-C., Taylor, P.A., Li, Z., Yin, L., Glen, D.R.,
#' Kinnison, J., Cox, R.W., Pessoa, L., 2019. An Integrative Approach to
#' Matrix-Based Analyses in Neuroimaging. Human Brain Mapping. In press.
#' https://doi.org/10.1101/459545
#'
#' @section Acknowledgement: Thanks to Zhihao Li for motivating me to start the
#' MBA work, and to Paul-Christian Bürkner and the Stan/R communities for
#' their strong support.
#' @author \href{https://afni.nimh.nih.gov/gangchen_homepage}{Gang Chen} \email{gangchen@@mail.nih.gov}
#' @seealso \code{\link{MBA}}
#' @export
#' @import brms
#' @example vignettes/MBA-demo.Rmd
#' @section CLI info:
#' See \code{\link{create_parser_from_function}} for details on this section.
#' \preformatted{
#' ~dest, ~metavar, ~type, ~nargs,
#' "dataTable", "PATH or DATA_AS_STRING", "character", 1,
#' "prefix", "PREFIX", "character", 1,
#' "chains", "N", "integer", 1,
#' "iterations", "N", "integer", 1,
#' "model", "FORMULA", "character", 1,
#' "MD", " ", "logical", 0,
#' "r2z", " ", "logical", 0,
#' "cVars", "CSV_STRING", "character", 1,
#' "qVars", "CSV_STRING", "character", 1,
#' "stdz", "CSV_STRING", "character", 1,
#' "EOI", "CSV_STRING", "character", 1,
#' "qContr", "CSV_STRING", "character", 1,
#' "Y", "STRING", "character", 1,
#' "Subj", "STRING", "character", 1,
#' "ROI1", "STRING", "character", 1,
#' "ROI2", "STRING", "character", 1,
#' "do_not_fit_model", " ", "logical", 0,
#' "verb", "path", "character", 1,
#' "dbgArgs", "path", "character", 1,
#' }
MBA <- function(dataTable, prefix = "result", chains = 4, iterations = 1000, model = 1, MD = FALSE,
r2z = FALSE, cVars = NULL, qVars = "Intercept", stdz = NULL, EOI = "Intercept", qContr = NULL,
Y = "Y", Subj = "Subj", ROI1 = "ROI1", ROI2 = "ROI2", do_not_fit_model = FALSE, verb = 0, dbgArgs = FALSE) {
dpath <- dataTable
outFN <- pprefix.AFNI.name(prefix)
EOIq <- get_eoiq(qVars, EOI)
EOIc <- get_eioc(cVars, EOI)
dataTable <- setup_dataTable(
dpath, model, MD, r2z, cVars, qVars, stdz,
qContr, Y, Subj, ROI1, ROI2
)
log_setup_info(dataTable, outFN,ROI1,ROI2)
# deviation coding: -1/0/1 - the intercept is associated with the mean across the levels of the factor
# each coding variable corresponds to the level relative to the mean: alphabetically last level is
# is baseline or reference level
options(contrasts = c("contr.sum", "contr.poly"))
options(mc.cores = parallel::detectCores())
# combine the levels between the two region lists: NO! It seems to mess up the modeling wih brm
# levels(dataTable[ROI1][[1]]) <- union(levels(dataTable[ROI1][[1]]), levels(dataTable[ROI2][[1]]))
# levels(dataTable[ROI2][[1]]) <- union(levels(dataTable[ROI1][[1]]), levels(dataTable[ROI2][[1]]))
# Start the clock!
ptm <- proc.time()
##################### MCMC ####################
if (!do_not_fit_model) {
fm <- run_mba(dataTable, model, chains, iterations)
post_process(fm, outFN, iterations, chains, EOIq, EOIc, qContr, ptm, ROI1,ROI2)
} else {
fm <- NA
}
fm
}
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.