#' @title Compute Correlations for Data Frame
#'
#' @description Computes correlation-type analysis on large data frames with mixed column types, including integer, numeric, factor, and character. Character columns are treated as categorical variables.\cr
# This function supports parallel processing, allowing faster computations on large datasets. It ensures that different column types are handled appropriately without requiring manual adjustments.\cr
# The method is designed to work efficiently with mixed data, providing a flexible and fast way to analyze relationships between numerical and categorical variables.
#'
#' @name corrp
#'
#' @section Pair Types:
#'
#' **Numeric pairs (integer/numeric):**
#'
#' - **Pearson Correlation Coefficient:** A widely used measure of the strength and direction of linear relationships. Implemented using \code{\link[stats]{cor}}. For more details, see \url{https://doi.org/10.1098/rspl.1895.0041}. The value lies between -1 and 1.\cr
#' - **Distance Correlation:** Based on the idea of expanding covariance to distances, it measures both linear and nonlinear associations between variables. Implemented using \code{\link[energy]{dcorT.test}}. For more details, see \url{https://doi.org/10.1214/009053607000000505}. The value lies between 0 and 1.\cr
#' - **Maximal Information Coefficient (MIC):** An information-based nonparametric method that can detect both linear and non-linear relationships between variables. Implemented using \code{\link[minerva]{mine}}. For more details, see \url{https://doi.org/10.1126/science.1205438}. The value lies between 0 and 1.\cr
#' - **Predictive Power Score (PPS):** A metric used to assess predictive relations between variables. Implemented using \code{\link[ppsr]{score}}. For more details, see \url{https://zenodo.org/record/4091345}. The value lies between 0 and 1.\cr\cr
#'
#' **Numeric and categorical pairs (integer/numeric - factor/categorical):**
#'
#' - **Square Root of R² Coefficient:** From linear regression of the numeric variable over the categorical variable. Implemented using \code{\link[stats]{lm}}. For more details, see \url{https://doi.org/10.4324/9780203774441}. The value lies between 0 and 1.\cr
#' - **Predictive Power Score (PPS):** A metric used to assess predictive relations between numeric and categorical variables. Implemented using \code{\link[ppsr]{score}}. For more details, see \url{https://zenodo.org/record/4091345}. The value lies between 0 and 1.\cr\cr
#'
#' **Categorical pairs (factor/categorical):**
#'
#' - **Cramér's V:** A measure of association between nominal variables. Computed based on a chi-squared test and implemented using \code{\link[lsr]{cramersV}}. For more details, see \url{https://doi.org/10.1515/9781400883868}. The value lies between 0 and 1.\cr
#' - **Uncertainty Coefficient:** A measure of nominal association between two variables. Implemented using \code{\link[DescTools]{UncertCoef}}. For more details, see \url{https://doi.org/10.1016/j.jbi.2010.02.001}. The value lies between 0 and 1.\cr
#' - **Predictive Power Score (PPS):** A metric used to assess predictive relations between categorical variables. Implemented using \code{\link[ppsr]{score}}. For more details, see \url{https://zenodo.org/record/4091345}. The value lies between 0 and 1.\cr
#'
#' @return \[\code{clist}\]\cr
#' A `clist` which is a list that has two tables: `data` and `index`.
#'
#' - **data**: A table containing all the statistical results. The columns of this table are as follows:
#'
#' - `infer`: The method or metric used to assess the relationship between the variables (e.g., Maximal Information Coefficient or Predictive Power Score).
#' - `infer.value`: The value or score obtained from the specified inference method, representing the strength or quality of the relationship between the variables.
#' - `stat`: The statistical test or measure associated with the inference method (e.g., P-value or F1_weighted).
#' - `stat.value`: The numerical value corresponding to the statistical test or measure, providing additional context about the inference (e.g., significance or performance score).
#' - `isig`: A logical value indicating whether the statistical result is significant (`TRUE`) or not, based on predefined criteria (e.g., threshold for P-value).
#' - `msg`: A message or error related to the inference process.
#' - `varx`: The name of the first variable in the analysis (independent variable or feature).
#' - `vary`: The name of the second variable in the analysis (dependent/target variable).
#'
#'
#'
#' - **index**: A table that contains the pairs of indices used in each inference of the `data` table.
#'
#'
#' All statistical tests are controlled by the confidence interval of p.value parameter. If the statistical tests do not obtain a significance greater/less than p.value the value of variable `isig` will be `FALSE`.\cr
#' If any errors occur during operations the association measure (`infer.value`) will be `NA`.\cr
#' The result `data` and `index` will have \eqn{N^2} rows, where N is the number of variables of the input data.
#' By default there is no statistical significance test for the PPS algorithm. In this case `isig` is NA, you can enable it by setting `ptest = TRUE` in `pps.args`.\cr
#' All the `*.args` can modify the parameters (`p.value`, `comp`, `alternative`, `num.s`, `rk`, `ptest`) for the respective method on it's prefix.
#'
#' @param df \[\code{data.frame}\]\cr input data frame.
#' @param parallel \[\code{logical(1)}\]\cr If it's TRUE run the operations in parallel backend.
#' @param n.cores \[\code{numeric(1)}\]\cr The number of cores to use for parallel execution.
#' @param p.value \[\code{logical(1)}\]\cr
#' P-value probability of obtaining the observed results of a test,
#' assuming that the null hypothesis is correct. By default p.value=0.05 (Cutoff value for p-value.).
#' @param comp \[\code{character(1)}\]\cr The parameter \code{p.value} must be greater
#' or less than those estimated in tests and correlations.
#' @param alternative \[\code{character(1)}\]\cr a character string specifying the alternative hypothesis for
#' the correlation inference. It must be one of "two.sided" (default), "greater" or "less".
#' You can specify just the initial letter.
#' @param verbose \[\code{logical(1)}\]\cr Activate verbose mode.
#' @param num.s \[\code{numeric(1)}\]\cr Used in permutation test. The number of samples with
#' replacement created with y numeric vector.
#' @param rk \[\code{logical(1)}\]\cr Used in permutation test.
#' if its TRUE transform x, y numeric vectors with samples ranks.
#' @param cor.nn \[\code{character(1)}\]\cr
#' Choose correlation type to be used in integer/numeric pair inference.
#' The options are `pearson: Pearson Correlation`,`mic: Maximal Information Coefficient`,
#' `dcor: Distance Correlation`,`pps: Predictive Power Score`.Default is `Pearson Correlation`.
#' @param cor.nc \[\code{character(1)}\]\cr
#' Choose correlation type to be used in integer/numeric - factor/categorical pair inference.
#' The option are `lm: Linear Model`,`pps: Predictive Power Score`. Default is `Linear Model`.
#' @param cor.cc \[\code{character(1)}\]\cr
#' Choose correlation type to be used in factor/categorical pair inference.
#' The option are `cramersV: Cramer's V`,`uncoef: Uncertainty coefficient`,
#' `pps: Predictive Power Score`. Default is `Cramer's V`.
#' @param lm.args \[\code{list}\]\cr additional parameters for linear model to be passed to \code{\link[stats]{lm}}.
#' @param pearson.args \[\code{list}\]\cr additional parameters for Pearson correlation to be passed to \code{\link[stats]{cor.test}}.
#' @param dcor.args \[\code{list}\]\cr additional parameters for the distance correlation to be passed to \code{\link[corrp]{dcor_t_test}}.
#' @param mic.args \[\code{list}\]\cr additional parameters for the maximal information coefficient to be passed to \code{\link[minerva]{mine}}.
#' @param pps.args \[\code{list}\]\cr additional parameters for the predictive power score to be passed to \code{\link[ppsr]{score}}.
#' @param uncoef.args \[\code{list}\]\cr additional parameters for the uncertainty coefficient to be passed to \code{\link[DescTools]{UncertCoef}}.
#' @param cramersV.args \[\code{list}\]\cr additional parameters for the Cramer's V to be passed to \code{\link[lsr]{cramersV}}.
#'
#' @author Igor D.S. Siciliani, Paulo H. dos Santos
#'
#' @keywords correlation, predictive power score, linear model, distance correlation, mic, pearson, cramér's V
#'
#' @references
#' KS Srikanth, sidekicks, cor2, 2020. URL: \url{https://github.com/talegari/sidekicks/}.
#' Paul van der Laken, ppsr, 2021. URL: \url{https://github.com/paulvanderlaken/ppsr/}.
#'
#' @examples
#' # Usage with default settings
#' iris_c <- corrp(iris)
#' iris_m <- corr_matrix(iris_c, isig = FALSE) # You can then make correlation matrix
#' if (require("corrplot")) {
#' corrplot(iris_m) # You can visualize the matrix using corrplot
#' }
#'
#' # Using PPS for both numeric-numeric and numeric-categorical pairs
#' iris_c1 <- corrp(iris, cor.nn = "pps", cor.nc = "pps")
#'
#' # Using Distance Correlation for numeric-numeric and Predictive Power Score for numeric-categorical
#' iris_c2 <- corrp(iris, cor.nn = "dcor", cor.nc = "pps", dcor.args = list(method = "auto"))
#'
#' # Using Maximal Information Coefficient (MIC) for numeric-numeric
#' # and Uncertainty Coefficient for categorical-categorical
#' iris_c3 <- corrp(iris, cor.nn = "mic", cor.cc = "uncoef", mic.args = list(alpha = 0.6))
#'
#' # Using PPS for all pair types
#' iris_c4 <- corrp(iris, cor.nn = "pps", cor.nc = "pps", cor.cc = "pps")
#'
#' # Using Distance Correlation for numeric-numeric, Predictive Power Score for numeric-categorical,
#' # and Uncertainty Coefficient for categorical-categorical
#' iris_c5 <- corrp(
#' iris,
#' cor.nn = "dcor", cor.nc = "pps", cor.cc = "uncoef",
#' dcor.args = list(method = "auto")
#' )
#' @export
corrp <- function(df,
parallel = TRUE,
n.cores = 1,
p.value = 0.05,
verbose = TRUE,
num.s = 250,
rk = FALSE,
comp = c("greater", "less"),
alternative = c("greater", "less", "two.sided"),
cor.nn = c("pearson", "mic", "dcor", "pps"),
cor.nc = c("lm", "pps"),
cor.cc = c("cramersV", "uncoef", "pps"),
lm.args = list(),
pearson.args = list(),
dcor.args = list(),
mic.args = list(),
pps.args = list(ptest = FALSE),
cramersV.args = list(),
uncoef.args = list()) {
assert_required_argument(df, "The 'df' argument must be a data.frame containing the data to analyze.")
alternative <- match.arg(alternative)
cor.nn <- match.arg(cor.nn)
cor.nc <- match.arg(cor.nc)
cor.cc <- match.arg(cor.cc)
comp <- match.arg(comp)
comp <- substr(comp, 1, 1)
checkmate::assertDataFrame(df)
checkmate::assert_character(comp, len = 1, pattern = "l|g")
alternative <- substr(alternative, 1, 1)
checkmate::assert_character(alternative, len = 1, pattern = "t|l|g")
checkmate::assert_logical(verbose, len = 1)
checkmate::assert_logical(parallel, len = 1)
checkmate::assertNumber(p.value, upper = 1, lower = 0)
checkmate::assertNumber(n.cores)
checkmate::assert_list(lm.args)
checkmate::assert_list(pearson.args)
checkmate::assert_list(dcor.args)
checkmate::assert_list(mic.args)
checkmate::assert_list(pps.args)
checkmate::assert_list(cramersV.args)
checkmate::assert_list(uncoef.args)
if (!all(vapply(df, class, character(1)) %in% c("integer", "numeric", "factor", "character"))) {
stop("All columns must be of type integer, numeric, factor, or character.")
}
cnames <- colnames(df)
index.grid <- expand.grid("i" = seq(1, NCOL(df)), "j" = seq(1, NCOL(df)), stringsAsFactors = FALSE)
# Parallel corr_fun
if (parallel) {
cluster <- parallel::makeCluster(n.cores)
on.exit(parallel::stopCluster(cluster))
parallel::clusterExport(
cl = cluster,
c(
"df", "p.value", "verbose", "alternative", "comp",
"cor.nn", "cor.nc", "cor.cc", "num.s", "rk", "lm.args",
"pearson.args", "cramersV.args", "dcor.args",
"pps.args", "mic.args", "uncoef.args"
),
envir = environment()
)
corr <- parallel::parLapplyLB(
cluster, seq_len(NROW(index.grid)),
function(k) {
ny <- cnames[index.grid[["i"]][k]]
nx <- cnames[index.grid[["j"]][k]]
corr_fun(df,
ny = ny,
nx = nx,
p.value = p.value,
verbose = verbose,
alternative = alternative,
comp = comp,
cor.nn = cor.nn,
cor.nc = cor.nc,
cor.cc = cor.cc,
num.s = num.s,
rk = rk,
lm.args = lm.args,
pearson.args = pearson.args,
cramersV.args = cramersV.args,
dcor.args = dcor.args,
pps.args = pps.args,
mic.args = mic.args,
uncoef.args = uncoef.args
)
}
)
} else {
# Sequential corr_fun
corr <- lapply(
seq_len(NROW(index.grid)),
function(k) {
ny <- cnames[index.grid[["i"]][k]]
nx <- cnames[index.grid[["j"]][k]]
corr_fun(df,
ny = ny,
nx = nx,
p.value = p.value,
verbose = verbose,
alternative = alternative,
comp = comp,
cor.nn = cor.nn,
cor.nc = cor.nc,
cor.cc = cor.cc,
num.s = num.s,
rk = rk,
lm.args = lm.args,
pearson.args = pearson.args,
cramersV.args = cramersV.args,
dcor.args = dcor.args,
pps.args = pps.args,
mic.args = mic.args,
uncoef.args = uncoef.args
)
}
)
}
corr <- lapply(corr, .null.to.na)
corr <- do.call(rbind.data.frame, corr)
corrp.list <- list(data = corr, index = index.grid)
return(structure(corrp.list, class = c("clist", "list")))
}
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.