corr_fun | R Documentation |
Performs a correlation-type analysis on two columns of a given dataframe, which may be of different classes. The dataframe can contain columns of four types: integer, numeric, factor, and character. Character columns are treated as categorical variables.
corr_fun(
df,
nx,
ny,
p.value = 0.05,
verbose = TRUE,
num.s = 250,
rk = FALSE,
comp = c("greater", "less"),
alternative = c("greater", "less", "two.sided"),
cor.nn = c("pearson", "mic", "dcor", "pps"),
cor.nc = c("lm", "pps"),
cor.cc = c("cramersV", "uncoef", "pps"),
lm.args = list(),
pearson.args = list(),
dcor.args = list(),
mic.args = list(),
pps.args = list(ptest = FALSE),
cramersV.args = list(),
uncoef.args = list()
)
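df | The input dataframe.
nx | Name of the first column in the analysis (independent variable or feature).
ny | Name of the second column in the analysis (dependent/target variable).
p.value | Significance threshold for the statistical tests. Default: 0.05.
verbose | Default: TRUE.
num.s | Default: 250.
rk | Default: FALSE.
comp | One of "greater" or "less".
alternative | One of "greater", "less", or "two.sided".
cor.nn | Method for numeric pairs: one of "pearson", "mic", "dcor", or "pps".
cor.nc | Method for numeric and categorical pairs: one of "lm" or "pps".
cor.cc | Method for categorical pairs: one of "cramersV", "uncoef", or "pps".
lm.args | List of additional parameters for the lm method.
pearson.args | List of additional parameters for the pearson method.
dcor.args | List of additional parameters for the dcor method.
mic.args | List of additional parameters for the mic method.
pps.args | List of additional parameters for the pps method. Default: list(ptest = FALSE).
cramersV.args | List of additional parameters for the cramersV method.
uncoef.args | List of additional parameters for the uncoef method.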
[list] A list containing statistical and basic information, with 8 elements:
infer: The method or metric used to assess the relationship between the variables (e.g., Maximal Information Coefficient or Predictive Power Score).
infer.value: The value or score obtained from the specified inference method, representing the strength or quality of the relationship between the variables.
stat: The statistical test or measure associated with the inference method (e.g., P-value or F1_weighted).
stat.value: The numerical value corresponding to the statistical test or measure, providing additional context about the inference (e.g., significance or performance score).
isig: A logical value indicating whether the statistical result is significant (TRUE) or not (FALSE), based on predefined criteria (e.g., a threshold for the P-value).
msg: A message or error related to the inference process.
varx: The name of the first variable in the analysis (independent variable or feature).
vary: The name of the second variable in the analysis (dependent/target variable).
All statistical tests are controlled by the significance threshold given in the p.value parameter. If a statistical test does not obtain a significance greater/less than p.value, the value of isig will be FALSE.
If any errors occur during operations, the association measure (infer.value) will be NA.
The result data and index will have N^2 rows, where N is the number of variables of the input data.
By default there is no statistical significance test for the PPS algorithm, so in this case isig is NA. You can enable the test by setting ptest = TRUE in pps.args.
Each *.args list can modify the parameters (p.value, comp, alternative, num.s, rk, ptest) for the method named in its prefix.
Numeric pairs (integer/numeric):
Pearson Correlation Coefficient: A widely used measure of the strength and direction of linear relationships. Implemented using cor. For more details, see https://doi.org/10.1098/rspl.1895.0041. The value lies between -1 and 1.
Distance Correlation: Based on the idea of expanding covariance to distances, it measures both linear and nonlinear associations between variables. Implemented using dcorT.test. For more details, see https://doi.org/10.1214/009053607000000505. The value lies between 0 and 1.
Maximal Information Coefficient (MIC): An information-based nonparametric method that can detect both linear and nonlinear relationships between variables. Implemented using mine. For more details, see https://doi.org/10.1126/science.1205438. The value lies between 0 and 1.
Predictive Power Score (PPS): A metric used to assess predictive relations between variables. Implemented using score. For more details, see https://zenodo.org/record/4091345. The value lies between 0 and 1.
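As a rough illustration of the Pearson option, the following base R sketch computes the coefficient with cor and a matching significance test with cor.test on two numeric iris columns. The 0.05 threshold mirrors the default p.value; the variable names and the use of cor.test for the significance test are assumptions of this sketch, not a description of what corr_fun does internally.

```r
# Pearson correlation between two numeric columns of the built-in iris data.
x <- iris$Sepal.Length
y <- iris$Sepal.Width

# Association measure; always lies between -1 and 1.
r <- cor(x, y, method = "pearson")

# Two-sided significance test for the same pair of columns.
test <- cor.test(x, y, method = "pearson", alternative = "two.sided")

# Hypothetical significance flag: TRUE when the test p-value is below 0.05.
threshold <- 0.05
is_significant <- test$p.value < threshold

print(r)
print(test$p.value)
print(is_significant)
```

The other numeric-pair methods (distance correlation, MIC, PPS) rely on add-on packages and are not reproduced here.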
Numeric and categorical pairs (integer/numeric - factor/categorical):
Square Root of R² Coefficient: From linear regression of the numeric variable over the categorical variable. Implemented using lm. For more details, see https://doi.org/10.4324/9780203774441. The value lies between 0 and 1.
Predictive Power Score (PPS): A metric used to assess predictive relations between numeric and categorical variables. Implemented using score. For more details, see https://zenodo.org/record/4091345. The value lies between 0 and 1.
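The square root of R² described above can be sketched in base R by regressing a numeric column on a categorical one. The choice of iris columns and the use of the overall F-test p-value as the accompanying statistic are assumptions made for this illustration.

```r
# Regress a numeric column on a categorical column of the built-in iris data.
fit <- lm(Sepal.Length ~ Species, data = iris)
fit_summary <- summary(fit)

# Association measure: square root of R squared, between 0 and 1.
assoc <- sqrt(fit_summary$r.squared)

# P-value of the overall F-test, taken from the F statistic and its
# numerator and denominator degrees of freedom.
fstat <- fit_summary$fstatistic
p_val <- pf(fstat[["value"]], fstat[["numdf"]], fstat[["dendf"]],
            lower.tail = FALSE)

print(assoc)
print(p_val)
```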
Categorical pairs (factor/categorical):
Cramér's V: A measure of association between nominal variables. Computed based on a chi-squared test and implemented using cramersV. For more details, see https://doi.org/10.1515/9781400883868. The value lies between 0 and 1.
Uncertainty Coefficient: A measure of nominal association between two variables. Implemented using UncertCoef. For more details, see https://doi.org/10.1016/j.jbi.2010.02.001. The value lies between 0 and 1.
Predictive Power Score (PPS): A metric used to assess predictive relations between categorical variables. Implemented using score. For more details, see https://zenodo.org/record/4091345. The value lies between 0 and 1.
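To make the two nominal measures concrete, here is a minimal base R sketch that derives Cramér's V from the chi-squared statistic and a symmetric uncertainty coefficient from entropies. It uses the mtcars columns cyl and gear as stand-in categorical variables. The symmetric form of the uncertainty coefficient is an assumption, since the text above does not say which direction is used, and this hand-written version is not the cramersV or UncertCoef implementation.

```r
# Contingency table of two categorical variables from built-in data.
tab <- table(factor(mtcars$cyl), factor(mtcars$gear))
n <- sum(tab)

# Cramér's V: sqrt(chi^2 / (n * (min(rows, cols) - 1))).
# Warnings about small expected counts are suppressed for this toy table.
chi <- suppressWarnings(chisq.test(tab, correct = FALSE))
k <- min(nrow(tab), ncol(tab))
cramers_v <- sqrt(unname(chi$statistic) / (n * (k - 1)))

# Shannon entropy of a probability vector, ignoring empty cells.
entropy <- function(p) {
  p <- p[p > 0]
  -sum(p * log(p))
}

# Symmetric uncertainty coefficient: 2 * I(X; Y) / (H(X) + H(Y)).
p_xy <- tab / n
h_x <- entropy(rowSums(p_xy))
h_y <- entropy(colSums(p_xy))
mutual_info <- h_x + h_y - entropy(p_xy)
uncert_sym <- 2 * mutual_info / (h_x + h_y)

print(cramers_v)
print(uncert_sym)
```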
Igor D.S. Siciliani, Paulo H. dos Santos
KS Srikanth, sidekicks, cor2, 2020. URL https://github.com/talegari/sidekicks/.
Paul van der Laken, ppsr, 2021. URL https://github.com/paulvanderlaken/ppsr.
# Since both the `nx` and `ny` columns are numeric, the method is chosen by `cor.nn`.
corr_fun(iris, nx = "Sepal.Length", ny = "Sepal.Width", cor.nn = "dcor")
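A further sketch for a mixed pair, where one column is categorical and the other numeric, so the method is chosen by `cor.nc`. It assumes that corr_fun is provided by the corrp package, which this page does not state, and it skips the call when that package is not installed. The element names read from the result are the ones listed in the value description.

```r
# Run only when the (assumed) corrp package is available.
res <- NULL
if (requireNamespace("corrp", quietly = TRUE)) {
  # `Species` is a factor and `Sepal.Length` is numeric,
  # so the method is taken from `cor.nc`.
  res <- corrp::corr_fun(
    iris,
    nx = "Species",
    ny = "Sepal.Length",
    cor.nc = "lm"
  )

  # Inspect the documented elements of the returned list.
  print(res$infer)        # method used
  print(res$infer.value)  # association measure
  print(res$isig)         # significance flag
}
```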