pw.assoc | R Documentation |
This function computes several association and Proportional Reduction in Error (PRE) measures between a nominal categorical response variable and each of the other available predictors (which are also categorical variables).
pw.assoc(formula, data, weights=NULL, out.df=FALSE)
formula |
A formula of the type |
data |
The data frame which contains the variables called by |
weights |
The name of the variable in |
out.df |
Logical. If |
This function computes some association and PRE measures, together with AIC and BIC, for each response-predictor pair that can be formed from the argument formula. In particular, a two-way contingency table X \times Y is built for each available X variable (X in rows and Y in columns); then the following measures are considered.
Cramer's V:
V=\sqrt{\frac{\chi^2}{n \times min\left[I-1,J-1\right]} }
n is the sample size, I is the number of rows (categories of X) and J is the number of columns (categories of Y). Cramer's V ranges from 0 to 1.
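As an illustration of the formula above, the following base R sketch computes Cramer's V by hand on a small made-up table. The table `tab` and the helper name `cramers_v` are hypothetical and are not part of the package; the sketch assumes no empty rows or columns.

```r
# Hypothetical toy table: X in rows, Y in columns
tab <- matrix(c(30, 10,
                10, 50), nrow = 2, byrow = TRUE)

cramers_v <- function(tab) {
  n <- sum(tab)
  # counts expected under independence
  expected <- outer(rowSums(tab), colSums(tab)) / n
  # Pearson chi-square without continuity correction
  chi2 <- sum((tab - expected)^2 / expected)
  sqrt(chi2 / (n * min(nrow(tab) - 1, ncol(tab) - 1)))
}

v <- cramers_v(tab)
v  # 7/12, about 0.583
```

For a 2 x 2 table V coincides with the absolute value of the phi coefficient, which is why the result has a simple closed form here.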
Bias-corrected Cramer's V (V_c) proposed by Bergsma (2013).
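This page does not spell out the correction. The sketch below assumes the form commonly attributed to Bergsma (2013): \phi^2 = \chi^2/n is reduced by (I-1)(J-1)/(n-1) and truncated at 0, and I and J are replaced by I - (I-1)^2/(n-1) and J - (J-1)^2/(n-1). The helper name `bias_corrected_v` and the toy tables are hypothetical, and whether pw.assoc implements the correction exactly this way is an assumption.

```r
bias_corrected_v <- function(tab) {
  n <- sum(tab); I <- nrow(tab); J <- ncol(tab)
  expected <- outer(rowSums(tab), colSums(tab)) / n
  phi2 <- sum((tab - expected)^2 / expected) / n
  # assumed bias correction of phi^2, truncated at zero
  phi2_c <- max(0, phi2 - (I - 1) * (J - 1) / (n - 1))
  # assumed corrected numbers of rows and columns
  I_c <- I - (I - 1)^2 / (n - 1)
  J_c <- J - (J - 1)^2 / (n - 1)
  sqrt(phi2_c / min(I_c - 1, J_c - 1))
}

# Hypothetical toy tables: one with association, one exactly independent
tab <- matrix(c(30, 10,
                10, 50), nrow = 2, byrow = TRUE)
tab_indep <- outer(c(40, 60), c(30, 70)) / 100

vc <- bias_corrected_v(tab)
vc_indep <- bias_corrected_v(tab_indep)
```

Under exact independence the truncation at zero makes the corrected value 0, whereas in finite samples the uncorrected V tends to be positive.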
Mutual information:
I(X;Y) = \sum_{i,j} p_{ij} \, log \left( \frac{p_{ij}}{p_{i+} p_{+j}} \right)
It equals 0 in case of independence but has an infinite upper bound, i.e. 0 \leq I(X;Y) < \infty. Here p_{ij}=n_{ij}/n.
A normalized version of I(X;Y), ranging from 0 (independence) to 1 and not affected by the number of categories (I and J):
I(X;Y)^* = \frac{I(X;Y)}{min(H_X, H_Y) }
where H_X and H_Y are the entropies of the variables X and Y, respectively.
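The two mutual information formulas can be checked by hand with a short base R sketch. The helper name `mutual_info` and the toy tables are hypothetical; natural logarithms are assumed, and empty cells are skipped using the convention 0 log 0 = 0.

```r
mutual_info <- function(tab) {
  p  <- tab / sum(tab)        # joint proportions p_ij
  pr <- rowSums(p)            # p_i+
  pc <- colSums(p)            # p_+j
  indep <- outer(pr, pc)      # p_i+ * p_+j
  nz <- p > 0                 # skip empty cells (0 * log 0 taken as 0)
  mi <- sum(p[nz] * log(p[nz] / indep[nz]))
  entropy <- function(q) { q <- q[q > 0]; -sum(q * log(q)) }
  c(mi = mi, norm.mi = mi / min(entropy(pr), entropy(pc)))
}

# Hypothetical toy tables
tab       <- matrix(c(30, 10, 10, 50), nrow = 2, byrow = TRUE)
tab_indep <- outer(c(40, 60), c(30, 70)) / 100   # exact independence
tab_diag  <- diag(c(40, 60))                     # X determines Y

res       <- mutual_info(tab)
res_indep <- mutual_info(tab_indep)
res_diag  <- mutual_info(tab_diag)
```

The independent table gives a mutual information of 0 (up to rounding) and the diagonal table gives a normalized value of 1, matching the stated range.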
Goodman-Kruskal \lambda(Y|X) (i.e. response conditional on the given predictor):
\lambda(Y|X) = \frac{\sum_{i=1}^I max_{j}(p_{ij}) - max_{j}(p_{+j})}{1-max_{j}(p_{+j})}
It ranges from 0 to 1, and denotes how much the knowledge of the row variable X (predictor) helps in reducing the prediction error of the values of the column variable Y (response).
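A minimal base R sketch of the \lambda(Y|X) formula follows, with X in rows and Y in columns as above. The helper name `gk_lambda` and the toy table are hypothetical.

```r
gk_lambda <- function(tab) {
  p <- tab / sum(tab)
  best_marginal <- max(colSums(p))        # max_j p_+j
  best_by_row   <- sum(apply(p, 1, max))  # sum_i max_j p_ij
  (best_by_row - best_marginal) / (1 - best_marginal)
}

# Hypothetical toy table: X in rows, Y in columns
tab <- matrix(c(30, 10,
                10, 50), nrow = 2, byrow = TRUE)

lambda <- gk_lambda(tab)
lambda  # (0.8 - 0.6) / (1 - 0.6) = 0.5
```

In this example, predicting the modal category of Y within each row of X halves the error made when always predicting the overall modal category of Y.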
Goodman-Kruskal \tau(Y|X):
\tau(Y|X) = \frac{ \sum_{i=1}^I \sum_{j=1}^J p^2_{ij}/p_{i+} - \sum_{j=1}^J p_{+j}^2}{1 - \sum_{j=1}^J p_{+j}^2}
It takes values in the interval [0,1] and has the same PRE interpretation as \lambda.
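The \tau(Y|X) formula can be sketched in base R in the same way. The helper name `gk_tau` and the toy table are hypothetical; the sketch assumes no empty rows, since it divides by p_{i+}.

```r
gk_tau <- function(tab) {
  p  <- tab / sum(tab)
  pr <- rowSums(p)                 # p_i+
  pc <- colSums(p)                 # p_+j
  # p^2 / pr divides each row i by p_i+ (column-wise recycling of pr)
  within <- sum(p^2 / pr)
  (within - sum(pc^2)) / (1 - sum(pc^2))
}

# Hypothetical toy table: X in rows, Y in columns
tab <- matrix(c(30, 10,
                10, 50), nrow = 2, byrow = TRUE)

tau <- gk_tau(tab)
tau  # 49/144, about 0.340
```

For a 2 x 2 table \tau equals the squared phi coefficient, so here it is the square of Cramer's V for the same table.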
Theil's uncertainty coefficient:
U(Y|X) = \frac{\sum_{i=1}^I \sum_{j=1}^J p_{ij} log(p_{ij}/p_{i+}) - \sum_{j=1}^J p_{+j} log p_{+j}}{- \sum_{j=1}^J p_{+j} log p_{+j}}
It takes values in the interval [0,1] and measures the reduction of uncertainty in the column variable Y due to knowing the row variable X. Note that the numerator of U(Y|X) is the mutual information I(X;Y).
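A base R sketch of U(Y|X), written directly from the formula above, is given below. The helper name `theil_u` and the toy tables are hypothetical; natural logarithms are assumed and empty cells are skipped.

```r
theil_u <- function(tab) {
  p    <- tab / sum(tab)
  pc   <- colSums(p)               # p_+j
  cond <- p / rowSums(p)           # p_ij / p_i+
  nz   <- p > 0                    # skip empty cells
  h_y  <- -sum(pc[pc > 0] * log(pc[pc > 0]))   # entropy of Y
  # numerator: sum p_ij log(p_ij / p_i+) - sum p_+j log p_+j
  (sum(p[nz] * log(cond[nz])) + h_y) / h_y
}

# Hypothetical toy tables
tab       <- matrix(c(30, 10, 10, 50), nrow = 2, byrow = TRUE)
tab_indep <- outer(c(40, 60), c(30, 70)) / 100   # exact independence
tab_diag  <- diag(c(40, 60))                     # X determines Y

u       <- theil_u(tab)
u_indep <- theil_u(tab_indep)
u_diag  <- theil_u(tab_diag)
```

The two extreme tables reproduce the bounds of the interval: 0 under independence and 1 when X determines Y.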
It is worth noting that \lambda, \tau and U can be viewed as measures of the proportional reduction of the variance of the Y variable when passing from its marginal distribution to its conditional distribution given the predictor X, derived from the general expression (cf. Agresti, 2002, p. 56):
\frac{V(Y) - E[V(Y|X)]}{V(Y)}
They differ in the way the variance is measured; in fact, there is no generally accepted definition of the variance for a categorical variable.
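The general expression can be made concrete with a short base R sketch that plugs three different "variance" functions into the same ratio. It assumes the usual choices: the classification error 1 - max_j p_j for \lambda, the Gini concentration 1 - \sum_j p_j^2 for \tau, and the entropy for U. The helper names and the toy table are hypothetical.

```r
# Hypothetical toy table: X in rows, Y in columns
tab  <- matrix(c(30, 10,
                 10, 50), nrow = 2, byrow = TRUE)
p    <- tab / sum(tab)
p_x  <- rowSums(p)      # marginal distribution of X
p_y  <- colSums(p)      # marginal distribution of Y
cond <- p / p_x         # row i holds the distribution of Y given X = i

# (V(Y) - E[V(Y|X)]) / V(Y) for a given measure of variation
pre <- function(variation) {
  v_marginal    <- variation(p_y)
  v_conditional <- sum(p_x * apply(cond, 1, variation))
  (v_marginal - v_conditional) / v_marginal
}

class_error <- function(q) 1 - max(q)     # assumed choice giving lambda
gini        <- function(q) 1 - sum(q^2)   # assumed choice giving tau
entropy     <- function(q) { q <- q[q > 0]; -sum(q * log(q)) }  # assumed choice giving U

lambda_pre <- pre(class_error)
tau_pre    <- pre(gini)
u_pre      <- pre(entropy)
```

On this table the first two values agree with the direct formulas for \lambda(Y|X) and \tau(Y|X), namely 0.5 and 49/144.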
Finally, AIC and BIC are calculated, as suggested in Sakamoto and Akaike (1977). In particular:
AIC(Y|X) = -2 \sum_{i,j} n_{ij} \, log \left( \frac{n_{ij}}{n_{i+}} \right) + 2I(J - 1)
BIC(Y|X) = -2 \sum_{i,j} n_{ij} \, log \left( \frac{n_{ij}}{n_{i+}} \right) +I(J-1) log(n)
where I(J-1) is the number of parameters (conditional probabilities) to estimate. Note that the R package catdap provides functions to identify the best subset of predictors based on AIC.
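The AIC and BIC expressions above can be reproduced by hand with a few lines of base R. The helper name `aic_bic` and the toy table are hypothetical; empty cells are skipped, and the sketch assumes no empty rows.

```r
aic_bic <- function(tab) {
  n      <- sum(tab)
  cond   <- tab / rowSums(tab)            # n_ij / n_i+
  nz     <- tab > 0                       # skip empty cells
  loglik <- sum(tab[nz] * log(cond[nz]))  # sum n_ij log(n_ij / n_i+)
  npar   <- nrow(tab) * (ncol(tab) - 1)   # I(J - 1)
  c(AIC  = -2 * loglik + 2 * npar,
    BIC  = -2 * loglik + npar * log(n),
    npar = npar)
}

# Hypothetical toy table: X in rows, Y in columns
tab <- matrix(c(30, 10,
                10, 50), nrow = 2, byrow = TRUE)

res <- aic_bic(tab)
```

The two criteria share the same log-likelihood term and differ only in the penalty, so BIC - AIC = I(J-1)(log(n) - 2).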
Please note that the missing values are excluded from the tables and therefore excluded from the estimation of the various measures.
When out.df=FALSE (default), a list object with the following components:
V |
A vector with the estimated Cramer's V for each response-predictor pair. |
bcV |
A vector with the estimated bias-corrected Cramer's V for each response-predictor pair. |
mi |
A vector with the estimated mutual information I(X;Y) for each response-predictor pair. |
norm.mi |
A vector with the normalized mutual information I(X;Y)* for each response-predictor pair. |
lambda |
A vector with the values of Goodman-Kruskal |
tau |
A vector with the values of Goodman-Kruskal |
U |
A vector with the values of Theil's uncertainty coefficient U(Y|X) for each response-predictor pair. |
AIC |
A vector with the values of AIC(Y|X) for each response-predictor pair. |
BIC |
A vector with the values of BIC(Y|X) for each response-predictor pair. |
npar |
A vector with the number of parameters (conditional probabilities) estimated to calculate AIC and BIC for each response-predictor pair. |
When out.df=TRUE the output will be a data.frame with a column for each measure.
Marcello D'Orazio mdo.statmatch@gmail.com
Agresti A (2002) Categorical Data Analysis. Second Edition. Wiley, New York.
Bergsma W (2013) A bias-correction for Cramer's V and Tschuprow's T. Journal of the Korean Statistical Society, 42, 323–328.
The Institute of Statistical Mathematics (2018). catdap: Categorical Data Analysis Program Package. R package version 1.3.4. https://CRAN.R-project.org/package=catdap
Sakamoto Y and Akaike H (1977) Analysis of Cross-Classified Data by AIC. Ann. Inst. Statist. Math., 30, 185–197.
data(quine, package="MASS") #loads quine from MASS
str(quine)
# Lrn is the response variable
pw.assoc(Lrn~Age+Sex+Eth, data=quine)
# usage of units' weights
quine$ww <- runif(nrow(quine), 1, 4) # randomly generated weights, 1 <= ww <= 4
pw.assoc(Lrn~Age+Sex+Eth, data=quine, weights="ww")