pw.assoc: Pairwise measures between categorical variables

pw.assoc R Documentation

Pairwise measures between categorical variables


This function computes several association and Proportional Reduction in Error (PRE) measures between a nominal categorical response variable and each of the available predictors (which must also be categorical variables).


pw.assoc(formula, data, weights=NULL, out.df=FALSE)



formula: A formula of the type y~x1+x2, where y is the name of the categorical variable (a factor in R) that plays the role of the dependent variable, while x1 and x2 are the names of the predictors (both categorical variables). Numeric variables are not allowed; any numeric variable should be categorized (see function cut) before being passed to pw.assoc.
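As a minimal sketch of the categorization step mentioned above, the following uses base R's cut on a made-up data frame. The variable names and break points are illustrative assumptions, not part of the package.

```r
# Illustrative data (hypothetical): 'age' is numeric, so it must be
# turned into a factor before it can be used as a predictor.
df <- data.frame(
  y   = factor(c("no", "yes", "yes", "no", "yes", "no")),
  age = c(23, 37, 45, 61, 52, 30)
)

# cut() maps the numeric values to three classes; the break points
# are arbitrary and chosen only for this example.
df$age_cl <- cut(df$age, breaks = c(0, 35, 50, Inf),
                 labels = c("<=35", "36-50", ">50"))

# The categorized variable can now be cross-tabulated with the response.
table(df$age_cl, df$y)
```

The resulting factor age_cl could then appear on the right-hand side of the formula in place of age.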


data: The data frame containing the variables referenced in formula.


weights: The name of the variable in data that provides the units' weights. Weights are used to estimate frequencies (a cell frequency is estimated by summing the weights of the units that fall in the cell). Default is NULL (no weights available), in which case each unit counts as 1. When case weights are provided, they are scaled so that their sum equals n, the sample size (assumed to be nrow(data)).
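The weighting rule described above can be sketched in base R as follows. This illustrates the stated behaviour on made-up data; it is not the package's internal code, and all names are hypothetical.

```r
# Illustrative data (hypothetical) with a weight column 'w'.
df <- data.frame(
  x = factor(c("a", "a", "b", "b", "b", "a")),
  y = factor(c("u", "v", "u", "v", "v", "u")),
  w = c(2, 1, 3, 1.5, 2.5, 2)
)
n <- nrow(df)

# Rescale the weights so that they sum to the sample size n.
df$w_scaled <- df$w * n / sum(df$w)

# Each cell frequency is the sum of the scaled weights of its units.
tab_w <- xtabs(w_scaled ~ x + y, data = df)
tab_w
```

With unit weights (every w equal to 1) the same code reduces to an ordinary table of counts.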


out.df: Logical. If TRUE, the measures are returned in a data frame (one column per measure). Default is FALSE.


This function computes association measures, PRE measures, AIC and BIC for each response-predictor pair that can be formed from the argument formula. In particular, a two-way contingency table X x Y is built for each available X variable (X in rows and Y in columns); then the following measures are computed.

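Cramer's V:

V = sqrt( chi^2 / (n * min(I-1, J-1)) )

where chi^2 is the Pearson chi-square statistic of the table, n is the sample size, I is the number of rows (categories of X) and J is the number of columns (categories of Y). Cramer's V ranges from 0 to 1.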

Bias-corrected Cramer's V (V_c) proposed by Bergsma (2013).
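As a hedged illustration, the sketch below computes Cramer's V as sqrt(chi^2 / (n * min(I-1, J-1))) and the correction as given in Bergsma (2013), using base R and a made-up 2 x 3 table. It is not the package's own code, and the table values are purely illustrative.

```r
# Illustrative 2 x 3 table (hypothetical counts): X in rows, Y in columns.
tab <- matrix(c(20, 10,  5,
                15, 10, 40), nrow = 2, byrow = TRUE,
              dimnames = list(X = c("a", "b"), Y = c("u", "v", "w")))
n <- sum(tab); I <- nrow(tab); J <- ncol(tab)

# Pearson chi-square statistic (no continuity correction).
chi2 <- unname(chisq.test(tab, correct = FALSE)$statistic)

# Cramer's V.
V <- sqrt(chi2 / (n * min(I - 1, J - 1)))

# Bias correction as in Bergsma (2013): phi^2 is shrunk towards 0
# and the table dimensions are adjusted.
phi2  <- chi2 / n
phi2c <- max(0, phi2 - (I - 1) * (J - 1) / (n - 1))
Ic <- I - (I - 1)^2 / (n - 1)
Jc <- J - (J - 1)^2 / (n - 1)
Vc <- sqrt(phi2c / min(Ic - 1, Jc - 1))

c(V = V, V_c = Vc)
```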

Mutual information:

I(X;Y) = sum_ij p_ij * log( p_ij / (p_i+ * p_+j) )

where p_ij = n_ij/n. It equals 0 in case of independence and has no finite upper bound in general, i.e. 0 <= I(X;Y) < Inf.

A normalized version of I(X;Y), ranging from 0 (independence) to 1 and not affected by the number of categories (I and J):

I* = I(X;Y) / min(H_X, H_Y)

where H_X and H_Y are the entropies of the variables X and Y, respectively.
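The mutual information and its normalized version can be sketched in base R as below, on a made-up table. The treatment of empty cells (they contribute 0) is an assumption of this sketch, not something stated in the documentation.

```r
# Illustrative 2 x 3 table (hypothetical counts): X in rows, Y in columns.
tab <- matrix(c(20, 10,  5,
                15, 10, 40), nrow = 2, byrow = TRUE,
              dimnames = list(X = c("a", "b"), Y = c("u", "v", "w")))
p  <- tab / sum(tab)        # joint proportions p_ij
pr <- rowSums(p)            # row margins p_i+
pc <- colSums(p)            # column margins p_+j

# Mutual information; cells with p_ij = 0 are skipped (assumed to add 0).
indep <- outer(pr, pc)      # p_i+ * p_+j
nz <- p > 0
mi <- sum(p[nz] * log(p[nz] / indep[nz]))

# Entropy of a marginal distribution.
H <- function(q) -sum(q[q > 0] * log(q[q > 0]))

# Normalized mutual information: I(X;Y) / min(H_X, H_Y).
norm_mi <- mi / min(H(pr), H(pc))

c(mi = mi, norm_mi = norm_mi)
```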

Goodman-Kruskal lambda(Y|X) (i.e. response conditional on the given predictor):

lambda(Y|X) = (sum_i max_j(p_ij) - max_j(p_+j))/(1 - max_j(p_+j))

It ranges from 0 to 1, and denotes how much the knowledge of the row variable X (predictor) helps in reducing the prediction error of the values of the column variable Y (response).
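A minimal base R sketch of lambda(Y|X), following the formula above on a made-up table (the counts are illustrative only):

```r
# Illustrative 2 x 3 table (hypothetical counts): X in rows, Y in columns.
tab <- matrix(c(20, 10,  5,
                15, 10, 40), nrow = 2, byrow = TRUE,
              dimnames = list(X = c("a", "b"), Y = c("u", "v", "w")))
p <- tab / sum(tab)

# Largest joint proportion within each row of X, summed over rows.
sum_row_max <- sum(apply(p, 1, max))

# Largest marginal proportion of Y (the modal category).
max_col <- max(colSums(p))

# Proportional reduction in the error of predicting the mode of Y.
lambda <- (sum_row_max - max_col) / (1 - max_col)
lambda
```

Here the modal category of Y has proportion 0.45, while predicting the row-wise modes gives 0.60, so the prediction error drops from 0.55 to 0.40.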

Goodman-Kruskal tau(Y|X):

tau(Y|X) = (sum_ij p^2_ij / p_i+ - sum_j p^2_+j)/(1 - sum_j p^2_+j)

It takes values in the interval [0,1] and has the same PRE interpretation as lambda.
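The tau(Y|X) formula can be sketched in base R in the same way, again on a made-up table:

```r
# Illustrative 2 x 3 table (hypothetical counts): X in rows, Y in columns.
tab <- matrix(c(20, 10,  5,
                15, 10, 40), nrow = 2, byrow = TRUE,
              dimnames = list(X = c("a", "b"), Y = c("u", "v", "w")))
p  <- tab / sum(tab)
pr <- rowSums(p)            # p_i+
pc <- colSums(p)            # p_+j

# sum_ij p_ij^2 / p_i+ : each row of p^2 is divided by its row margin.
term1 <- sum(sweep(p^2, 1, pr, "/"))

# sum_j p_+j^2
term2 <- sum(pc^2)

tau <- (term1 - term2) / (1 - term2)
tau
```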

Theil's uncertainty coefficient:

U(Y|X) = (sum_ij p_ij log(p_ij/p_i+) - sum_j p_+j log p_+j) / (- sum_j p_+j log p_+j)

It takes values in the interval [0,1] and measures the reduction of uncertainty in the column variable Y due to knowing the row variable X. Note that the numerator of U(Y|X) is the mutual information I(X;Y).
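The following base R sketch computes U(Y|X) on a made-up table and checks numerically that its numerator coincides with the mutual information. Skipping empty cells is an assumption of the sketch.

```r
# Illustrative 2 x 3 table (hypothetical counts): X in rows, Y in columns.
tab <- matrix(c(20, 10,  5,
                15, 10, 40), nrow = 2, byrow = TRUE,
              dimnames = list(X = c("a", "b"), Y = c("u", "v", "w")))
p  <- tab / sum(tab)
pr <- rowSums(p)
pc <- colSums(p)
nz <- p > 0                 # empty cells are skipped (assumed to add 0)

# Conditional proportions p_ij / p_i+.
cond <- sweep(p, 1, pr, "/")

# Entropy of Y: - sum_j p_+j log p_+j.
HY <- -sum(pc[pc > 0] * log(pc[pc > 0]))

# Numerator: sum_ij p_ij log(p_ij / p_i+) - sum_j p_+j log p_+j.
num <- sum(p[nz] * log(cond[nz])) + HY
U <- num / HY

# Mutual information, computed directly for comparison.
indep <- outer(pr, pc)
mi <- sum(p[nz] * log(p[nz] / indep[nz]))

c(U = U, numerator = num, mi = mi)
```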

It is worth noting that lambda, tau and U can be viewed as measures of the proportional reduction of the variance of the Y variable when passing from its marginal distribution to its conditional distribution given the predictor X, derived from the general expression (cf. Agresti, 2002, p. 56):

(V(Y) - E[V(Y|X)])/V(Y)

They differ in the way variance is measured; in fact, there is no generally accepted definition of variance for a categorical variable.
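To illustrate the general expression, the sketch below assumes that "variance" is measured by the Gini index, V(Y) = 1 - sum_j p_+j^2, and checks on a made-up table that (V(Y) - E[V(Y|X)])/V(Y) then reproduces tau(Y|X). The choice of the Gini index is this example's assumption; with entropy in its place the same expression gives U(Y|X).

```r
# Illustrative 2 x 3 table (hypothetical counts): X in rows, Y in columns.
tab <- matrix(c(20, 10,  5,
                15, 10, 40), nrow = 2, byrow = TRUE,
              dimnames = list(X = c("a", "b"), Y = c("u", "v", "w")))
p  <- tab / sum(tab)
pr <- rowSums(p)
pc <- colSums(p)
cond <- sweep(p, 1, pr, "/")   # distribution of Y within each row of X

# Gini index used as a "variance" for a categorical distribution.
gini <- function(q) 1 - sum(q^2)

VY   <- gini(pc)                          # marginal variance V(Y)
EVYX <- sum(pr * apply(cond, 1, gini))    # expected conditional variance

tau_var <- (VY - EVYX) / VY

# Direct formula for tau(Y|X), for comparison.
tau_dir <- (sum(sweep(p^2, 1, pr, "/")) - sum(pc^2)) / (1 - sum(pc^2))

all.equal(tau_var, tau_dir)
```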

Finally, AIC and BIC are calculated as suggested in Sakamoto and Akaike (1977). In particular:

AIC(Y|X) = -2*sum_ij n_ij*log(n_ij/n_i+) + 2*I*(J-1)

BIC(Y|X) = -2*sum_ij n_ij*log(n_ij/n_i+) + I*(J-1)*log(n)

where I*(J-1) is the number of parameters (conditional probabilities) to estimate. Note that the R package catdap provides functions to identify the best subset of predictors based on AIC.
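A hedged base R sketch of the two criteria, read as -2 times the log-likelihood of the conditional distribution of Y given X plus a penalty on the I*(J-1) parameters. The table is made up, and skipping empty cells is an assumption of the sketch.

```r
# Illustrative 2 x 3 table (hypothetical counts): X in rows, Y in columns.
tab <- matrix(c(20, 10,  5,
                15, 10, 40), nrow = 2, byrow = TRUE,
              dimnames = list(X = c("a", "b"), Y = c("u", "v", "w")))
n  <- sum(tab)
I  <- nrow(tab); J <- ncol(tab)
ni <- rowSums(tab)          # row totals n_i+
nz <- tab > 0               # empty cells are skipped (assumed to add 0)

# Log-likelihood term: sum_ij n_ij * log(n_ij / n_i+).
cond <- sweep(tab, 1, ni, "/")
loglik <- sum(tab[nz] * log(cond[nz]))

# Number of conditional probabilities to estimate.
npar <- I * (J - 1)

aic <- -2 * loglik + 2 * npar
bic <- -2 * loglik + npar * log(n)

c(AIC = aic, BIC = bic, npar = npar)
```

The two criteria differ only in the penalty, so BIC exceeds AIC whenever log(n) > 2.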

Note that missing values are excluded from the tables and therefore from the estimation of the various measures.
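The effect of excluding missing values can be illustrated with base R's table, which drops NA by default. This shows the general behaviour on made-up vectors and is only assumed to mirror what happens inside the function.

```r
# Illustrative vectors (hypothetical) with one missing value each.
x <- factor(c("a", "b", NA,  "a", "b"))
y <- factor(c("u", "v", "u", NA,  "v"))

# By default table() ignores any pair in which x or y is NA.
tab <- table(x, y)
tab

# Only the complete pairs are counted.
sum(tab)   # 3 of the 5 units
```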


When out.df=FALSE (default) a list object with the following components:


A vector with the estimated Cramer's V for each response-predictor pair.


A vector with the estimated bias-corrected Cramer's V for each response-predictor pair.


A vector with the estimated mutual information I(X;Y) for each response-predictor pair.


A vector with the normalized mutual information I(X;Y)* for each response-predictor pair.


A vector with the values of Goodman-Kruskal lambda(Y|X) for each response-predictor pair.


A vector with the values of Goodman-Kruskal tau(Y|X) for each response-predictor pair.


A vector with the values of Theil's uncertainty coefficient U(Y|X) for each response-predictor pair.


A vector with the values of AIC(Y|X) for each response-predictor pair.


A vector with the values of BIC(Y|X) for each response-predictor pair.


A vector with the number of parameters (conditional probabilities) estimated to calculate AIC and BIC for each response-predictor pair.

When out.df=TRUE the output will be a data.frame with a column for each measure.


Marcello D'Orazio


Agresti A (2002) Categorical Data Analysis. Second Edition. Wiley, New York.

Bergsma W (2013) A bias-correction for Cramer's V and Tschuprow's T. Journal of the Korean Statistical Society, 42, 323–328.

The Institute of Statistical Mathematics (2018). catdap: Categorical Data Analysis Program Package. R package version 1.3.4.

Sakamoto Y and Akaike H (1977) Analysis of Cross-Classified Data by AIC. Ann. Inst. Statist. Math., 30, 185–197.


library(StatMatch) # provides pw.assoc
data(quine, package="MASS") # loads quine from MASS

# Lrn is the response variable
pw.assoc(Lrn~Age+Sex+Eth, data=quine)

# usage of units' weights
set.seed(1) # makes the random weights reproducible
quine$ww <- runif(nrow(quine), 1, 4) # random weights, 1 <= ww <= 4
pw.assoc(Lrn~Age+Sex+Eth, data=quine, weights="ww")

StatMatch documentation built on March 18, 2022, 6:55 p.m.