pw.assoc | R Documentation |

This function computes some association and *Proportional Reduction in Error* (PRE) measures between a nominal categorical variable and each of the other available predictors (which must also be categorical variables).

pw.assoc(formula, data, weights=NULL, out.df=FALSE)

`formula` |
A formula of the type |

`data` |
The data frame which contains the variables called by |

`weights` |
The name of the variable in |

`out.df` |
Logical. If |

This function computes some association, PRE measures, AIC and BIC for each couple response-predictor that can be created starting from argument `formula`. In particular, a two-way contingency table *X x Y* is built for each available X variable (X in rows and Y in columns); then the following measures are considered.

Cramer's *V*:

$$ V = \sqrt{\frac{\chi^2}{n \, \min(I-1,\, J-1)}} $$

where $\chi^2$ is the Pearson chi-squared statistic of the table, *n* is the sample size, *I* is the number of rows (categories of X) and *J* is the number of columns (categories of Y). Cramer's *V* ranges from 0 to 1.
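As an illustration of the formula above, the following is a minimal base-R sketch. The helper name `cramers_v` and the 2 x 2 counts are made up for the example; this is not the code used inside `pw.assoc`.

```r
# Hypothetical helper: Cramer's V from a two-way table of counts
# (X in rows, Y in columns), following the formula above.
cramers_v <- function(tab) {
  n <- sum(tab)
  # Expected counts under independence of X and Y
  expected <- outer(rowSums(tab), colSums(tab)) / n
  # Pearson chi-squared statistic (no continuity correction)
  chi2 <- sum((tab - expected)^2 / expected)
  sqrt(chi2 / (n * min(nrow(tab) - 1, ncol(tab) - 1)))
}

# Made-up 2 x 2 table of counts
tab <- matrix(c(30, 10,
                10, 50), nrow = 2, byrow = TRUE)
v <- cramers_v(tab)
print(v)
```

The sketch assumes that no row or column of the table is empty, otherwise some expected counts would be zero.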

Bias-corrected Cramer's *V* (*V_c*) proposed by Bergsma (2013).
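The page does not spell out the correction. The sketch below follows the bias correction as commonly stated from Bergsma (2013): the squared phi coefficient is shrunk by `(I-1)(J-1)/(n-1)` and the numbers of rows and columns are adjusted accordingly. The helper name `cramers_v_bc` and the data are hypothetical, and the exact computation inside `pw.assoc` may differ.

```r
# Hypothetical helper: bias-corrected Cramer's V (Bergsma-style correction)
cramers_v_bc <- function(tab) {
  n <- sum(tab)
  I <- nrow(tab)
  J <- ncol(tab)
  expected <- outer(rowSums(tab), colSums(tab)) / n
  phi2 <- sum((tab - expected)^2 / expected) / n       # chi-squared / n
  phi2_c <- max(0, phi2 - (I - 1) * (J - 1) / (n - 1))  # corrected phi^2
  I_c <- I - (I - 1)^2 / (n - 1)                        # corrected no. of rows
  J_c <- J - (J - 1)^2 / (n - 1)                        # corrected no. of columns
  sqrt(phi2_c / min(I_c - 1, J_c - 1))
}

# Made-up 2 x 2 table of counts (X in rows, Y in columns)
tab <- matrix(c(30, 10,
                10, 50), nrow = 2, byrow = TRUE)
v_c <- cramers_v_bc(tab)

# A table with exactly proportional rows (independence) gives 0
v_c_indep <- cramers_v_bc(matrix(c(10, 20, 30, 60), nrow = 2))
print(c(v_c, v_c_indep))
```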

Mutual information:

$$ I(X;Y) = \sum_{i,j} p_{ij} \log \frac{p_{ij}}{p_{i+} \, p_{+j}} $$

equal to 0 in case of independence but without a fixed upper bound, i.e. $0 \le I(X;Y) < \infty$. Here $p_{ij} = n_{ij}/n$.

A normalized version of *I(X;Y)*, ranging from 0 (independence) to 1 and not affected by the number of categories (*I* and *J*):

$$ I^*(X;Y) = \frac{I(X;Y)}{\min(H_X, H_Y)} $$

where $H_X$ and $H_Y$ are the entropies of the variables X and Y, respectively.
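The two mutual information measures can be sketched in base R as follows. The helper name `mutual_info` and both tables are invented for the example, and cells with zero probability are treated as contributing zero to the sums; `pw.assoc` itself may handle such cells differently.

```r
# Hypothetical helper: mutual information and its normalized version
mutual_info <- function(tab) {
  p <- tab / sum(tab)                 # joint proportions p_ij
  p_row <- rowSums(p)                 # p_i+
  p_col <- colSums(p)                 # p_+j
  indep <- outer(p_row, p_col)        # p_i+ * p_+j
  nz <- p > 0                         # skip empty cells (0 * log 0 := 0)
  mi <- sum(p[nz] * log(p[nz] / indep[nz]))
  entropy <- function(q) {
    q <- q[q > 0]
    -sum(q * log(q))
  }
  c(mi = mi, norm.mi = mi / min(entropy(p_row), entropy(p_col)))
}

# Made-up tables: perfect association and exact independence
perfect <- matrix(c(20, 0, 0, 30), nrow = 2)
indep_tab <- matrix(c(10, 20, 30, 60), nrow = 2)
res_perfect <- mutual_info(perfect)
res_indep <- mutual_info(indep_tab)
print(res_perfect)
print(res_indep)
```

With perfect association the normalized measure reaches its upper bound of 1, while under exact independence both measures are 0 up to floating-point error.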

Goodman-Kruskal *lambda(Y|X)* (i.e. response conditional on the given predictor):

$$ \lambda(Y|X) = \frac{\sum_i \max_j(p_{ij}) - \max_j(p_{+j})}{1 - \max_j(p_{+j})} $$

It ranges from 0 to 1, and denotes how much the knowledge of the row variable X (predictor) helps in reducing the prediction error of the values of the column variable Y (response).
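A minimal base-R sketch of the lambda formula follows; `gk_lambda` is a hypothetical helper name and the counts are made up.

```r
# Hypothetical helper: Goodman-Kruskal lambda(Y|X), X in rows, Y in columns
gk_lambda <- function(tab) {
  p <- tab / sum(tab)
  best_given_x <- sum(apply(p, 1, max))   # sum_i max_j p_ij
  best_marginal <- max(colSums(p))        # max_j p_+j
  (best_given_x - best_marginal) / (1 - best_marginal)
}

# Made-up 2 x 2 table of counts
tab <- matrix(c(30, 10,
                10, 50), nrow = 2, byrow = TRUE)
lam <- gk_lambda(tab)
print(lam)
```

In this example, guessing the modal category of Y is wrong 40% of the time without X and 20% of the time with X, so the prediction error is halved.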

Goodman-Kruskal *tau(Y|X)*:

$$ \tau(Y|X) = \frac{\sum_{i,j} p_{ij}^2 / p_{i+} - \sum_j p_{+j}^2}{1 - \sum_j p_{+j}^2} $$

It takes values in the interval [0,1] and has the same PRE interpretation as lambda.
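The tau formula can be sketched in the same way; `gk_tau` is a hypothetical helper name, the counts are made up, and the sketch assumes that no row of the table is empty.

```r
# Hypothetical helper: Goodman-Kruskal tau(Y|X), X in rows, Y in columns
gk_tau <- function(tab) {
  p <- tab / sum(tab)
  p_row <- rowSums(p)                              # p_i+
  p_col <- colSums(p)                              # p_+j
  within <- sum(sweep(p^2, 1, p_row, "/"))         # sum_ij p_ij^2 / p_i+
  (within - sum(p_col^2)) / (1 - sum(p_col^2))
}

# Made-up 2 x 2 table of counts
tab <- matrix(c(30, 10,
                10, 50), nrow = 2, byrow = TRUE)
tau <- gk_tau(tab)
print(tau)
```

For a 2 x 2 table tau coincides with the squared Cramer's *V*, which gives a quick sanity check on the result.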

Theil's uncertainty coefficient:

$$ U(Y|X) = \frac{\sum_{i,j} p_{ij} \log (p_{ij}/p_{i+}) - \sum_j p_{+j} \log p_{+j}}{- \sum_j p_{+j} \log p_{+j}} $$

It takes values in the interval [0,1] and measures the reduction of uncertainty in the column variable Y due to knowing the row variable X. Note that the numerator of U(Y|X) is the mutual information I(X;Y).
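The following base-R sketch computes Theil's *U* as the relative drop from the entropy of Y to the conditional entropy of Y given X. The helper name `theil_u` and the tables are made up, and empty cells are taken to contribute zero.

```r
# Hypothetical helper: Theil's uncertainty coefficient U(Y|X)
theil_u <- function(tab) {
  p <- tab / sum(tab)
  p_col <- colSums(p)
  cond <- sweep(p, 1, rowSums(p), "/")        # p_ij / p_i+
  nz <- p > 0
  h_y <- -sum(p_col[p_col > 0] * log(p_col[p_col > 0]))  # entropy of Y
  h_y_given_x <- -sum(p[nz] * log(cond[nz]))              # conditional entropy
  (h_y - h_y_given_x) / h_y
}

# Made-up tables: perfect association and exact independence
u_perfect <- theil_u(matrix(c(20, 0, 0, 30), nrow = 2))
u_indep <- theil_u(matrix(c(10, 20, 30, 60), nrow = 2))
print(c(u_perfect, u_indep))
```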

It is worth noting that *lambda*, *tau* and *U* can be viewed as measures of the proportional reduction of the variance of the Y variable when passing from its marginal distribution to its conditional distribution given the predictor X, derived from the general expression (cf. Agresti, 2002, p. 56):

$$ \frac{V(Y) - E[V(Y|X)]}{V(Y)} $$

They differ in the way the variance is measured; in fact, there is no generally accepted definition of the variance for a categorical variable.
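To make the general expression concrete, the sketch below plugs three different "variance" functions into a single hypothetical `pre()` helper: one minus the modal probability (which reproduces lambda), the Gini concentration (tau) and the entropy (*U*). All names and the data are invented for the illustration, and non-empty rows are assumed.

```r
# Hypothetical generic PRE measure: (V(Y) - E[V(Y|X)]) / V(Y)
pre <- function(tab, v) {
  p <- tab / sum(tab)
  p_row <- rowSums(p)
  v_y <- v(colSums(p))                         # variation of the marginal of Y
  # Expected variation of Y within the categories of X
  e_v_yx <- sum(sapply(seq_len(nrow(p)), function(i) {
    p_row[i] * v(p[i, ] / p_row[i])
  }))
  (v_y - e_v_yx) / v_y
}

v_mode <- function(q) 1 - max(q)               # leads to lambda(Y|X)
v_gini <- function(q) 1 - sum(q^2)             # leads to tau(Y|X)
v_entropy <- function(q) {                     # leads to U(Y|X)
  q <- q[q > 0]
  -sum(q * log(q))
}

# Made-up 2 x 2 table of counts (X in rows, Y in columns)
tab <- matrix(c(30, 10,
                10, 50), nrow = 2, byrow = TRUE)
pre_lambda <- pre(tab, v_mode)
pre_tau <- pre(tab, v_gini)
pre_u <- pre(tab, v_entropy)
print(c(lambda = pre_lambda, tau = pre_tau, U = pre_u))
```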

Finally, AIC (and BIC) is calculated, as suggested in Sakamoto and Akaike (1977). In particular:

$$ AIC(Y|X) = -2 \sum_{i,j} n_{ij} \log \frac{n_{ij}}{n_{i+}} + 2 \, I (J-1) $$

$$ BIC(Y|X) = -2 \sum_{i,j} n_{ij} \log \frac{n_{ij}}{n_{i+}} + I (J-1) \log(n) $$

where $I(J-1)$ is the number of parameters (conditional probabilities) to estimate. Note that the **R** package catdap provides functions to identify the best subset of predictors based on AIC.
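A base-R sketch of the two criteria, written directly from the formulas above, is given below. The helper name `aic_bic` and the counts are made up, and empty cells are assumed to contribute zero to the log-likelihood; the handling inside `pw.assoc` may differ.

```r
# Hypothetical helper: AIC(Y|X) and BIC(Y|X) for a two-way table of counts
aic_bic <- function(tab) {
  n <- sum(tab)
  npar <- nrow(tab) * (ncol(tab) - 1)          # I * (J - 1) conditional probs
  cond <- sweep(tab, 1, rowSums(tab), "/")     # n_ij / n_i+
  nz <- tab > 0
  loglik <- sum(tab[nz] * log(cond[nz]))       # multinomial log-likelihood
  c(AIC = -2 * loglik + 2 * npar,
    BIC = -2 * loglik + npar * log(n),
    npar = npar)
}

# Made-up 2 x 2 table of counts (X in rows, Y in columns)
tab <- matrix(c(30, 10,
                10, 50), nrow = 2, byrow = TRUE)
res <- aic_bic(tab)
print(res)
```

The two criteria share the same log-likelihood term and differ only in the penalty, so BIC exceeds AIC whenever `log(n) > 2`.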

Please note that the missing values are excluded from the tables and therefore excluded from the estimation of the various measures.
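The effect of excluding missing values can be illustrated with base R's `table()`, which drops `NA` by default. This only mimics the behaviour described above; it is an assumption that `pw.assoc` builds its tables in an equivalent way, and the data are made up.

```r
# Made-up data with one missing value in each variable
x <- factor(c("a", "b", NA, "a", "b"))
y <- factor(c("u", "v", "u", NA, "v"))

# By default table() excludes observations with NA in either variable
tab <- table(x, y)
n_used <- sum(tab)   # number of observations entering the table
print(tab)
print(n_used)
```

Only the three complete pairs enter the table, so every measure derived from it is based on fewer units than the data frame contains.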

When `out.df=FALSE` (default), a `list` object with the following components:

`V` |
A vector with the estimated Cramer's V for each couple response-predictor. |

`bcV` |
A vector with the estimated bias-corrected Cramer's V for each couple response-predictor. |

`mi` |
A vector with the estimated mutual information I(X;Y) for each couple response-predictor. |

`norm.mi` |
A vector with the normalized mutual information I(X;Y)* for each couple response-predictor. |

`lambda` |
A vector with the values of Goodman-Kruskal lambda(Y|X) for each couple response-predictor. |

`tau` |
A vector with the values of Goodman-Kruskal tau(Y|X) for each couple response-predictor. |

`U` |
A vector with the values of Theil's uncertainty coefficient U(Y|X) for each couple response-predictor. |

`AIC` |
A vector with the values of AIC(Y|X) for each couple response-predictor. |

`BIC` |
A vector with the values of BIC(Y|X) for each couple response-predictor. |

`npar` |
A vector with the number of parameters (conditional probabilities) estimated to calculate AIC and BIC for each couple response-predictor. |

When `out.df=TRUE` the output will be a data.frame with a column for each measure.

Marcello D'Orazio mdo.statmatch@gmail.com

Agresti A (2002) *Categorical Data Analysis. Second Edition*. Wiley, New York.

Bergsma W (2013) A bias-correction for Cramer's V and Tschuprow's T. *Journal of the Korean Statistical Society*, 42, 323–328.

The Institute of Statistical Mathematics (2018). catdap: Categorical Data Analysis Program Package. R package version 1.3.4. https://CRAN.R-project.org/package=catdap

Sakamoto Y and Akaike H (1977) Analysis of Cross-Classified Data by AIC. *Ann. Inst. Statist. Math.*, 30, 185–197.

data(quine, package="MASS") # loads quine from MASS
str(quine)

# Lrn is the response variable
pw.assoc(Lrn~Age+Sex+Eth, data=quine)

# usage of units' weights
quine$ww <- runif(nrow(quine), 1, 4) # random weights, 1 <= weights <= 4
pw.assoc(Lrn~Age+Sex+Eth, data=quine, weights="ww")
