Descriptive analysis of statistical associations with GDAtools

library(knitr)
library(rmdformats)

oldpar <- par() 
oldoptions <- options()

## Global options
options(max.print=75, width=1000)
opts_chunk$set(echo=TRUE,
               cache=FALSE,
               prompt=FALSE,
               tidy=FALSE,
               comment=NA,
               message=FALSE,
               warning=FALSE)
opts_knit$set(width=75)

The GDAtools package provides functions dedicated to the description of statistical associations between variables. They are based on effect size measures (also called association measures).

All these measures are built from simple concepts (correlations, proportion of variance explained). They are bounded (between -1 and 1 or between 0 and 1) and are not sensitive to the number of observations.
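
The insensitivity to the number of observations can be illustrated with a small base R sketch. It uses Cramér's V, computed by hand from the chi-squared statistic, on a made-up contingency table (the counts and the helper `cramers_v` are assumptions for illustration, not part of GDAtools). Multiplying every count by 10 multiplies the chi-squared by 10 but leaves Cramér's V unchanged.

```r
# Synthetic 2x3 contingency table (made-up counts, not the Movies data)
tab <- matrix(c(30, 20, 10,
                10, 20, 30), nrow = 2, byrow = TRUE)

# Hypothetical helper: Cramér's V computed from the chi-squared statistic
cramers_v <- function(tab) {
  chi2 <- unname(chisq.test(tab, correct = FALSE)$statistic)
  n <- sum(tab)
  k <- min(nrow(tab), ncol(tab)) - 1
  sqrt(chi2 / (n * k))
}

# Chi-squared for the original table and for the same table with 10x the counts
chi2_small <- unname(chisq.test(tab, correct = FALSE)$statistic)
chi2_large <- unname(chisq.test(tab * 10, correct = FALSE)$statistic)

# Cramér's V for both tables
v_small <- cramers_v(tab)
v_large <- cramers_v(tab * 10)

print(c(chi2_small = chi2_small, chi2_large = chi2_large))
print(c(v_small = v_small, v_large = v_large))
```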

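The measures of global association used here are Cramér's V for two categorical variables, eta² for one categorical and one numerical variable, and the Kendall, Spearman and Pearson correlations for two numerical variables.

In addition to measures of global association, we also use measures of local association, i.e. at the level of the categories of the variables: the phi coefficient and the point biserial correlation.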

Note that if we code the categories of the categorical variables as binary variables with values of 0 or 1, the phi coefficient and the point biserial correlation are equivalent to Pearson's correlation coefficient.
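
This equivalence can be checked numerically. The sketch below uses synthetic data (all variable names are made up for illustration) and computes the phi coefficient and the point biserial correlation from their textbook formulas, then compares them with `cor()` applied to the 0/1 coded variables.

```r
set.seed(1)
# Synthetic data (hypothetical, for illustration only)
n <- 200
a <- rbinom(n, 1, 0.4)             # binary variable coded 0/1
b <- rbinom(n, 1, 0.3 + 0.4 * a)   # second binary variable, related to a
y <- rnorm(n, mean = 2 * a)        # numerical variable, related to a

# Phi coefficient from the 2x2 table:
# (n11 * n00 - n10 * n01) / sqrt(product of the four margins)
tab <- table(factor(a, levels = 0:1), factor(b, levels = 0:1))
phi <- as.numeric(
  (tab["1", "1"] * tab["0", "0"] - tab["1", "0"] * tab["0", "1"]) /
    sqrt(prod(rowSums(tab)) * prod(colSums(tab)))
)

# Point biserial correlation from group means and the overall
# (population) standard deviation
m1 <- mean(y[a == 1])
m0 <- mean(y[a == 0])
n1 <- sum(a == 1)
n0 <- sum(a == 0)
s_n <- sqrt(mean((y - mean(y))^2))
r_pb <- (m1 - m0) / s_n * sqrt(n1 * n0 / n^2)

# Both match Pearson's correlation on the 0/1 coded variables
print(c(phi = phi, pearson = cor(a, b)))
print(c(point_biserial = r_pb, pearson = cor(a, y)))
```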

For more details on these effect size measurements, see: Rakotomalala R., « Comprendre la taille d'effet (effect size) »

In some functions of GDAtools, association measures can be complemented by permutation tests, which belong to combinatorial inference and constitute a nonparametric alternative to the significance tests of frequentist inference. A permutation test is performed in several steps.

  1. A measure of association between the two variables under study is computed.

  2. The same measure of association is calculated from a "permuted" version of the data, i.e. by randomly "mixing" the values of one of the variables, in order to "break" the relationship between the variables.

  3. Step 2 is repeated a large number of times. This gives an empirical distribution (as opposed to the theoretical distribution used by frequentist inference) of the measure of association under the H0 hypothesis of no relationship between the two variables.

  4. The result of step 1 is compared with the distribution obtained in 3. The p-value of the permutation test is the proportion of values of the H0 distribution that are more extreme than the measure of association observed in 1.

If all possible permutations are performed, the permutation test is called "exact". In practice, the computation time required is often too long and only a subset of the possible permutations is performed, resulting in an "approximate" test. In the following examples, the number of permutations is set to 100 to reduce the computation time, but it is advisable to increase this number to obtain more accurate and reliable results (for example nperm=1000).
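
The four steps can be sketched in base R as follows. This is only an illustration on synthetic data, using Pearson's correlation as the measure of association and a two-sided comparison of absolute values. The way GDAtools computes its own p-values may differ in its details.

```r
set.seed(42)
# Synthetic data (hypothetical): two weakly related numerical variables
n <- 100
x <- rnorm(n)
y <- 0.3 * x + rnorm(n)

nperm <- 1000

# Step 1: observed measure of association
obs <- cor(x, y)

# Steps 2 and 3: recompute the measure after shuffling y, many times
perm <- replicate(nperm, cor(x, sample(y)))

# Step 4: proportion of permuted values at least as extreme as the observed one
p_value <- mean(abs(perm) >= abs(obs))

print(c(observed = obs, p_value = p_value))
```

With a larger `nperm`, the empirical distribution under H0 is better approximated, at the cost of computation time.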

To illustrate the statistical association analysis functions of GDAtools, we will use data on cinema. This is a sample of 1000 films released in France in the 2000s, for which we know the budget, the genre, the country of origin, the "art et essai" label, the selection in a festival (Cannes, Berlin or Venice), the average rating of intellectual critics (according to Allociné) and the number of admissions. Some of these variables are numerical, others are categorical.

library(GDAtools)
data(Movies)
str(Movies)

Relationship between two variables

The package offers several functions to study the statistical relationship between two variables, depending on the nature (categorical or numerical) of these variables.

Two categorical variables

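The function assoc.twocat computes measures of association between two categorical variables, including Cramér's V and the phi coefficients, along with permutation test p-values when nperm is set: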

assoc.twocat(Movies$Country, Movies$ArtHouse, nperm=100)
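
To make the idea of local association concrete, here is a base R sketch that computes a phi coefficient for each pair of categories, as the correlation between the two 0/1 category indicators. The data are synthetic and the helper `local_phi` is hypothetical; it is not the GDAtools implementation.

```r
set.seed(3)
# Synthetic categorical data (hypothetical, not the Movies data)
country <- factor(sample(c("France", "USA", "Other"), 300, replace = TRUE))
label <- factor(ifelse(runif(300) < ifelse(country == "France", 0.7, 0.2),
                       "Yes", "No"))

# Hypothetical helper: correlation between the 0/1 indicator of each
# category of x and the 0/1 indicator of each category of y
local_phi <- function(x, y) {
  sapply(levels(y), function(ly)
    sapply(levels(x), function(lx)
      cor(as.numeric(x == lx), as.numeric(y == ly))))
}

# Rows are the categories of country, columns the categories of label
phi <- local_phi(country, label)
print(round(phi, 2))
```

A positive value indicates attraction between two categories (they co-occur more often than under independence), a negative value indicates repulsion.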

The function ggassoc_crosstab presents the contingency table in graphical form, with rectangles whose area corresponds to the counts and whose color gradient corresponds to the attractions/repulsions (phi coefficients). Cramér's V can be displayed in a corner of the graph. Here, the "art et essai" label is clearly over-represented among French films and under-represented among American films.

ggassoc_crosstab(Movies, ggplot2::aes(x=Country, y=ArtHouse), max.phi=0.8)

The function ggassoc_phiplot offers another way to represent the attractions/repulsions. The width of the rectangles corresponds to the counts of the variable x, their height to the phi coefficients. The rectangles are colored in black when there is attraction, in white when there is repulsion.

ggassoc_phiplot(Movies, ggplot2::aes(x=Country, y=ArtHouse))

One categorical variable and one numerical variable

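The function assoc.catcont computes eta² and the point biserial correlations, along with the p-values of the corresponding permutation tests: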

assoc.catcont(Movies$Country, Movies$Critics, nperm=100)
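
As a reminder of what eta² measures, the following base R sketch computes it on synthetic data as the share of the total sum of squares that lies between groups, and checks that it equals the R² of a one-way linear model. Variable names are made up for illustration.

```r
set.seed(7)
# Synthetic data (hypothetical): a rating that differs across three groups
group <- factor(rep(c("A", "B", "C"), each = 50))
rating <- rnorm(150, mean = c(2, 3, 3.5)[as.integer(group)])

# eta² = between-group sum of squares / total sum of squares
group_means <- ave(rating, group)   # each observation replaced by its group mean
ss_between <- sum((group_means - mean(rating))^2)
ss_total <- sum((rating - mean(rating))^2)
eta2 <- ss_between / ss_total

# Same quantity as the R² of a one-way linear model
r2 <- summary(lm(rating ~ group))$r.squared

print(c(eta2 = eta2, r2 = r2))
```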

The function ggassoc_boxplot represents the relationship between the variables in the form of box-plots and/or "violin" distributions. The eta² value is displayed in a corner of the graph.

ggassoc_boxplot(Movies, ggplot2::aes(x=Country, y=Critics))

Two numerical variables

The function assoc.twocont computes the Kendall and Spearman rank correlations and the Pearson linear correlation, as well as the p-values of the corresponding permutation tests.

assoc.twocont(Movies$Budget, Movies$BoxOffice, nperm=100)
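
The three coefficients are also available in base R through `cor()`. The sketch below, on synthetic data with a monotonic but non-linear relationship (variable names are made up), shows how they can be computed side by side.

```r
set.seed(11)
# Synthetic data (hypothetical): a monotonic but non-linear relationship
budget <- rexp(200, rate = 1 / 10)
admissions <- budget^2 * exp(rnorm(200, sd = 0.5))

# Pearson, Spearman and Kendall correlations for the same pair of variables
cors <- sapply(c("pearson", "spearman", "kendall"),
               function(m) cor(budget, admissions, method = m))
print(round(cors, 3))

# Spearman's rho is Pearson's correlation computed on the ranks
rho_from_ranks <- cor(rank(budget), rank(admissions))
```

Because rank correlations depend only on the ordering of the values, they are less affected by skewed distributions and extreme values than the Pearson correlation.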

The function ggassoc_scatter represents the relationship between the two variables in the form of a scatterplot, with an approximation by smoothing (with the "Generalized Additive Model" method). Kendall's tau is displayed in a corner of the graph.

ggassoc_scatter(Movies, ggplot2::aes(x=Budget, y=BoxOffice))

Relationships between one variable Y and a set of variables X

Often, not just two variables are studied, but a larger set of variables. When one of these variables has the status of an "explained" variable, one generally uses regression models or, possibly, supervised learning models (see the vignette of the moreparty package for an example). However, it is essential to know all the bivariate relations of the dataset before moving to an "all else being equal" approach.

It should be noted that if we do this work meticulously, possibly adding the descriptive analysis of the relationships between three or four variables, we often realize that the additional knowledge brought by the regression models is quite limited.

The function assoc.yx computes the global association between Y and each of the variables of X, as well as between all pairs of variables of X.

assoc.yx(Movies$BoxOffice, Movies[,-7], nperm=100)

The functions catdesc and condesc make it possible to go into more detail about the relationships, by going to the category level.

catdesc deals with the case where Y is a categorical variable. For a categorical variable X1, it measures the association between a given category of Y and a given category of X1, in particular with the phi coefficient.

The results are sorted by decreasing phi coefficient and can be filtered to keep only the associations above a given threshold (in absolute value).

For a numerical variable X2 and a given category of Y, it describes the distribution of X2 within that category, in particular its median and its dispersion.

Dispersion is measured by the "median absolute deviation" (MAD), which is the median of the absolute deviations from the median. The median and MAD are so-called "robust" indicators, i.e. not sensitive to outliers.
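
The robustness of the median and the MAD can be seen on a small made-up vector containing one outlier. Note that base R's `mad()` multiplies the result by 1.4826 by default; `constant = 1` gives the raw median of absolute deviations described above. Whether GDAtools applies such a scaling constant is not stated here.

```r
x <- c(2, 3, 3, 4, 5, 6, 100)     # made-up values with one outlier

med <- median(x)                  # 4
mad_raw <- median(abs(x - med))   # 1: median of absolute deviations from the median

# Same value with the built-in function, without the default scaling constant
mad_builtin <- mad(x, constant = 1)

# Non-robust counterparts, strongly pulled by the outlier
print(c(median = med, mad = mad_raw, mean = mean(x), sd = sd(x)))
```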

catdesc(Movies$Festival, Movies[,-5])$bylevel$Yes

condesc deals with the case where Y is a numerical variable. For a categorical variable X1, it measures the association between Y and each category of X1, in particular with the point biserial correlation.

The results are sorted by decreasing point biserial correlation and can be filtered to keep only associations above a given threshold (in absolute value).

For the numerical variables of X, it computes Kendall's tau.

condesc(Movies$BoxOffice, Movies[,-7], nperm=100)

The darma function presents the results in a form close to that of a regression results table.

When the variable Y is numerical, the function is called as follows:

# compute once, then format the result as a table
res <- darma(Movies$BoxOffice, Movies[,-7], nperm=100)
knitr::kable(res, row.names=FALSE, format="simple")

When the variable Y is categorical, the category of interest is selected with the target argument (here the second level of Festival):

# compute once, then format the result as a table
res <- darma(Movies$Festival, Movies[,-5], target=2, nperm=100)
knitr::kable(res, row.names=FALSE, format="simple")

Relationships between all variables in a set

Finally, the ggassoc_* functions are designed to be integrated in the plot matrices of the GGally package. It is thus possible to use them to represent in a single plot all the bivariate relations of a group of variables.

library(GGally)
ggpairs(Movies,
        lower = list(continuous = ggassoc_scatter,
                     combo = ggassoc_boxplot,
                     discrete = ggassoc_crosstab),
        upper = list(continuous = ggassoc_scatter,
                     combo = ggassoc_boxplot,
                     discrete = ggassoc_crosstab),
        diag = list(continuous = wrap("diagAxis", gridLabelSize = 3),
                    discrete = wrap("diagAxis", gridLabelSize = 3)))


GDAtools documentation built on May 31, 2021, 5:08 p.m.