In jpkrooney/rcorex: An Implementation of Total Correlation Explanation

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)

rcorex: Discover latent structure in high dimensional data.

Total correlation explanation is method for discovering latent structure in high dimensional data. Total correlation explanation has been implemented in Python as CorEx and related modules (https://github.com/gregversteeg/CorEx). The initial aim of rcorex is to implement Total Correlation Explanation in the R statistical software, specifically to replicate the functionality of the BioCorEx Python module ( https://github.com/gregversteeg/bio_corex ) which is designed to be suitable for biomedical datasets. This is implemented in the biocorex() command.

The theoretical framework behind the CorEx and Bio CorEx Python modules are laid out in the following academic papers:
1. Discovering Structure in High-Dimensional Data Through Correlation Explanation
2. Maximally Informative Hierarchical Representions of High-Dimensional Data
3. Comprehensive discovery of subsample gene expression components by information explanation: therapeutic implications in cancer

Installation

rcorex can be installed from Github:

# install.packages("remotes")
remotes::install_github("jpkrooney/rcorex")

Minimal example

To fit a CorEx model in rcorex we can use the biocorex() command. biocorex() accepts a data.frame or a matrix as input, however as with the Python implementation of Bio CorEx, all variables must have the same data-type and currently only "discrete" or "gaussian" data are allowed as marginal descriptions, which apply to all columns.

library(rcorex)
# make a small dataset
df1 <- matrix(c(0,0,0,0,0,
             0,0,0,1,1,
             1,1,1,0,0,
             1,1,1,1,1), ncol=5, byrow = TRUE)

# fit biocorex
set.seed(1234)
fit1 <- biocorex(df1, n_hidden = 2, dim_hidden = 2, marginal_description = "discrete", logpx_method = "pycorex")
plot(fit1)
summary(fit1)

# What was the total correlation for each hidden dimension ?
fit1$tcs

# Which variables were clustered together?
fit1$clusters

# Which labels were assigned to each row of data for hidden cluster 1?
fit1$labels[, 1]

# And for hidden cluster 2?
fit1$labels[, 2]

Hierarchical CorEx example

rcorex can search for hierarchical structure in data by using the labels output from an rcorex object as the input to the next layer in the hierarchy. This is shown in the following example using R's inbuilt iris dataset.

library(rcorex)
library(ggraph)

set.seed(1234)

# Load iris dataset
data("iris")

# Need to convert species factor variable to indicator variables
iris <- data.frame(iris , model.matrix(~iris$Species)[,2:3])
iris$Species <- NULL

# fit first layer of CorEx 
layer1 <- biocorex(iris, 3, 2, marginal_description = "gaussian", repeats = 5, logpx_method = "pycorex")

Note the use of the repeats = 5 argument to biocorex. This acts to run biocorex not once, but 5 times and biocorex automatically selects the run which produces the maximal TC to return to the user (unless the return_all_runs argument is set to TRUE).

We can then use the labels from layer1 as the input for a second layer of CorEx to discover hierarchical structure. Note that the value used for n_hidden should be lower in the second layer than it was in the first.

# fit second layer of CorEx
layer2 <- biocorex(layer1$labels, 1,2, marginal_description = "discrete", repeats = 5, logpx_method = "pycorex")

# make a network tidygraph of hierarchical layers
g_hier <- make_corex_tidygraph( list(layer1, layer2))

# Plot network graph of hierarchical layers
ggraph(g_hier, layout = "fr") +
    geom_node_point(aes(size = node_size), show.legend = FALSE) +
    geom_edge_hive(aes(width = thickness), alpha = 0.75, show.legend = FALSE) +
    scale_edge_width(range = c(0.3, 3)) +
    geom_node_text(aes(label = names), repel = TRUE) +
    theme_graph()