knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.path = "man/figures/README-", out.width = "100%" )
Total correlation explanation is method for discovering latent structure in high dimensional data. Total correlation explanation has been implemented in Python as CorEx and related modules (https://github.com/gregversteeg/CorEx). The initial aim of rcorex is to implement Total Correlation Explanation in the R statistical software, specifically to replicate the functionality of the BioCorEx Python module ( https://github.com/gregversteeg/bio_corex ) which is designed to be suitable for biomedical datasets. This is implemented in the biocorex()
command.
The theoretical framework behind the CorEx and Bio CorEx Python modules are laid out in the following academic papers:
1. Discovering Structure in High-Dimensional Data Through Correlation Explanation
2. Maximally Informative Hierarchical Representions of High-Dimensional Data
3. Comprehensive discovery of subsample gene expression components by information explanation: therapeutic implications in cancer
rcorex can be installed from Github:
# install.packages("remotes") remotes::install_github("jpkrooney/rcorex")
To fit a CorEx model in rcorex we can use the biocorex()
command. biocorex()
accepts a data.frame or a matrix as input, however as with the Python implementation of Bio CorEx, all variables must have the same data-type and currently only "discrete" or "gaussian" data are allowed as marginal descriptions, which apply to all columns.
library(rcorex) # make a small dataset df1 <- matrix(c(0,0,0,0,0, 0,0,0,1,1, 1,1,1,0,0, 1,1,1,1,1), ncol=5, byrow = TRUE) # fit biocorex set.seed(1234) fit1 <- biocorex(df1, n_hidden = 2, dim_hidden = 2, marginal_description = "discrete", logpx_method = "pycorex") plot(fit1) summary(fit1) # What was the total correlation for each hidden dimension ? fit1$tcs # Which variables were clustered together? fit1$clusters # Which labels were assigned to each row of data for hidden cluster 1? fit1$labels[, 1] # And for hidden cluster 2? fit1$labels[, 2]
rcorex
can search for hierarchical structure in data by using the labels output from an rcorex
object as the input to the next layer in the hierarchy. This is shown in the following example using R's inbuilt iris
dataset.
library(rcorex) library(ggraph) set.seed(1234) # Load iris dataset data("iris") # Need to convert species factor variable to indicator variables iris <- data.frame(iris , model.matrix(~iris$Species)[,2:3]) iris$Species <- NULL # fit first layer of CorEx layer1 <- biocorex(iris, 3, 2, marginal_description = "gaussian", repeats = 5, logpx_method = "pycorex")
Note the use of the repeats = 5
argument to biocorex
. This acts to run biocorex
not once, but 5 times and biocorex
automatically selects the run which produces the maximal TC to return to the user (unless the return_all_runs
argument is set to TRUE
).
We can then use the labels from layer1
as the input for a second layer of CorEx to discover hierarchical structure. Note that the value used for n_hidden should be lower in the second layer than it was in the first.
# fit second layer of CorEx layer2 <- biocorex(layer1$labels, 1,2, marginal_description = "discrete", repeats = 5, logpx_method = "pycorex") # make a network tidygraph of hierarchical layers g_hier <- make_corex_tidygraph( list(layer1, layer2)) # Plot network graph of hierarchical layers ggraph(g_hier, layout = "fr") + geom_node_point(aes(size = node_size), show.legend = FALSE) + geom_edge_hive(aes(width = thickness), alpha = 0.75, show.legend = FALSE) + scale_edge_width(range = c(0.3, 3)) + geom_node_text(aes(label = names), repel = TRUE) + theme_graph()
Additional hierarchical layers can be identified in larger datasets.
If you find rcorex
useful in academic work please cite the package as follows:
citation("rcorex")
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.