biocorex: Fit biocorex to a dataset
In jpkrooney/rcorex: An Implementation of Total Correlation Explanation

biocorex

R Documentation

Description

Implements the CorEx algorithm for data with features typical of biomedical datasets, such as continuous variables, missing data and under-sampled data.

Usage

biocorex(
  data,
  n_hidden = 1,
  dim_hidden = 2,
  marginal_description = "gaussian",
  smooth_marginals = FALSE,
  eps = 1e-06,
  verbose = FALSE,
  repeats = 1,
  return_all_runs = FALSE,
  max_iter = 100,
  logpx_method = "pycorex"
)
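
A call using the signature above might look like the following minimal sketch. The data are simulated purely for illustration, the call is guarded so the script still runs when the rcorex package is not installed, and passing a numeric matrix as `data` is an assumption (this page does not say whether a matrix or a data frame is expected).

```r
# Sketch only: simulated data, not a real biomedical dataset.
set.seed(1)

# 200 samples of 6 continuous variables. The first three columns share one
# latent factor (z1) and the last three share another (z2).
n <- 200
z1 <- rnorm(n)
z2 <- rnorm(n)
sim <- cbind(
  z1 + rnorm(n, sd = 0.5), z1 + rnorm(n, sd = 0.5), z1 + rnorm(n, sd = 0.5),
  z2 + rnorm(n, sd = 0.5), z2 + rnorm(n, sd = 0.5), z2 + rnorm(n, sd = 0.5)
)
colnames(sim) <- paste0("v", 1:6)

fit <- NULL
# Only attempt the fit if rcorex is available.
if (requireNamespace("rcorex", quietly = TRUE)) {
  # Assumption: a numeric matrix is an acceptable `data` argument.
  fit <- rcorex::biocorex(
    data = sim,
    n_hidden = 2,                       # look for two hidden variables
    dim_hidden = 2,                     # each hidden variable is binary
    marginal_description = "gaussian",  # all columns are continuous
    repeats = 5,                        # keep the best of 5 random starts
    max_iter = 100
  )
  print(fit$tcs)
}
```

The argument names and defaults are taken from the Usage block; the values chosen for `n_hidden` and `repeats` are arbitrary.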

Arguments

`data`	Data provided by the user. For biocorex, data can be either continuous (gaussian) or discrete (consecutive integers 0, 1, 2, 3, etc.). Data types cannot be mixed in this implementation.
`n_hidden`	An integer number of hidden variables to search for. Default = 1.
`dim_hidden`	Each hidden unit can take `dim_hidden` discrete values. Default = 2.
`marginal_description`	Character string which determines the marginal distribution of the data. In biocorex a single marginal description applies to all variables. Can be "gaussian" or "discrete". Default is "gaussian".
`smooth_marginals`	Boolean (TRUE/FALSE) which indicates whether Bayesian smoothing of marginal estimates should be used. Default is FALSE.
`eps`	The maximal change in TC across 10 iterations that is allowed for convergence to be signalled. Default = 1e-06.
`verbose`	Default FALSE. If TRUE, biocorex reports the iteration count and TCs at each iteration. Useful for following progress when fitting a larger dataset.
`repeats`	How many times to run biocorex on the data using random initial values. biocorex returns the run which leads to the maximum TC. Default is 1. For a new dataset, it is recommended to leave this as 1 to see how long biocorex takes; for more trustworthy results, a higher number is recommended (e.g. 25).
`return_all_runs`	Default FALSE. If FALSE, biocorex returns a single object of class rcorex. If TRUE, biocorex returns all runs as a list whose length equals `repeats`. In this case the returned results are not rcorex objects; each has the same components as an rcorex object but has class list.
`max_iter`	Numeric. Maximum number of iterations before ending. Default = 100.
`logpx_method`	EXPERIMENTAL. A character string that controls the method used to calculate log_p_xi. If "pycorex", the same method as the Python version of biocorex is used; if "mean", log_p_xi is estimated by averaging across the n_hidden estimates. Note that "mean" may become the default option after further testing.
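
Because discrete input must be coded as consecutive integers starting at 0, raw categorical columns usually need recoding before being passed as `data` with `marginal_description = "discrete"`. The base R sketch below shows one way this could be done. The helper `to_zero_based` is hypothetical and not part of rcorex, and leaving missing values as NA is an assumption, since this page does not state how missing entries should be encoded.

```r
# Hypothetical helper (not part of rcorex): recode one column so that its
# observed levels become consecutive integers starting at 0.
to_zero_based <- function(x) {
  as.integer(factor(x)) - 1L
}

# Made-up example data: a character column and a numeric column whose
# codes (1, 3, 7) are not consecutive.
raw <- data.frame(
  genotype = c("AA", "AG", "GG", "AG", NA),
  stage    = c(1, 3, 3, 7, 1)
)

# Recode column by column; NA entries stay NA (an assumption, see above).
coded <- as.data.frame(lapply(raw, to_zero_based))
print(coded)
```

Note that this recoding keeps only the ordering of the levels: the stage codes 1, 3 and 7 become 0, 1 and 2, so any spacing between the original values is lost.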

Details

This function is a port of the original biocorex function in Python by Greg Ver Steeg: https://github.com/gregversteeg/bio_corex. Reference: Greg Ver Steeg and Aram Galstyan. "Discovering Structure in High-Dimensional Data Through Correlation Explanation." NIPS, 2014. arXiv preprint arXiv:1406.1222.

Value

Returns either an rcorex object or a list of repeated runs, as determined by the return_all_runs argument. An rcorex object is a list that contains the following components:

data - the user data supplied in call to corex.
call - the call used to run corex.
tcs - a vector of TC for n_hidden variables.
alpha - a 2D adjacency matrix of connections between input variables and hidden variables.
p_y_given_x - a 3D array of numerics in range (0, 1), that represent the probability that each observed x variable belongs to n_hidden latent variables of dimension dim_hidden. p_y_given_x has dimensions (n_hidden, n_samples, dim_hidden).
theta - a list of the estimated parameters
log_p_y - a 2D matrix representing the log of the marginal probability of the latent variables.
log_z - a 2D matrix containing the pointwise estimate of total correlation explained by each latent variable for each sample - this is used to estimate overall total correlation.
dim_visible - only present if discrete marginals were specified. Lists the number of discrete levels that exist in the data.
iterations - the number of iterations for which the algorithm ran.
tc_history - a list that records the TC results for each iteration of the algorithm.
marginal_description - a character string which determines the marginal distribution of the data.
mis - an array that specifies the mutual information between each observed variable and hidden variable.
clusters - a vector that assigns a hidden variable label to each input variable.
labels - a 2D matrix of dimensions (nrow(data), n_hidden) that assigns a dimension label for each hidden variable to each row of data.
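
To make the relationship between the documented shapes of p_y_given_x and labels concrete, the sketch below builds a mock probability array with the layout (n_hidden, n_samples, dim_hidden) and picks the most probable state of each hidden variable for each sample. Taking the most probable state is an assumption made for illustration: this page does not say how biocorex derives labels, nor whether its labels start at 0 or 1. The array is random and does not come from a fitted model.

```r
# Mock array with the documented layout (n_hidden, n_samples, dim_hidden).
set.seed(42)
n_hidden <- 2
n_samples <- 5
dim_hidden <- 3
p <- array(runif(n_hidden * n_samples * dim_hidden),
           dim = c(n_hidden, n_samples, dim_hidden))

# Normalise so that, for each hidden variable and sample, the probabilities
# over the dim_hidden states sum to 1.
p <- sweep(p, c(1, 2), apply(p, c(1, 2), sum), "/")

# Most probable state per (hidden variable, sample), transposed to the
# documented labels layout (n_samples, n_hidden). Indices are 1-based here.
state <- t(apply(p, c(1, 2), which.max))
print(state)
```

With a real fit, the same indexing would apply to the p_y_given_x component, and the result could be compared against the labels component to check the assumed rule.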