biocorex: Fit biocorex to a dataset

View source: R/biocorex.R

biocorex    R Documentation

Fit biocorex to a dataset

Description

Implements the CorEx algorithm for data with features typical of biomedical datasets, such as continuous variables, missing data and under-sampled data.

Usage

biocorex(
  data,
  n_hidden = 1,
  dim_hidden = 2,
  marginal_description = "gaussian",
  smooth_marginals = FALSE,
  eps = 1e-06,
  verbose = FALSE,
  repeats = 1,
  return_all_runs = FALSE,
  max_iter = 100,
  logpx_method = "pycorex"
)

Arguments

data

Data provided by the user. For biocorex, data can be either continuous (gaussian) or discrete (consecutive integers 0, 1, 2, 3, etc.). Data types cannot be mixed in this implementation.
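A quick pre-flight check can catch data that does not match the discrete format described above. The sketch below uses a hypothetical helper, is_consecutive_discrete(), which is not part of rcorex. It assumes the requirement applies to the set of values across the whole dataset and that missing values are ignored; the package itself may validate input differently.

```r
# Hypothetical helper (not part of rcorex): returns TRUE when every
# non-missing value is a whole number and the observed values are
# exactly 0, 1, ..., max with no gaps.
is_consecutive_discrete <- function(x) {
  vals <- sort(unique(as.vector(as.matrix(x))))  # sort() drops NA values
  if (length(vals) == 0 || !is.numeric(vals)) {
    return(FALSE)
  }
  all(vals == round(vals)) &&
    identical(as.numeric(vals), as.numeric(seq(0, max(vals))))
}

discrete_ok <- matrix(c(0, 1, 2, 1, 0, NA), ncol = 2)
continuous  <- matrix(c(0.3, 1.7, 2.2, -0.4), ncol = 2)

is_consecutive_discrete(discrete_ok)  # TRUE
is_consecutive_discrete(continuous)   # FALSE
```

If the check fails for data intended to be discrete, recoding the levels to 0, 1, 2, ... before fitting is one option.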

n_hidden

An integer number of hidden variables to search for. Default = 1.

dim_hidden

Each hidden unit can take dim_hidden discrete values. Default = 2

marginal_description

Character string which determines the marginal distribution of the data. A single marginal description applies to all variables in biocorex. Can be "gaussian" or "discrete". Default is "gaussian".

smooth_marginals

Boolean (TRUE/FALSE) which indicates whether Bayesian smoothing of marginal estimates should be used.

eps

The maximum change in TC across 10 iterations needed to signal convergence. Default = 1e-06.
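To make the role of eps concrete, the sketch below shows one plausible reading of the convergence rule: the fit is treated as converged once the total TC has varied by less than eps over the last 10 iterations. The helper has_converged() and its window argument are hypothetical, and the input is assumed to be a numeric vector of total TC per iteration; the exact rule used inside biocorex may differ.

```r
# Hypothetical helper (not part of rcorex): TRUE when the spread of the
# total TC over the last `window` iterations is below `eps`.
has_converged <- function(tc_history, eps = 1e-06, window = 10) {
  if (length(tc_history) < window) {
    return(FALSE)  # not enough iterations to judge yet
  }
  recent <- tail(tc_history, window)
  diff(range(recent)) < eps
}

# Simulated TC trajectory that rises quickly and then plateaus.
tc <- 5 * (1 - exp(-(1:60) / 3))

has_converged(tc[1:15])  # FALSE: TC is still climbing
has_converged(tc)        # TRUE: TC has flattened out
```

Under this reading, a smaller eps demands a flatter plateau and so tends to need more iterations, up to the max_iter limit.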

verbose

Default FALSE. If TRUE, biocorex reports the iteration count and TCs to the user at each iteration. Useful for following progress when fitting a larger dataset.

repeats

How many times to run biocorex on the data using random initial values. biocorex will return the run which leads to the maximum TC. Default is 1. For a new dataset, it is recommended to leave this as 1 to see how long biocorex takes; for more trustworthy results, a higher number is recommended (e.g. 25).

return_all_runs

Default FALSE. If FALSE, biocorex returns a single object of class rcorex. If TRUE, biocorex returns all runs of biocorex as a list - the length of which = repeats. In this case the returned results are not rcorex objects, but have the same components as an rcorex object, with class list.
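When all runs are returned, the best one can be picked out by hand. The sketch below uses mock runs that carry only a tcs component, and it assumes that "maximum TC" means the largest sum of tcs across hidden variables; the selection rule used internally is not spelled out here and may differ.

```r
# Mock stand-ins for the list returned when return_all_runs = TRUE.
# Real runs would carry every component of an rcorex object.
runs <- list(
  list(tcs = c(1.20, 0.40)),
  list(tcs = c(1.50, 0.65)),
  list(tcs = c(0.90, 0.80))
)

# Total TC per run (assumed here to be the sum over hidden variables).
total_tc <- vapply(runs, function(run) sum(run$tcs), numeric(1))

# Keep the run with the largest total TC.
best_run <- runs[[which.max(total_tc)]]
```

Inspecting total_tc across runs also gives a rough sense of how sensitive the fit is to the random initial values.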

max_iter

Numeric. Maximum number of iterations before ending. Default = 100.

logpx_method

EXPERIMENTAL - A character string that controls the method used to calculate log_p_xi. If "pycorex", uses the same method as the Python version of biocorex; if "mean", calculates an estimate of log_p_xi by averaging across n_hidden estimates. NOTE: "mean" may become the default option after further testing.
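The arguments above can be combined as in the following minimal sketch. The toy data, the seed and the chosen argument values are illustrative assumptions, and passing a data frame is assumed to be acceptable input. The call is guarded so that the snippet still runs when the rcorex package is not installed.

```r
set.seed(1)
n <- 200
z <- rnorm(n)  # shared signal that makes x1 and x2 correlated

# Illustrative continuous data: two correlated columns, two noise columns.
toy <- data.frame(
  x1 = z + rnorm(n, sd = 0.3),
  x2 = z + rnorm(n, sd = 0.3),
  x3 = rnorm(n),
  x4 = rnorm(n)
)

if (requireNamespace("rcorex", quietly = TRUE)) {
  fit <- rcorex::biocorex(
    toy,
    n_hidden = 2,
    dim_hidden = 2,
    marginal_description = "gaussian",
    repeats = 5,      # several random restarts; the best run is returned
    max_iter = 100
  )
  print(fit$tcs)      # TC for each hidden variable
}
```

Starting with repeats = 1 on a new dataset, as recommended above, gives a feel for run time before increasing the number of restarts.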

Details

This function is a port of the original biocorex function in Python by Greg Ver Steeg: https://github.com/gregversteeg/bio_corex. Reference: Greg Ver Steeg and Aram Galstyan. "Discovering Structure in High-Dimensional Data Through Correlation Explanation." NIPS, 2014. arXiv preprint arXiv:1406.1222.

Value

Returns either an rcorex object or a list of repeated runs, as determined by the return_all_runs argument. An rcorex object is a list that contains the following components:

  1. data - the user data supplied in call to corex.

  2. call - the call used to run corex.

  3. tcs - a vector of TC for n_hidden variables.

  4. alpha - a 2D adjacency matrix of connections between input variables and hidden variables.

  5. p_y_given_x - a 3D array of numerics in range (0, 1) that represent, for each sample, the probability of each of the dim_hidden states of each of the n_hidden latent variables. p_y_given_x has dimensions (n_hidden, n_samples, dim_hidden).

  6. theta - a list of the estimated parameters

  7. log_p_y - a 2D matrix representing the log of the marginal probability of the latent variables.

  8. log_z - a 2D matrix containing the pointwise estimate of total correlation explained by each latent variable for each sample - this is used to estimate overall total correlation.

  9. dim_visible - only present if discrete marginals were specified. Lists the number of discrete levels that exist in the data.

  10. iterations - the number of iterations for which the algorithm ran.

  11. tc_history - a list that records the TC results for each iteration of the algorithm.

  12. marginal_description - a character string which determines the marginal distribution of the data.

  13. mis - an array that specifies the mutual information between each observed variable and hidden variable.

  14. clusters - a vector that assigns a hidden variable label to each input variable.

  15. labels - a 2D matrix of dimensions (nrow(data), n_hidden) that assigns a dimension label for each hidden variable to each row of data.
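The shapes of these components are easier to see with a small mock. The sketch below builds a mock p_y_given_x and alpha by hand and derives label-like and cluster-like summaries by taking the most probable state and the strongest connection. These derivations, the 1-based labels, and the (n_hidden, n_variables) orientation of alpha are assumptions for illustration, not a statement of how biocorex computes labels or clusters.

```r
n_hidden <- 2
n_samples <- 3
dim_hidden <- 2

# Mock p_y_given_x with dimensions (n_hidden, n_samples, dim_hidden).
p_y_given_x <- array(0, dim = c(n_hidden, n_samples, dim_hidden))
p_y_given_x[1, , ] <- rbind(c(0.9, 0.1), c(0.2, 0.8), c(0.6, 0.4))
p_y_given_x[2, , ] <- rbind(c(0.3, 0.7), c(0.7, 0.3), c(0.1, 0.9))

# Most probable state of each hidden variable for each sample.
# Result has dimensions (n_samples, n_hidden); states are numbered from 1 here.
labels <- apply(p_y_given_x, c(2, 1), which.max)

# Mock alpha, assumed here to be (n_hidden, n_variables).
alpha <- rbind(c(1, 1, 0, 0), c(0, 0, 1, 1))
colnames(alpha) <- c("x1", "x2", "x3", "x4")

# Hidden variable with the strongest connection to each input variable.
clusters <- apply(alpha, 2, which.max)
```

With a real fit, the same indexing pattern applies to the returned components directly, for example fit$p_y_given_x[1, , ] for the state probabilities of the first hidden variable.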


jpkrooney/rcorex documentation built on July 25, 2022, 1:37 a.m.