epicorex: Fit epicorex to a dataset

View source: R/epicorex.R

epicorex R Documentation

Fit epicorex to a dataset

Description

Implements the CorEx algorithm for data with features typical of biomedical datasets, such as continuous variables, missing data and under-sampled data, extended to accommodate mixed marginal descriptions (i.e. mixed data types).

Usage

epicorex(
  data,
  n_hidden = 1,
  dim_hidden = 2,
  marginal_description,
  smooth_marginals = FALSE,
  eps = 1e-06,
  verbose = FALSE,
  repeats = 1,
  return_all_runs = FALSE,
  max_iter = 100,
  negcheck_iter = 30,
  neg_limit = -1,
  logpx_method = "pycorex"
)

Arguments

data

Data provided by the user. Mixed data types are allowed, e.g. categorical, binomial, continuous and discrete variables in the same dataset.

n_hidden

An integer number of hidden variables to search for. Default = 1.

dim_hidden

The number of discrete values each hidden unit can take. Default = 2.

marginal_description

A character vector that specifies the marginal distribution of each column of data. For epicorex, marginal_description must have length equal to the number of columns in data. Allowable marginal descriptions are currently gaussian, discrete and bernoulli; more may be added later.
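As a minimal sketch of how such a vector might be built, the base R snippet below derives one label per column of a small made-up data frame. The data frame `df`, the helper `describe_column()` and its classification rules are illustrative assumptions and are not part of rcorex; only the three labels come from the description above.

```r
# Made-up mixed-type data; column names and values are illustrative only.
set.seed(1)
df <- data.frame(
  age    = rnorm(20, mean = 50, sd = 10),  # continuous
  smoker = rep(c(0, 1, 0, 0), times = 5),  # binary 0/1
  stage  = rep(0:3, times = 5)             # several discrete levels
)

# Hypothetical helper: pick one of the allowed labels for a column.
describe_column <- function(x) {
  vals <- unique(x[!is.na(x)])
  if (all(vals %in% c(0, 1))) return("bernoulli")
  if (all(vals == round(vals))) return("discrete")
  "gaussian"
}

marginal_description <- vapply(df, describe_column, character(1),
                               USE.NAMES = FALSE)

# One label per column, in column order.
stopifnot(length(marginal_description) == ncol(df))
print(marginal_description)
```

An automatic rule like this is only a starting point; the labels are best reviewed by hand, since an integer-coded column may still be better described as gaussian.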

smooth_marginals

Boolean (TRUE/FALSE) which indicates whether Bayesian smoothing of marginal estimates should be used.

eps

The maximal change in TC across 10 iterations needed to signal convergence. Default = 1e-06.
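One possible reading of this criterion can be sketched in base R as follows. The helper `has_converged()` and its use of the range of the last 10 TC values are assumptions made for illustration; the actual check inside rcorex may be computed differently.

```r
# Hypothetical helper: TRUE when the last `window` TC values differ by
# less than `eps`. This is an illustrative interpretation, not rcorex code.
has_converged <- function(tc_history, eps = 1e-06, window = 10) {
  if (length(tc_history) < window) return(FALSE)
  recent <- tail(tc_history, window)
  (max(recent) - min(recent)) < eps
}

# Made-up TC traces: one still rising, one that has flattened out.
tc_rising <- cumsum(rep(0.1, 15))
tc_flat   <- c(tc_rising, rep(1.5, 10))

conv_rising <- has_converged(tc_rising)
conv_flat   <- has_converged(tc_flat)
print(c(rising = conv_rising, flat = conv_flat))
```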

verbose

Default FALSE. If TRUE, epicorex reports the iteration count and the TCs at each iteration. Useful for following progress when fitting a larger dataset.

repeats

How many times to run epicorex on the data using random initial values. epicorex returns the run with the maximum TC. Default is 1. For a new dataset, it is recommended to leave this as 1 to see how long epicorex takes; for more trustworthy results a higher number is recommended (e.g. 25).

return_all_runs

Default FALSE. If FALSE, epicorex returns a single object of class rcorex. If TRUE, epicorex returns all runs as a list whose length equals repeats. In this case the returned results are not rcorex objects; they have the same components as an rcorex object but are of class list.
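When all runs are returned, the best one can be picked by hand. The sketch below uses mock runs, each a plain list with a tcs component as described under Value; the numbers are made up, and summing tcs to get a run's overall TC is an assumption rather than documented behaviour.

```r
# Mock runs standing in for the list returned when return_all_runs = TRUE.
# Values are invented for illustration.
runs <- list(
  list(tcs = c(0.8, 0.3)),
  list(tcs = c(1.1, 0.4)),
  list(tcs = c(0.9, 0.2))
)

# Assumed overall TC of a run: the sum of its per-hidden-variable TCs.
total_tc <- vapply(runs, function(r) sum(r$tcs), numeric(1))

best_index <- which.max(total_tc)
best_run   <- runs[[best_index]]
print(best_index)
```

Keeping all runs also makes it possible to check how much the TC varies between random initialisations before trusting a single fit.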

max_iter

numeric. Maximum number of iterations before the algorithm stops. Default = 100.

negcheck_iter

numeric. Number of iterations at which to check for persistently negative TCs. If detected, the corex run stops. The default is 30.

neg_limit

numeric. At iteration negcheck_iter, the TC values of all prior iterations are compared with this value. If all of them are below it, persistent negative TC is declared. This indicates that some of the data are not well described by the marginal description used. The default is -1.
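The interaction between negcheck_iter and neg_limit can be sketched in base R as below. The helper `persistent_negative()` is hypothetical and assumes one TC value is recorded per iteration; it illustrates the rule described above and is not the package's internal code.

```r
# Hypothetical helper: TRUE when negcheck_iter iterations have run and
# every TC value so far is below neg_limit.
persistent_negative <- function(tc_history, negcheck_iter = 30, neg_limit = -1) {
  if (length(tc_history) < negcheck_iter) return(FALSE)
  all(tc_history[seq_len(negcheck_iter)] < neg_limit)
}

# Made-up TC traces.
always_negative <- rep(-5, 30)             # below the limit throughout
recovers        <- c(rep(-5, 29), 0.2)     # rises above the limit once
too_short       <- rep(-5, 10)             # check point not yet reached

flags <- c(
  always_negative = persistent_negative(always_negative),
  recovers        = persistent_negative(recovers),
  too_short       = persistent_negative(too_short)
)
print(flags)
```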

logpx_method

EXPERIMENTAL - A character string that controls the method used to calculate log_p_xi. If "pycorex", the same method as the Python version of biocorex is used; if "mean", log_p_xi is estimated by averaging across the n_hidden estimates. Note that "mean" may become the default option after further testing.

Details

This function is an extension of the biocorex function that allows for mixed data types as typically found in epidemiological datasets, e.g. categorical data together with continuous data.

Value

Returns either an rcorex object or a list of repeated runs, as determined by the return_all_runs argument. An rcorex object is a list that contains the following components:

  1. data - the user data supplied in call to corex.

  2. call - the call used to run corex.

  3. tcs - a vector of TC for n_hidden variables.

  4. alpha - a 2D adjacency matrix of connections between input variables and hidden variables.

  5. p_y_given_x - a 3D array of numerics in range (0, 1), that represent the probability that each observed x variable belongs to n_hidden latent variables of dimension dim_hidden. p_y_given_x has dimensions (n_hidden, n_samples, dim_hidden).

  6. theta - a list of the estimated parameters.

  7. log_p_y - a 2D matrix representing the log of the marginal probability of the latent variables.

  8. log_z - a 2D matrix containing the pointwise estimate of total correlation explained by each latent variable for each sample - this is used to estimate overall total correlation.

  9. dim_visible - only present if discrete marginals were specified. Lists the number of discrete levels that exist in the data.

  10. iterations - the number of iterations for which the algorithm ran.

  11. tc_history - a list that records the TC results for each iteration of the algorithm.

  12. marginal_description - a character string which determines the marginal distribution of the data.

  13. mis - an array that specifies the mutual information between each observed variable and hidden variable.

  14. clusters - a vector that assigns a hidden variable label to each input variable.

  15. labels - a 2D matrix of dimensions (nrow(data), n_hidden) that assigns a dimension label for each hidden variable to each row of data.
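A hedged usage sketch follows, showing how a call and a look at a few of the components listed above might fit together. The data frame `df` and its columns are made up, and whether epicorex needs a particular encoding for discrete or bernoulli columns is not stated here, so the 0-based integer coding is an assumption. The call is guarded so that the snippet also runs where rcorex is not installed.

```r
# Made-up mixed-type data; names and values are illustrative only.
set.seed(42)
df <- data.frame(
  bmi      = rnorm(50, mean = 27, sd = 4),  # continuous
  diabetic = rep(c(0, 1), times = 25),      # binary 0/1
  grade    = rep(0:4, times = 10)           # discrete levels (assumed coding)
)

# One marginal description per column, in column order.
md <- c("gaussian", "bernoulli", "discrete")

# Only call epicorex when the rcorex package is available.
if (requireNamespace("rcorex", quietly = TRUE)) {
  fit <- rcorex::epicorex(
    data = df,
    n_hidden = 1,
    dim_hidden = 2,
    marginal_description = md,
    repeats = 5
  )
  print(fit$tcs)           # TC for each hidden variable
  print(fit$clusters)      # hidden variable assigned to each input variable
  print(head(fit$labels))  # hidden-variable labels for the first rows
}
```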


jpkrooney/rcorex documentation built on July 25, 2022, 1:37 a.m.