epicorex: Fit epicorex to a dataset
In jpkrooney/rcorex: An Implementation of Total Correlation Explanation

epicorex

R Documentation

Description

Implements the CorEx algorithm for data with features typical of biomedical datasets, such as continuous variables, missing data and under-sampled data, extended to accommodate mixed marginal descriptions (i.e. mixed data types).

Usage

epicorex(
  data,
  n_hidden = 1,
  dim_hidden = 2,
  marginal_description,
  smooth_marginals = FALSE,
  eps = 1e-06,
  verbose = FALSE,
  repeats = 1,
  return_all_runs = FALSE,
  max_iter = 100,
  negcheck_iter = 30,
  neg_limit = -1,
  logpx_method = "pycorex"
)

Arguments

`data`	Data provided by the user. Mixed data types are allowed, e.g. categorical, binomial, continuous or discrete variables in the same dataset.
`n_hidden`	An integer number of hidden variables to search for. Default = 1.
`dim_hidden`	Each hidden unit can take `dim_hidden` discrete values. Default = 2.
`marginal_description`	A character vector that specifies the marginal distribution of the data. For epicorex, `marginal_description` must be a vector of strings of length equal to the number of columns in `data`. Allowable marginal descriptions are currently gaussian, discrete and bernoulli; more may be added later.
`smooth_marginals`	Boolean (TRUE/FALSE) which indicates whether Bayesian smoothing of marginal estimates should be used.
`eps`	The maximal change in TC across 10 iterations needed to signal convergence. Default = 1e-06.
`verbose`	Default FALSE. If TRUE, epicorex reports the iteration count and the TCs at each iteration. Useful for following progress when fitting a larger dataset.
`repeats`	The number of times to run epicorex on the data using random initial values. epicorex returns the run that yields the maximum TC. Default = 1. For a new dataset, leave this at 1 first to see how long epicorex takes; for more trustworthy results a higher number is recommended (e.g. 25).
`return_all_runs`	Default FALSE. If FALSE, epicorex returns a single object of class rcorex. If TRUE, epicorex returns all runs as a list whose length equals `repeats`. In this case the returned results are not rcorex objects; they have the same components as an rcorex object but are of class list.
`max_iter`	Numeric. Maximum number of iterations before the run ends. Default = 100.
`negcheck_iter`	Numeric. The iteration at which to check for persistent negative TCs. If they are detected, the corex run stops. Default = 30.
`neg_limit`	Numeric. At iteration `negcheck_iter`, the TC values of all prior iterations are compared against this value. If all of them are below it, persistent negative TCs are declared. This indicates that some of the data are not well described by the marginal description used. Default = -1.
`logpx_method`	EXPERIMENTAL. A character string that controls the method used to calculate log_p_xi. If "pycorex", the same method as the Python version of biocorex is used; if "mean", log_p_xi is estimated by averaging across the n_hidden estimates. Note that "mean" may become the default option after further testing.
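
The stopping rules described by `eps`, `negcheck_iter` and `neg_limit` can be illustrated with a minimal base R sketch. The helper names `has_converged` and `persistently_negative` are hypothetical and not part of rcorex, and the convergence rule shown (range of the last 10 TC values below `eps`) is only one reading of the description above; the package's internal check may differ.

```r
# Hypothetical helpers, not part of rcorex. Both assume `tc_history` is a
# numeric vector holding the overall TC recorded at each iteration.

# One reading of `eps`: the run has converged when the TC has changed by
# less than `eps` over the last `window` (10) iterations.
has_converged <- function(tc_history, eps = 1e-06, window = 10) {
  n <- length(tc_history)
  if (n < window) {
    return(FALSE)  # not enough iterations recorded yet
  }
  recent <- tc_history[(n - window + 1):n]
  (max(recent) - min(recent)) < eps
}

# Persistent negative TCs: once `negcheck_iter` iterations have run, every
# TC value up to that point lies below `neg_limit`.
persistently_negative <- function(tc_history, negcheck_iter = 30,
                                  neg_limit = -1) {
  length(tc_history) >= negcheck_iter &&
    all(tc_history[seq_len(negcheck_iter)] < neg_limit)
}

# Made-up histories, for illustration only.
flat_tail <- c(seq(0, 1, length.out = 5), rep(1, 10))
has_converged(flat_tail)
persistently_negative(rep(-5, 30))
```

A history that is still climbing, or one with at least one iteration above `neg_limit`, fails the respective check, so the run would continue until `max_iter`.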

Details

This function is an extension of the biocorex function that allows for mixed data types as typically found in epidemiological datasets, e.g. categorical data together with continuous data.
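
Because `marginal_description` needs one entry per column of `data`, it can help to build the vector programmatically. The following is a minimal sketch in base R; the toy data frame, the helper `guess_marginal` and its type-based rules are assumptions made for illustration and are not part of rcorex. The guessed vector should always be reviewed by hand.

```r
# Toy mixed-type data (made-up values, for illustration only).
toy <- data.frame(
  age    = c(34.2, 51.0, 47.5, 29.8),  # continuous
  smoker = c(0L, 1L, 1L, 0L),          # binary
  stage  = c(1L, 3L, 2L, 2L)           # discrete levels
)

# Hypothetical helper (not part of rcorex): guess a marginal description
# for a single column from its values and storage type.
guess_marginal <- function(x) {
  vals <- unique(x[!is.na(x)])
  if (all(vals %in% c(0, 1))) {
    "bernoulli"
  } else if (is.integer(x) || is.factor(x)) {
    "discrete"
  } else {
    "gaussian"
  }
}

# One description per column, in column order.
marginal_description <- vapply(toy, guess_marginal, character(1),
                               USE.NAMES = FALSE)
stopifnot(length(marginal_description) == ncol(toy))

# With rcorex installed, the vector would then be passed to epicorex, e.g.:
# fit <- rcorex::epicorex(toy, n_hidden = 1, dim_hidden = 2,
#                         marginal_description = marginal_description)
```

The call to epicorex is left commented out so that the sketch runs without the package; it uses only arguments listed under Usage.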

Value

Returns either an rcorex object or a list of repeated runs, as determined by the `return_all_runs` argument. An rcorex object is a list that contains the following components:

data - the user data supplied in the call to corex.
call - the call used to run corex.
tcs - a vector of TC values for the n_hidden hidden variables.
alpha - a 2D adjacency matrix of connections between input variables and hidden variables.
p_y_given_x - a 3D array of numerics in the range (0, 1) that represent the probability that each observed x variable belongs to each of the n_hidden latent variables of dimension dim_hidden. p_y_given_x has dimensions (n_hidden, n_samples, dim_hidden).
theta - a list of the estimated parameters.
log_p_y - a 2D matrix representing the log of the marginal probability of the latent variables.
log_z - a 2D matrix containing the pointwise estimate of total correlation explained by each latent variable for each sample - this is used to estimate overall total correlation.
dim_visible - only present if discrete marginals were specified. Lists the number of discrete levels that exist in the data.
iterations - the number of iterations for which the algorithm ran.
tc_history - a list that records the TC results for each iteration of the algorithm.
marginal_description - a character string which determines the marginal distribution of the data.
mis - an array that specifies the mutual information between each observed variable and hidden variable.
clusters - a vector that assigns a hidden variable label to each input variable.
labels - a 2D matrix of dimensions (nrow(data), n_hidden) that assigns a dimension label for each hidden variable to each row of data.
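
When `return_all_runs = TRUE`, the runs come back as a plain list, so choosing among them is left to the user. The sketch below shows one way to do that with mocked runs; only the `tcs` component is mocked, the values are made up, and treating a run's overall TC as the sum of its `tcs` is an assumption rather than something this page states.

```r
# Mock runs standing in for the list returned by
# epicorex(..., repeats = 3, return_all_runs = TRUE).
# Only `tcs` is mocked; the values are made up for illustration.
runs <- list(
  list(tcs = c(0.80, 0.30)),
  list(tcs = c(1.10, 0.45)),
  list(tcs = c(0.95, 0.20))
)

# Assumption: a run's overall TC is the sum of its per-hidden-variable TCs.
total_tc <- vapply(runs, function(run) sum(run$tcs), numeric(1))

# Keep the run with the largest overall TC.
best_index <- which.max(total_tc)
best_run <- runs[[best_index]]
```

Comparing `total_tc` across runs also gives a rough sense of how sensitive the fit is to the random initial values.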