lcm_tree: wrapper function for fitting and summaries

View source: R/lcm_tree.R

lcm_treeR Documentation

wrapper function for fitting and summaries

Description

wrapper function for fitting and summaries

Usage

lcm_tree(
  Y,
  leaf_ids,
  mytree,
  weighted_edge = TRUE,
  Z_obs = NULL,
  ci_level = 0.95,
  get_lcm_by_group = FALSE,
  update_hyper_freq = 50,
  print_freq = 10,
  quiet = FALSE,
  plot_fig = FALSE,
  shared_tau = FALSE,
  hyper_fixed = list(K = 3),
  tol = 1e-08,
  tol_hyper = 1e-04,
  max_iter = 5000,
  nrestarts = 3,
  keep_restarts = TRUE,
  parallel = TRUE,
  log_restarts = FALSE,
  log_dir = ".",
  vi_params_init = list(),
  hyperparams_init = list(),
  random_init = FALSE,
  random_init_vals = list(tau_lims = c(0.5, 1.5), psi_sd_frac = 0.2, phi_sd_frac = 0.2,
    mu_gamma_sd_frac = 0.2, mu_alpha_sd_frac = 0.2, u_sd_frac = 0.2),
  allow_continue = FALSE
)

Arguments

Y

An n x J matrix of binary (0 or 1) measurements, where each of n observation may belong to a leaf node; we aim to group n subjects into groups that are possibly coarser than the leaf-induced groups, and in each group, we have a homogeneous latent class model

leaf_ids

Character vector of length n. leaf_ids[i] is a string indicating the leaf_id for unit i (e.g., the sequence type of an E.coli isolate).

mytree

A directed igraph object. This is a tree representing the relationships among the leaf_ids. Every leaf node represents an leaf_id, and internal nodes represent leaf_ids subgroups consisting of their leaf descendants. All nodes of mytree must have unique names as given by names(V(mytree)). The names of the leaves must be equal to the number of unique elements of leaf_ids. The vertices of mytree, V(mytree), may have an attribute levels containing integer values from 1 to max(V(mytree)$levels). In this case, the levels attribute specifies groups of nodes that share common hyperparameters rho[f], tau_1[f], and tau_2[f], where f is an integer that represents a level. If V(mytree)$levels is NULL, the default is two levels of hyperparameters: one for all leaf nodes, and one for all internal nodes. NB: this needs to be checked in the design_tree() # <———-??

weighted_edge

default to TRUE, which indicates the model will use the values of branch lengths (h_pau of V(mytree)).

Z_obs

a matrix of two columns, first column is subject ids (of all samples), the second column is a mix of NA or integers: NA for unknown class memberships and integers for known classes.

ci_level

A number between 0 and 1 giving the desired credible interval. For example, ci_level = 0.95 (the default) returns a 95% credible interval

get_lcm_by_group

If TRUE, lotR will also return the maximum likelihood estimates of the coefficients for each leaf_ids group discovered by the model. Default is TRUE.

update_hyper_freq

How frequently to update hyperparameters. Default = every 50 iterations.

print_freq

How often to print out iteration number and current value of epsilon (the difference in objective function value for the two most recent iterations).

quiet

default to FALSE, which prints empirical class probabilities and updates on tau's

plot_fig

plot figure about prob (the probability of each node diffuse from the parent node, i.e., s_u=1 for using the slab component) and response profile (1st node)

shared_tau

logical: TRUE for sharing the tau_1 for alpha's in the same node; similarly for tau_2 (gamma's). Default is FALSE

hyper_fixed

Fixed values of hyperprior parameters for rho, the probability of a slab component (s_u=1). This should be a list with two elements: a and b, both numeric vectors of length L, representing the parameters of the beta prior on rho for each level, where L is the number of levels. Default is list(a = rep(1, L), b = rep(1, L), tau_update_levels = c(1,2)) (uniform hyperprior) Other options include specifying included or excluded nodes via e.g., s_u_zeroset = (1:265)[-c(1)],s_u_oneset = c(1)) to fit a single big LCM (collapsing all nodes except the root node, the first node)

tol

Convergence tolerance for the objective function. Default is 1E-8.

tol_hyper

The convergence tolerance for the objective function between subsequent hyperparameter updates. Typically it is a more generous tolerance than tol. Default is 1E-4.

max_iter

Maximum number of iterations of the VI algorithm. Default is 5000. NB: check this number before package submission.

nrestarts

Number of random re-starts of the VI algorithm. The restart that gives the highest value of the objective function will be returned. It is recommended to choose nrestarts > 1; The default is 3.

keep_restarts

If TRUE, the results from all random restarts will be returned. If FALSE, only the restart with the highest objective function is returned. ' Default is TRUE.

parallel

If TRUE, the random restarts will be run in parallel. It is recommended to first set the number of cores using doParallel::registerDoParallel(). Otherwise, the default number of cores specified by the doParallel package will be used. Default is TRUE.

log_restarts

If TRUE, when nrestarts > 1 progress of each random restart will be logged to a text file in log_dir. If FALSE and nrestarts > 1, progress will not be shown. If nrestarts = 1, progress will always be printed to the console. Default is FALSE.

log_dir

Directory for logging progress of random restarts. Default is the working directory.

vi_params_init, hyperparams_init

Named lists containing initial values for the variational parameters and hyperparameters. Supplying good initial values can be challenging, and lotR() provides a way to guess initial values based on transformations of latent class model estimates for each individual leaf_ids (see initialize_tree_lcm()). The most common use for vi_params_init and hyperparams_init is to supply starting values based on previous output from lotR(); see the vignette('lotR') for examples. The user can provide initial values for all parameters or a subset. When initial values for one or more parameters are not supplied, the missing values will be filled in by initialize_tree_lcm().

random_init

If TRUE, some random variability will be added to the initial values. The default is FALSE, unless nrestarts > 1, in which case random_init will be set to TRUE and a warning message will be printed. The amount of variability is determined by random_init_vals.

random_init_vals

If random_init = TRUE, this is a list containing the following parameters for randomly permuting the initial values:

tau_lims

a vector of length 2, where tau_lims[1] is between 0 and 1, and tau_lims[2] > 1. The initial values for the hyperparameter tau will be chosen uniformly at random in the range (tau_init * tau_lims[1], tau_init * tau_lims[2]), where tau_init is the initial value for tau either supplied in hyperparams_init or guessed using initialize_tree_lcm().

psi_sd_frac

a value between 0 and 1. The initial values for the auxiliary parameters psi will have a normal random variate added to them with standard deviation equal to psi_sd_frac multiplied by the initial value for eta either supplied in hyperparams_init or guessed using initialize_tree_lcm(). Absolute values are then taken for any values of psi that are < 0.

phi_sd_frac

same as above

.

mu_gamma_sd_frac

a value between 0 and 1. The initial values for mu will have a normal random variate added to them with standard deviation equal to mu_sd_frac multiplied by the absolute value of the initial value for mu_gamma_sd_frac either supplied in vi_params_init or guessed using initialize_tree_lcm().

mu_alpha_sd_frac

same as above.

u_sd_frac

a value between 0 and 1. The initial value for the node inclusion probabilities will first be transformed to the log odds scale to obtain u. A normal random variate will be added to u with standard deviation equal to u_sd_frac multiplied by the absolute value of the initial value for u either supplied in vi_params_init or guessed using moretrees_init_logistic(). u will then be transformed back to the probability scale.

allow_continue

logical, TRUE to save results so can continue running the VI updates with the last iteration from the old results.

Value

a list also of class "lcm_tree"

res <- make_list(mod,mod_restarts,mytree,dsgn,prob_est,est_ad_hoc) class(res) <- c("lcm_tree","list")

See Also

Other lcm_tree functions: continue_lcm_tree()

Examples

### Example workflow of `lotR`
### Zhenke Wu | zhenkewu@gmail.com
### August 04, 2020
### This example has true parameters following the tree structured priors.
rm(list=ls())
library(lotR)

data("lotR_example_edges")
library(igraph)
tr <- graph_from_edgelist(lotR_example_edges, directed = TRUE) # Plot tree
# Here in this example, we use equal weighted edges.

# If needed, first install ggtree
# if (!requireNamespace("BiocManager", quietly = TRUE))
# install.packages("BiocManager")
# BiocManager::install("ggtree")
library(ggtree)
library(ggplot2)
ggtree(tr, ladderize = FALSE, layout = "slanted") +
  geom_tiplab(geom = "label") + geom_nodelab(geom = "label") +
  theme(plot.margin = unit(c(0, 1.5, 0, 0.2), "cm")) +
  coord_cartesian(clip = "off") + scale_y_reverse()

leaves <- names(igraph::V(tr)[degree(tr, mode = "out") == 0])

# set levels: leaves 2, non-leaves 1
igraph::V(tr)$levels <- rep(1,length(igraph::V(tr)))
igraph::V(tr)$levels[match(names(igraph::V(tr)[igraph::degree(tr, mode = "out") == 0]),
                   names(igraph::V(tr)))] <- 2

igraph::E(tr)$weight <- rep(1,length(E(tr)))
###############################################################################
## illustration of lotR on simulated data:
###############################################################################
data("lotR_example_data_tree") # assumes a single pi for all observations.
Y     <- lotR_example_data_tree$Y
Z_obs <- lotR_example_data_tree$Z_obs
curr_leaves <- lotR_example_data_tree$curr_leaves
theta <- lotR_example_data_tree$truth$itemprob
pi_mat <- lotR_example_data_tree$truth$pi_mat
# in the simulation, the 1st class has the highest level of probabilities,
# so we will reorder obtained estimates to match the meaning of classes.
Z     <- lotR_example_data_tree$truth$Z

## check if all leave names are present in the data.
setequal(leaves, unique(curr_leaves))
#> [1] TRUE
knitr::kable(table(curr_leaves))

K     <- nrow(theta)
p     <- length(V(tr))
J     <- ncol(Y)

dsgn0  <- design_tree(Y,curr_leaves,tr,weighted_edge = FALSE,Z_obs)

nrestarts <- 1
# doParallel::registerDoParallel(cores = nrestarts)
# log_dir <- tempdir()
# dir.create(log_dir)
set.seed(345083)
par(mfrow=c(3,3));plot(0,0)
mod0     <- lcm_tree(Y,curr_leaves,tr,
                     weighted_edge = !TRUE,
                     Z_obs         = Z_obs,
                     parallel      = TRUE,
                     hyper_fixed   = list(K=K,a=c(1,1),b=c(1,1),
                                          tau_update_levels = c(1,2),
                                          # should this be set as default?
                                          s_u_zeroset = NULL,
                                          s_u_oneset = c(1)),
                     # hyperparams_init = list(tau_1=matrix(9/4,nrow=2,ncol=K-1),
                     #                         tau_2=array(9/4,c(2,J,K))),
                     vi_params_init = list(prob=rep(0.95,p)),
                     random_init = !TRUE,
                     nrestarts     = nrestarts,
                     quiet         = !TRUE,
                     plot_fig      = TRUE,
                     shared_tau    = FALSE,
                     get_lcm_by_group = TRUE, #<--- check what are the prereq.
                     print_freq    = 10,update_hyper_freq = 50, max_iter = 200,
                     tol           = 1e-7,
                     tol_hyper     = 1e-4,
                     allow_continue = FALSE)#,
#log_restarts =!TRUE,
#log_dir = log_dir)

###############################################################################
# print out summaries of group specific estimates:
###############################################################################
# plot(mod0,layout = "slanted", horizontal = FALSE)
plot(mod0)

print(mod0)

which(mod0$mod$vi_params$prob>0.5)
unique(lotR_example_data_tree$truth$pi_mat)

###############################################################################
# print out performance metrics (only possible when doing simulation)
###############################################################################
res <- mod0

#------------------------------------------------#
#  estimates based on data-driven leaf groups.
#------------------------------------------------#

# 1. RMSE
## truth vs grouped:
# prob_est is with node_select (grouped estimates). NB: need to do individual
# leaf estimates.
# permute the estimated classes to best match the truth in terms of the response
# probability profiles.
itemprob_truth   <- t(lotR_example_data_tree$truth$itemprob)
itemprob_est     <- res$prob_est$theta_collapsed
# we set prob_gamma to be all zero except for the root node;
# so theta_collapsed is just the shared response probability profiles.
permute_est      <- opt_colpermB(itemprob_truth,itemprob_est)
# permute estimated classes to best match truth.

A <- lotR_example_data_tree$truth$pi_mat # leaf level; in truth,
                                       # we have group structure.
B <- res$prob_est$pi[,permute_est,drop=FALSE] #leaf level

rmse_bias_grp <- rmse_bias_AB(A,B)
print(rmse_bias_grp)

## Do the above for separate ad hoc estimates.

# 2. adjusted Rand index (aRI):
true_grouping <- factor(apply(A,1,paste,collapse=""),
                        levels = unique(apply(A,1,paste,collapse="")))
# the true grouping labels are following the order with
# which the leaf nodes were coded in the tree.

# mclust::adjustedRandIndex(true_grouping,res$prob_est$group)

# 3. misclassification
Az <- lotR_example_data_tree$truth$Z
Bz <- apply(res$mod$vi_params$rmat[,permute_est,drop=FALSE],1,which.max)
table(Az,Bz)
sum((Az!=Bz)/length(Az)) # misclassification rate

# 4. coverage (conditional: need to decide if a true group is in the estimated groups)
gtab <- table(true_grouping,res$prob_est$group)
# get the total number of estimated groups:
G <- nrow(gtab)
for (g in 1:G){# for each true group
  if (sum(gtab[g,]>0)==1){ # # did true group get split into multiple estimated groups?
    # if No...
    h <- as.integer(which(gtab[g,]>0)) # which estimated group contains
    # all leaf nodes in true group g:
    if (sum(gtab[,h]>0)==1){ # does this estimated group contain other
      # leaf nodes, not in true group g:
      # if no, then true group g and estimated group h are identical.
      # Proceed with estimating coverage.

      # Did the lotR estimate cover the truth?
      interval_est <- summary(res)[[h]]$est[permute_est,,drop=FALSE]
      v <-  which(as.integer(true_grouping)==g)[1] # pick a leaf node in the group.
      true_curr <- lotR_example_data_tree$truth$pi_mat[v,]
      covered <- (true_curr >= interval_est[,"cil_pi"]) &
        (true_curr <= interval_est[,"ciu_pi"])

      # Did the ad hoc group-specific LCM cover the truth?
      # NB: this is not done yet; we have point estimates,
      # but not yet extracted the posterior sd yet.
      cat("True Group: ", g, "\n")
      print(data.frame(truth = round(lotR_example_data_tree$truth$pi_mat[v,],3),
                       covered=covered))
      cat("----------\n")
    }
  }

}


#--------------------------------------#
# Individual leaf nodes estimates.
#--------------------------------------#

# 1. RMSE
## truth vs individual nodes:
## no node_select - the leaf nodes have distinct estimates of the pie

A <- lotR_example_data_tree$truth$pi_mat # leaf level; in truth,
                                         # we have group structure.
B <-  res$prob_est_indiv$pi[,permute_est,drop=FALSE] #leaf level

rmse_bias_indiv <- rmse_bias_AB(A,B)
print(rmse_bias_indiv)






#
# K = 2; simulate data
#

# # This code is not designed as an example for a particular function,
# # but it simulates the example data set: lotR_example_data_tree_K2
# ## This is the code for fixing K=2 issues.
#
# rm(list=ls())
# library(igraph)
# set.seed(2020)
# n = 3000
# tau   <- c(0.7,0.3)
# itemprob <- rbind(rep(rep(c(0.9, 0.9), each = 1),9),
#                   rep(rep(c(0.1, 0.1), each = 1),9))
#
# data("lotR_example_edges")
# mytree <- igraph::graph_from_edgelist(lotR_example_edges, directed = TRUE)
# # Plot tree
#
# nodes  <- names(igraph::V(mytree))
# leaves <- names(igraph::V(mytree)[degree(mytree, mode = "out") == 0])
# pL = length(leaves)
# p  = length(igraph::V(mytree))
#
# ######
# K = nrow(itemprob)
# # specify the nodes that have non-trivial alpha_u, this was called
# # xi_u, because xi_u = s_u*alpha_u and s_u = 1 if we set it in the simulation.
# alpha_mat = rbind(logit(prob2stick(tau)[-K]),
#                   c(-1),
#                   c(1),
#                   matrix(0,nrow=p-3,ncol=K-1)
# )
#
# # get lists of ancestors for each leaf_ids:
# d <- igraph::diameter(mytree,weights=NA)
# # need to set weight=NA to prevent the use of edge lengths in determining the diameter.
# ancestors <- igraph::ego(mytree, order = d + 1, nodes = leaves, mode = "in")
# ancestors <- sapply(ancestors, names, simplify = FALSE)
# ancestors <- sapply(ancestors, function(a, nodes) which(nodes %in% a),
#                     nodes = nodes, simplify = FALSE)
# names(ancestors) <- leaves
#
# # calculate the class probabilities for all leaf nodes; each leaf node
# # should have a K-dim vector that sums to one; Some nodes may share
# # the same set of K-dim probability vector, others may differ. There are
# # one or more groups of leaf nodes with distinct K-dim probability vectors.
# # Note the branch lengths may also be used here.
# pi_mat <- matrix(NA,nrow=pL,ncol=K)
# for (v in seq_along(leaves)){
#   pi_mat[v,-K] <- colSums(alpha_mat[ancestors[[v]],,drop=FALSE])
#   pi_mat[v,] <- tsb(c(expit(pi_mat[v,-K]),1))
# }
#
# # s = c(1, 1,1,0,0, rep(0,pL)) # effective nodes
# h_pau = rep(1,p)
#
# lotR_example_data_tree <- simulate_lcm_tree(n,itemprob,mytree,pi_mat,h_pau)
# table(lotR_example_data_tree$truth$Z)
#
# #save the simulated data to the R package for illustration:
# save(lotR_example_data_tree, file = "data/lotR_example_data_tree_K2.rda", compress = "xz")


zhenkewu/lotR documentation built on April 24, 2022, 2:36 a.m.