lcm_tree: wrapper function for fitting and summaries
In zhenkewu/lotR: Latent class analysis with Observed Trees in R (lotR)

lcm_tree

R Documentation

wrapper function for fitting and summaries

Description

wrapper function for fitting and summaries

Usage

lcm_tree(
  Y,
  leaf_ids,
  mytree,
  weighted_edge = TRUE,
  Z_obs = NULL,
  ci_level = 0.95,
  get_lcm_by_group = FALSE,
  update_hyper_freq = 50,
  print_freq = 10,
  quiet = FALSE,
  plot_fig = FALSE,
  shared_tau = FALSE,
  hyper_fixed = list(K = 3),
  tol = 1e-08,
  tol_hyper = 1e-04,
  max_iter = 5000,
  nrestarts = 3,
  keep_restarts = TRUE,
  parallel = TRUE,
  log_restarts = FALSE,
  log_dir = ".",
  vi_params_init = list(),
  hyperparams_init = list(),
  random_init = FALSE,
  random_init_vals = list(tau_lims = c(0.5, 1.5), psi_sd_frac = 0.2, phi_sd_frac = 0.2,
    mu_gamma_sd_frac = 0.2, mu_alpha_sd_frac = 0.2, u_sd_frac = 0.2),
  allow_continue = FALSE
)

Arguments

`Y`	An `n` x `J` matrix of binary (0 or 1) measurements, where each of `n` observation may belong to a leaf node; we aim to group `n` subjects into groups that are possibly coarser than the leaf-induced groups, and in each group, we have a homogeneous latent class model
`leaf_ids`	Character vector of length `n`. `leaf_ids[i]` is a string indicating the `leaf_id` for unit `i` (e.g., the sequence type of an E.coli isolate).
`mytree`	A directed `igraph` object. This is a tree representing the relationships among the `leaf_ids`. Every leaf node represents an `leaf_id`, and internal nodes represent leaf_ids subgroups consisting of their leaf descendants. All nodes of `mytree` must have unique names as given by `names(V(mytree))`. The names of the leaves must be equal to the number of unique elements of `leaf_ids`. The vertices of `mytree`, `V(mytree)`, may have an attribute `levels` containing integer values from `1` to `max(V(mytree)$levels)`. In this case, the `levels` attribute specifies groups of nodes that share common hyperparameters `rho[f]`, `tau_1[f]`, and `tau_2[f]`, where `f` is an integer that represents a level. If `V(mytree)$levels` is `NULL`, the default is two levels of hyperparameters: one for all leaf nodes, and one for all internal nodes. NB: this needs to be checked in the `design_tree()` # <———-??
`weighted_edge`	default to `TRUE`, which indicates the model will use the values of branch lengths (`h_pau` of `V(mytree)`).
`Z_obs`	a matrix of two columns, first column is subject ids (of all samples), the second column is a mix of `NA` or integers: `NA` for unknown class memberships and integers for known classes.
`ci_level`	A number between 0 and 1 giving the desired credible interval. For example, `ci_level = 0.95` (the default) returns a 95% credible interval
`get_lcm_by_group`	If `TRUE`, lotR will also return the maximum likelihood estimates of the coefficients for each leaf_ids group discovered by the model. Default is `TRUE`.
`update_hyper_freq`	How frequently to update hyperparameters. Default = every 50 iterations.
`print_freq`	How often to print out iteration number and current value of epsilon (the difference in objective function value for the two most recent iterations).
`quiet`	default to `FALSE`, which prints empirical class probabilities and updates on tau's
`plot_fig`	plot figure about `prob` (the probability of each node diffuse from the parent node, i.e., s_u=1 for using the slab component) and response profile (1st node)
`shared_tau`	logical: `TRUE` for sharing the `tau_1` for `alpha`'s in the same node; similarly for `tau_2` (`gamma`'s). Default is `FALSE`
`hyper_fixed`	Fixed values of hyperprior parameters for `rho`, the probability of a slab component (`s_u=1`). This should be a list with two elements: `a` and `b`, both numeric vectors of length `L`, representing the parameters of the beta prior on `rho` for each level, where `L` is the number of levels. Default is `list(a = rep(1, L), b = rep(1, L), tau_update_levels = c(1,2))` (uniform hyperprior) Other options include specifying included or excluded nodes via e.g., `s_u_zeroset = (1:265)[-c(1)],s_u_oneset = c(1))` to fit a single big LCM (collapsing all nodes except the root node, the first node)
`tol`	Convergence tolerance for the objective function. Default is `1E-8`.
`tol_hyper`	The convergence tolerance for the objective function between subsequent hyperparameter updates. Typically it is a more generous tolerance than `tol`. Default is `1E-4`.
`max_iter`	Maximum number of iterations of the VI algorithm. Default is `5000`. NB: check this number before package submission.
`nrestarts`	Number of random re-starts of the VI algorithm. The restart that gives the highest value of the objective function will be returned. It is recommended to choose `nrestarts > 1`; The default is `3`.
`keep_restarts`	If `TRUE`, the results from all random restarts will be returned. If `FALSE`, only the restart with the highest objective function is returned. ' Default is `TRUE`.
`parallel`	If `TRUE`, the random restarts will be run in parallel. It is recommended to first set the number of cores using `doParallel::registerDoParallel()`. Otherwise, the default number of cores specified by the `doParallel` package will be used. Default is `TRUE`.
`log_restarts`	If `TRUE`, when `nrestarts > 1` progress of each random restart will be logged to a text file in `log_dir`. If `FALSE` and `nrestarts > 1`, progress will not be shown. If `nrestarts = 1`, progress will always be printed to the console. Default is `FALSE`.
`log_dir`	Directory for logging progress of random restarts. Default is the working directory.
`vi_params_init, hyperparams_init`	Named lists containing initial values for the variational parameters and hyperparameters. Supplying good initial values can be challenging, and `lotR()` provides a way to guess initial values based on transformations of latent class model estimates for each individual leaf_ids (see `initialize_tree_lcm()`). The most common use for `vi_params_init` and `hyperparams_init` is to supply starting values based on previous output from `lotR()`; see the `vignette('lotR')` for examples. The user can provide initial values for all parameters or a subset. When initial values for one or more parameters are not supplied, the missing values will be filled in by `initialize_tree_lcm()`.
`random_init`	If `TRUE`, some random variability will be added to the initial values. The default is `FALSE`, unless `nrestarts > 1`, in which case `random_init` will be set to `TRUE` and a warning message will be printed. The amount of variability is determined by `random_init_vals`.
`random_init_vals`	If `random_init = TRUE`, this is a list containing the following parameters for randomly permuting the initial values: `tau_lims` a vector of length `2`, where `tau_lims[1]` is between `0` and `1`, and `tau_lims[2] > 1`. The initial values for the hyperparameter `tau` will be chosen uniformly at random in the range `(tau_init * tau_lims[1], tau_init * tau_lims[2])`, where `tau_init` is the initial value for `tau` either supplied in `hyperparams_init` or guessed using `initialize_tree_lcm()`. `psi_sd_frac` a value between `0` and `1`. The initial values for the auxiliary parameters `psi` will have a normal random variate added to them with standard deviation equal to `psi_sd_frac` multiplied by the initial value for eta either supplied in `hyperparams_init` or guessed using `initialize_tree_lcm()`. Absolute values are then taken for any values of `psi` that are `< 0`. `phi_sd_frac` same as above . `mu_gamma_sd_frac` a value between 0 and 1. The initial values for `mu` will have a normal random variate added to them with standard deviation equal to `mu_sd_frac` multiplied by the absolute value of the initial value for `mu_gamma_sd_frac` either supplied in `vi_params_init` or guessed using `initialize_tree_lcm()`. `mu_alpha_sd_frac` same as above. `u_sd_frac` a value between 0 and 1. The initial value for the node inclusion probabilities will first be transformed to the log odds scale to obtain `u`. A normal random variate will be added to `u` with standard deviation equal to u_sd_frac multiplied by the absolute value of the initial value for `u` either supplied in `vi_params_init` or guessed using `moretrees_init_logistic()`. `u` will then be transformed back to the probability scale.
`allow_continue`	logical, `TRUE` to save results so can continue running the VI updates with the last iteration from the old results.

Value

a list also of class "lcm_tree"

res <- make_list(mod,mod_restarts,mytree,dsgn,prob_est,est_ad_hoc) class(res) <- c("lcm_tree","list")

Examples

### Example workflow of `lotR`
### Zhenke Wu | zhenkewu@gmail.com
### August 04, 2020
### This example has true parameters following the tree structured priors.
rm(list=ls())
library(lotR)

data("lotR_example_edges")
library(igraph)
tr <- graph_from_edgelist(lotR_example_edges, directed = TRUE) # Plot tree
# Here in this example, we use equal weighted edges.

# If needed, first install ggtree
# if (!requireNamespace("BiocManager", quietly = TRUE))
# install.packages("BiocManager")
# BiocManager::install("ggtree")
library(ggtree)
library(ggplot2)
ggtree(tr, ladderize = FALSE, layout = "slanted") +
  geom_tiplab(geom = "label") + geom_nodelab(geom = "label") +
  theme(plot.margin = unit(c(0, 1.5, 0, 0.2), "cm")) +
  coord_cartesian(clip = "off") + scale_y_reverse()

leaves <- names(igraph::V(tr)[degree(tr, mode = "out") == 0])

# set levels: leaves 2, non-leaves 1
igraph::V(tr)$levels <- rep(1,length(igraph::V(tr)))
igraph::V(tr)$levels[match(names(igraph::V(tr)[igraph::degree(tr, mode = "out") == 0]),
                   names(igraph::V(tr)))] <- 2

igraph::E(tr)$weight <- rep(1,length(E(tr)))
###############################################################################
## illustration of lotR on simulated data:
###############################################################################
data("lotR_example_data_tree") # assumes a single pi for all observations.
Y     <- lotR_example_data_tree$Y
Z_obs <- lotR_example_data_tree$Z_obs
curr_leaves <- lotR_example_data_tree$curr_leaves
theta <- lotR_example_data_tree$truth$itemprob
pi_mat <- lotR_example_data_tree$truth$pi_mat
# in the simulation, the 1st class has the highest level of probabilities,
# so we will reorder obtained estimates to match the meaning of classes.
Z     <- lotR_example_data_tree$truth$Z

## check if all leave names are present in the data.
setequal(leaves, unique(curr_leaves))
#> [1] TRUE
knitr::kable(table(curr_leaves))

K     <- nrow(theta)
p     <- length(V(tr))
J     <- ncol(Y)

dsgn0  <- design_tree(Y,curr_leaves,tr,weighted_edge = FALSE,Z_obs)

nrestarts <- 1
# doParallel::registerDoParallel(cores = nrestarts)
# log_dir <- tempdir()
# dir.create(log_dir)
set.seed(345083)
par(mfrow=c(3,3));plot(0,0)
mod0     <- lcm_tree(Y,curr_leaves,tr,
                     weighted_edge = !TRUE,
                     Z_obs         = Z_obs,
                     parallel      = TRUE,
                     hyper_fixed   = list(K=K,a=c(1,1),b=c(1,1),
                                          tau_update_levels = c(1,2),
                                          # should this be set as default?
                                          s_u_zeroset = NULL,
                                          s_u_oneset = c(1)),
                     # hyperparams_init = list(tau_1=matrix(9/4,nrow=2,ncol=K-1),
                     #                         tau_2=array(9/4,c(2,J,K))),
                     vi_params_init = list(prob=rep(0.95,p)),
                     random_init = !TRUE,
                     nrestarts     = nrestarts,
                     quiet         = !TRUE,
                     plot_fig      = TRUE,
                     shared_tau    = FALSE,
                     get_lcm_by_group = TRUE, #<--- check what are the prereq.
                     print_freq    = 10,update_hyper_freq = 50, max_iter = 200,
                     tol           = 1e-7,
                     tol_hyper     = 1e-4,
                     allow_continue = FALSE)#,
#log_restarts =!TRUE,
#log_dir = log_dir)

###############################################################################
# print out summaries of group specific estimates:
###############################################################################
# plot(mod0,layout = "slanted", horizontal = FALSE)
plot(mod0)

print(mod0)

which(mod0$mod$vi_params$prob>0.5)
unique(lotR_example_data_tree$truth$pi_mat)

###############################################################################
# print out performance metrics (only possible when doing simulation)
###############################################################################
res <- mod0

#------------------------------------------------#
#  estimates based on data-driven leaf groups.
#------------------------------------------------#

# 1. RMSE
## truth vs grouped:
# prob_est is with node_select (grouped estimates). NB: need to do individual
# leaf estimates.
# permute the estimated classes to best match the truth in terms of the response
# probability profiles.
itemprob_truth   <- t(lotR_example_data_tree$truth$itemprob)
itemprob_est     <- res$prob_est$theta_collapsed
# we set prob_gamma to be all zero except for the root node;
# so theta_collapsed is just the shared response probability profiles.
permute_est      <- opt_colpermB(itemprob_truth,itemprob_est)
# permute estimated classes to best match truth.

A <- lotR_example_data_tree$truth$pi_mat # leaf level; in truth,
                                       # we have group structure.
B <- res$prob_est$pi[,permute_est,drop=FALSE] #leaf level

rmse_bias_grp <- rmse_bias_AB(A,B)
print(rmse_bias_grp)

## Do the above for separate ad hoc estimates.

# 2. adjusted Rand index (aRI):
true_grouping <- factor(apply(A,1,paste,collapse=""),
                        levels = unique(apply(A,1,paste,collapse="")))
# the true grouping labels are following the order with
# which the leaf nodes were coded in the tree.

# mclust::adjustedRandIndex(true_grouping,res$prob_est$group)

# 3. misclassification
Az <- lotR_example_data_tree$truth$Z
Bz <- apply(res$mod$vi_params$rmat[,permute_est,drop=FALSE],1,which.max)
table(Az,Bz)
sum((Az!=Bz)/length(Az)) # misclassification rate

# 4. coverage (conditional: need to decide if a true group is in the estimated groups)
gtab <- table(true_grouping,res$prob_est$group)
# get the total number of estimated groups:
G <- nrow(gtab)
for (g in 1:G){# for each true group
  if (sum(gtab[g,]>0)==1){ # # did true group get split into multiple estimated groups?
    # if No...
    h <- as.integer(which(gtab[g,]>0)) # which estimated group contains
    # all leaf nodes in true group g:
    if (sum(gtab[,h]>0)==1){ # does this estimated group contain other
      # leaf nodes, not in true group g:
      # if no, then true group g and estimated group h are identical.
      # Proceed with estimating coverage.

      # Did the lotR estimate cover the truth?
      interval_est <- summary(res)[[h]]$est[permute_est,,drop=FALSE]
      v <-  which(as.integer(true_grouping)==g)[1] # pick a leaf node in the group.
      true_curr <- lotR_example_data_tree$truth$pi_mat[v,]
      covered <- (true_curr >= interval_est[,"cil_pi"]) &
        (true_curr <= interval_est[,"ciu_pi"])

      # Did the ad hoc group-specific LCM cover the truth?
      # NB: this is not done yet; we have point estimates,
      # but not yet extracted the posterior sd yet.
      cat("True Group: ", g, "\n")
      print(data.frame(truth = round(lotR_example_data_tree$truth$pi_mat[v,],3),
                       covered=covered))
      cat("----------\n")
    }
  }

}


#--------------------------------------#
# Individual leaf nodes estimates.
#--------------------------------------#

# 1. RMSE
## truth vs individual nodes:
## no node_select - the leaf nodes have distinct estimates of the pie

A <- lotR_example_data_tree$truth$pi_mat # leaf level; in truth,
                                         # we have group structure.
B <-  res$prob_est_indiv$pi[,permute_est,drop=FALSE] #leaf level

rmse_bias_indiv <- rmse_bias_AB(A,B)
print(rmse_bias_indiv)






#
# K = 2; simulate data
#

# # This code is not designed as an example for a particular function,
# # but it simulates the example data set: lotR_example_data_tree_K2
# ## This is the code for fixing K=2 issues.
#
# rm(list=ls())
# library(igraph)
# set.seed(2020)
# n = 3000
# tau   <- c(0.7,0.3)
# itemprob <- rbind(rep(rep(c(0.9, 0.9), each = 1),9),
#                   rep(rep(c(0.1, 0.1), each = 1),9))
#
# data("lotR_example_edges")
# mytree <- igraph::graph_from_edgelist(lotR_example_edges, directed = TRUE)
# # Plot tree
#
# nodes  <- names(igraph::V(mytree))
# leaves <- names(igraph::V(mytree)[degree(mytree, mode = "out") == 0])
# pL = length(leaves)
# p  = length(igraph::V(mytree))
#
# ######
# K = nrow(itemprob)
# # specify the nodes that have non-trivial alpha_u, this was called
# # xi_u, because xi_u = s_u*alpha_u and s_u = 1 if we set it in the simulation.
# alpha_mat = rbind(logit(prob2stick(tau)[-K]),
#                   c(-1),
#                   c(1),
#                   matrix(0,nrow=p-3,ncol=K-1)
# )
#
# # get lists of ancestors for each leaf_ids:
# d <- igraph::diameter(mytree,weights=NA)
# # need to set weight=NA to prevent the use of edge lengths in determining the diameter.
# ancestors <- igraph::ego(mytree, order = d + 1, nodes = leaves, mode = "in")
# ancestors <- sapply(ancestors, names, simplify = FALSE)
# ancestors <- sapply(ancestors, function(a, nodes) which(nodes %in% a),
#                     nodes = nodes, simplify = FALSE)
# names(ancestors) <- leaves
#
# # calculate the class probabilities for all leaf nodes; each leaf node
# # should have a K-dim vector that sums to one; Some nodes may share
# # the same set of K-dim probability vector, others may differ. There are
# # one or more groups of leaf nodes with distinct K-dim probability vectors.
# # Note the branch lengths may also be used here.
# pi_mat <- matrix(NA,nrow=pL,ncol=K)
# for (v in seq_along(leaves)){
#   pi_mat[v,-K] <- colSums(alpha_mat[ancestors[[v]],,drop=FALSE])
#   pi_mat[v,] <- tsb(c(expit(pi_mat[v,-K]),1))
# }
#
# # s = c(1, 1,1,0,0, rep(0,pL)) # effective nodes
# h_pau = rep(1,p)
#
# lotR_example_data_tree <- simulate_lcm_tree(n,itemprob,mytree,pi_mat,h_pau)
# table(lotR_example_data_tree$truth$Z)
#
# #save the simulated data to the R package for illustration:
# save(lotR_example_data_tree, file = "data/lotR_example_data_tree_K2.rda", compress = "xz")

zhenkewu/lotR documentation built on April 24, 2022, 2:36 a.m.