design_doubletree: Organize the data around two rooted weighted trees

View source: R/design_doubletree.R

design_doubletreeR Documentation

Organize the data around two rooted weighted trees

Description

Prepare the data for fitting nested latent class models with two separate tree structures (i.e., hierarchies) over two sets of leaf labels. Each observation is associated with a particular combination of label values, one from each set of leaves. For example, we have domain and cause hierarchies in verbal autopsy applications.

Usage

design_doubletree(Y, leaf_ids, mytrees, weighted_edges = c(FALSE, FALSE))

Arguments

Y

N by J binary data matrix; rows for subjects, columns for features; missing entries, if present, are encoded by NAs. Note that the rows of Y will be reordered twice, once according to leaf nodes/missing in tree1, the second time according to the leaf nodes in tree2.

leaf_ids

A list containing two elements. For example, the first can be a vector of character strings for leaf nodes for each observation, representing the leaf nodes in tree1; similarly for the second tree; NA represents for missing leaf info. For each observation, the pair of labels indicates the leaf memberships in the two trees contained in mytrees, respectively. For example, in verbal autopsy (VA) applications, for data in the source domains, we must have both leaf ids observed; for data in the target domain, we can have the leaf id in the first tree (i.e., cause tree) NA, indicating an unobserved cause of death (hence unknown to analysts which leaf in the cause tree should an observation be placed). In this package, we only allow NA for leaf labels in tree1; tree2's leaves represent domains, which must be known when doing domain adaptation. Currently the NA in tree1, if present, can only contain ALL subjects from a single leaf node in tree2. NB: Extensions to deal with CODs that are partially observed in the target domain need additional work...

mytrees

A list of two elements: tree1,tree2; both are igraph objects. They may contain attributes, such as node, edge, edge lengths. (NB: need refinement)

weighted_edges

a vector of logical values, indicating whether to use weighted edges in the two trees; default to c(FALSE,FALSE), i.e., not using weighted edges and assuming the edges in the trees are unit lengths.

Value

A list of three elements that contains preprocessed data that are organized around the two trees; Y,A,leaf_ids_units,leaf_ids,leaf_ids_nodes,ancestors,h_pau,levels,

Also see the arguments of initialize_nlcm_doubletree() for details, because the present function is used to set the stage for initialization.

  • resA2

    • A A matrix of p1 by p1; each column contains some 1s, indicating the node in that column is an ancestor of the node represented in the row. Ancestor of a node include that node itself.

    • A_leaves A matrix of pL1 by p1; A submatrix of A that represents the ancestor matrix but only for leaves

    • leaf_ids A vector of Nintegers; ordered by the leaves as specified by mytrees[[1]]; NA represents missing leaf labels.

    • leaf_ids_units A list of length pL1, each element is a vector of subject ids belonging to each leaf node; the list will be of length pL1+1 if missing leaf labels are present, in which case the final element NA_tree1the ids of the observations with missing leaf labels. NB: can it accommodate subjects from distinct leaf nodes in tree2?

    • leaf_ids_nodes leaf descendants based on the tree1 structure; a list of length p1

    • ancestors a list of length pL1, each element is the vector of ancestors (between 1 and p1; id is among all nodes); NB: what is the mapping between integers and node meanings?

    • edge_lengths a list of length pL1, each element is a numeric vector of lengths of edges in the path from the root node to the leaf. It is computed based on E(mytrees[[1]])$weight. It is NULL if E(mytrees[[1]])$weight is NULL

    • h_pau a numeric vector of length p1; each value is the edge length from a nodeu to its parent (equals 1 if u is a root node). This vector by default is all 1s. If weighted_edge=TRUE, h_pau is set to E(mytree[[1]])$weight, the input edge weights.

    • v_units (leaf_ids) integers; just without the names.

    • subject_id_list a list of length p1; each element is a vector of subject ids belonging to the leaf descendants of node u (internal or leaf node)

  • resB

    • Y A matrix of N by J; binary measurements with rows ordered by leaf groups (leaf_ids[[1]] and leaf_ids[[2]])

    • A A matrix of p2 by p2; each column contains some 1s, indicating the node in that column is an ancestor of the node represented in the row. Ancestor of a node include that node itself.

    • A_leaves A matrix of pL2 by p2; A submatrix of A that represents the ancestor matrix but only for leaves

    • leaf_ids A vector of Nintegers; ordered by the leaves as specified by mytree[[2]];

    • leaf_ids_units A list of length pL2, each element is a vector of subject ids belonging to each leaf node

    • leaf_ids_nodes leaf descendants based on tree2 structure; a list of length p2, each element is a vector of integers (between 1 and pL2; id is only for leaf nodes) indicating the leaf nodes

    • ancestors a list of length pL2, each element is the vector of ancestors (between 1 and p2; id is among all nodes)

    • edge_lengths a list of length pL2, each element is a numeric vector of edge lengths from the root node to the leaf. It is computed based on E(mytrees[[2]])$weight. It is NULL if E(mytrees[[2]])$weight is NULL

    • h_pau a numeric vector of length p2; each value is the edge length from a nodeu to its parent (equals 1 if u is a root node). This vector by default is all 1s. If weighted_edge=TRUE, h_pau is set to E(mytree[[2]])$weight, the input edge weights.

    • v_units (identical to leaf_ids) just without the names.

    • subject_id_list a list of length p2; each element is a vector of subject ids that are in the leaf descendants of node u (internal or leaf node)

  • all_ord A permuted vector of consecutive integers up to N; for each row (i.e., position) in the output resB$Y, the corresponding row in the original input data Y.

NB:

need to consider edge_lengths and weighted under missing labels for tree1.

See Also

lotR::design_tree().

Examples

rm(list=ls())
library(igraph)

library(doubletree)
library(MASS)
library(poLCA)
library(BayesLCA)

# first tree - over causes:
data("example_cause_edges")
cause_tree <- graph_from_edgelist(example_cause_edges, directed = TRUE)

nodes1  <- names(igraph::V(cause_tree))
leaves1 <- names(igraph::V(cause_tree)[igraph::degree(cause_tree, mode = "out") == 0])
rootnode1 <- names(igraph::V(cause_tree)[igraph::degree(cause_tree, mode = "in") == 0])
pL1 <- length(leaves1)
p1  <- length(nodes1)
# set the levels l*_u for nodes in the cause tree; nodes
# in the same level will share a slab variance multiplier tau_l* (times the edge length
# eminating from the parent).
V(cause_tree)$levels <- rep(1,p1) # this is to set every node to the same level.
E(cause_tree)$weight <- rep(1,length(E(cause_tree))) # set equal edge lengths of 1.

# second tree - over domains:
data("example_domain_edges")
domain_tree <- graph_from_edgelist(example_domain_edges, directed = TRUE)
nodes2  <- names(igraph::V(domain_tree))
leaves2 <- names(igraph::V(domain_tree)[igraph::degree(domain_tree, mode = "out") == 0])
rootnode2 <- names(igraph::V(domain_tree)[igraph::degree(domain_tree, mode = "in") == 0])
pL2 <- length(leaves2)
p2  <- length(nodes2)
V(domain_tree)$levels <- rep(1,p2)
E(domain_tree)$weight <- rep(1,length(E(domain_tree)))

mytrees <- list(tree1 = cause_tree, tree2 = domain_tree)

data("X0") # this is the PHMRC data cleaned by `openVA`
Y <- X0

data("cause_ids") # leaf labels for observations when mapped to the first tree.
data("domain_ids") # leaf labels for observations when mapped to the second tree.


leaf_ids <- list(id1 = cause_ids, id2 = domain_ids)
weighted_edges = c(FALSE,FALSE) # For illustrative purposes, the following ignores
                                # the input edge lengths.
dsgn  <- design_doubletree(Y,leaf_ids,mytrees,weighted_edges)

table(dsgn$leaf_ids,useNA="ifany")

resA2 <- dsgn$resA2
resB <- dsgn$resB
sum(resA2$Y-resB$Y,na.rm=TRUE)


#
# leaf labels in tree 1 has missing values; here we take PHMRC data and
# artifically remove ALL the cause-of-death labels (tree1 is cause-of-death tree) for
# a site in India (AP):
#

## The following data creation codes are commented out because it is only needed once
# during package creation:
# cause_ids_except_AP <- cause_ids
# cause_ids_except_AP[which(domain_ids=="AP")] <- NA
# save(cause_ids_except_AP,file="data/cause_ids_except_AP.rda",compress="xz")

data("cause_ids_except_AP") # leaf labels for observations when mapped to the first tree.
leaf_ids2 <- list(id1 = cause_ids_except_AP, id2 = domain_ids)


dsgn2     <- design_doubletree(Y,leaf_ids2,mytrees,weighted_edges)

##table(dsgn2$resA2$leaf_ids,useNA="ifany") # here will list the number of observations
# with missing labels in tree1.

# resA2 <- dsgn2$resA2
# resB <- dsgn2$resB
# sum(resA2$Y-resB$Y,na.rm=TRUE)


# aa = resA$Y[resB$ord[210],] # the row in resA$Y that corresponds to row id 210 in resB$Y
# bb = resB$Y[210,]
#
# setequal(which(is.na(aa)),which(is.na(bb))) & sum(aa[!is.na(aa)]==bb[!is.na(bb)])
#
# aa = resA$Y[7426,] # the row in resA$Y that correspond to a particular row in resB$Y.
# bb = resB$Y[resB$inv_ord[7426],]
# resA$leaf_ids[7426]
# leaf_ids[[2]][resA$ord][7426]
# resB$leaf_ids[resB$inv_ord[7426]]
# setequal(which(is.na(aa)),which(is.na(bb))) & sum(aa[!is.na(aa)]==bb[!is.na(bb)])


#
#
# #
# # TRY FITTING MODELS:
# #
#
# mod <- nlcm_doubletree(Y,leaf_ids2,mytrees,weighted_edge = c(FALSE,FALSE),
#                 ci_level = 0.95,
#                 get_lcm_by_group = FALSE,
#                 update_hyper_freq = 50,
#                 print_freq = 1,
#                 quiet      = FALSE,
#                 plot_fig   = FALSE,
#                 tol        = 1E-8,
#                 tol_hyper = 1E-4,
#                 max_iter = 200,
#                 nrestarts = 1,
#                 keep_restarts = TRUE,
#                 parallel = TRUE,
#                 log_restarts = FALSE,
#                 log_dir = ".",
#                 vi_params_init = list(),
#                 hyperparams_init = list(),
#                 random_init = FALSE,
#                 random_init_vals = list(mu_gamma_sd_frac = 0.2,
#                                         mu_alpha_sd_frac = 0.2,
#                                         tau1_lims = c(0.5,1.5),
#                                         tau2_lims = c(0.5,1.5),
#                                         u_sd_frac = 0.2, # this is for logit of prob1.
#                                         psi_sd_frac = 0.2,
#                                         phi_sd_frac = 0.2),
#                 hyper_fixed = list(K=3, # number of latent classes.
# a2=matrix(1,nrow=length(dsgn$ancestors[[1]]),ncol=max(dsgn$levels[[2]])),
# b2=matrix(10,nrow=length(dsgn$ancestors[[1]]),ncol=max(dsgn$levels[[2]])),
# a1 = rep(1,max(dsgn$levels[[1]])),b1 = rep(10,max(dsgn$levels[[1]])),
# # both (a1,b1),(a2,b2) encourage shrinkage towards the parent.
# dmat = matrix(1,nrow=length(dsgn$ancestors[[1]]),ncol=length(dsgn$ancestors[[2]])),
### (cause,domain).
# s1_u_zeroset = NULL,
# s1_u_oneset = c(1),
# s2_cu_zeroset = NULL,
# s2_cu_oneset = rep(list(1),pL1),
# tau_update_levels = list(1,1)
#                 )
# )

zhenkewu/doubletree documentation built on Oct. 21, 2023, 7:04 a.m.