sampler: Main function for model estimation


View source: R/sample_mcmc.R

Description

Estimates restricted latent class models (RLCMs) by MCMC sampling, given a pre-specified upper bound M on the latent state dimension. This function performs MCMC sampling with user-specified options. NB (planned extension):

  1. Add flexibility to specify other parameters as fixed (or partially fixed, such as the rows of Q).

Usage

sampler(dat, model_options, mcmc_options)

Arguments

dat

binary data matrix (rows for observations, columns for dimensions); see the sketch below
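
For orientation, a minimal stand-in for dat (purely illustrative random data, not generated by this package):

    set.seed(1)
    # 200 observations (rows) measured on 100 binary dimensions (columns)
    dat <- matrix(rbinom(200 * 100, size = 1, prob = 0.2),
                  nrow = 200, ncol = 100)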

model_options

A list specifying the assumed model:

  • n The number of observations

  • t_max The maximum (guessed) number of clusters in the data allowed during posterior inference

  • m_max For a model with a pre-specified number of latent states, the number of states; in an infinite-dimension model, the maximum (guessed) latent state dimension during posterior inference (see slice_sampler, to come). Increase this number if the package recommends doing so in a printed message.

  • a_theta, a_psi Beta hyperparameters for the true and false positive rates; each is a 2 by L matrix. If these are provided, the algorithm does not pool information across measurements; it assumes independent priors for all true and false positive rates.

  • pooling_pr Set to TRUE to pool information across theta and psi (in that case do NOT set a_theta or a_psi; instead set a0_TPR, a0_FPR, e0_TPR, f0_TPR, e0_FPR, f0_FPR, the hyperparameters of the priors for the TPRs and FPRs; see the sketch after this list). The hierarchy for the TPRs (and analogously for the FPRs) is:

    (e0_TPR, f0_TPR) --Gamma--> hyper_pseudo_n_TPR; (a0_TPR, hyper_pseudo_n_TPR) --deterministic--> a_theta[1:2, l]; a_theta[1:2, l] --Beta--> theta[l]. (a_theta has duplicated columns, hence the arbitrary column index l.)

  • hyper_pseudo_n_TPR, hyper_pseudo_n_FPR pseudo sample sizes in Beta(hyper_pseudo_n_TPR*a0_TPR, hyper_pseudo_n_TPR*(1-a0_TPR)); setting each to a large number, say 10 or 50, usually works well. Each is assigned a Gamma(e0_XPR, f0_XPR) prior (XPR = TPR or FPR).

  • a0_TPR mean of the Beta prior for the TPRs

  • a0_FPR mean of the Beta prior for the FPRs

    The following four parameters are not needed if hyper_pseudo_n_TPR and hyper_pseudo_n_FPR are specified.

  • e0_TPR, f0_TPR the first and second Gamma parameters for the TPR pseudo sample size hyperparameter hyper_pseudo_n_TPR

  • e0_FPR, f0_FPR the first and second Gamma parameters for the FPR pseudo sample size hyperparameter hyper_pseudo_n_FPR

  • log_v The pre-computed log coefficients of the Mixture of Finite Mixtures model (Miller and Harrison, 2018, JASA), based on log_pk, a character string representing the prior distribution for the number of true clusters, e.g., "function(k) {log(0.1) + (k-1)*log(0.9)}". Compute them with: mfm_coefficients(eval(parse(text=model_options0$log_pk)), model_options0$gamma, model_options0$n, model_options0$t_max+1) (the Dirichlet hyperparameter gamma is called b in the Examples below).

  • the_partition Supply a vector of cluster indicators z here to keep the partition/clustering fixed (not updated). The values of z must be consecutive integers from 1 to t, the number of groups.

  • a_alpha, b_alpha Only for the infinite latent state dimension model: Gamma hyperparameters for the hyperprior on alpha (see slice_sampler, to come).

The following are not updated if provided with fixed values:

  • Q Q matrix

  • theta a vector of true positive rates

  • psi a vector of false positive rates

  • p a vector of latent state prevalences

  • alpha The hyperparameter of Beta(alpha/m_max, 1) (can be set to m_max); for the infinite-dimension model, the Indian Buffet Process (IBP) hyperparameter (can be set to 1).
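
As a point of reference, here is a minimal sketch of the two ways to specify the positive-rate priors. The option names come from the lists above; the numeric values are illustrative assumptions, not recommendations:

    L0 <- 100  # number of measurement dimensions

    # (a) No pooling: independent Beta priors via 2-by-L hyperparameter matrices.
    model_options_indep <- list(
      n       = 200,
      t_max   = 40,
      m_max   = 5,
      a_theta = replicate(L0, c(9, 1)),  # Beta(9, 1) prior for each TPR
      a_psi   = replicate(L0, c(1, 9))   # Beta(1, 9) prior for each FPR
    )

    # (b) Pooling: do NOT set a_theta/a_psi; set the hierarchical-prior
    # hyperparameters instead. Deterministically, a_theta[1:2, l] =
    # hyper_pseudo_n_TPR * c(a0_TPR, 1 - a0_TPR), where hyper_pseudo_n_TPR
    # itself gets a Gamma(e0_TPR, f0_TPR) prior (illustrative values below).
    model_options_pooled <- list(
      n          = 200,
      t_max      = 40,
      m_max      = 5,
      pooling_pr = TRUE,
      a0_TPR = 0.9, e0_TPR = 1, f0_TPR = 1,
      a0_FPR = 0.1, e0_FPR = 1, f0_FPR = 1
    )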

mcmc_options

A list of options for the MCMC algorithm (a minimal sketch follows the list):

  • n_total total number of MCMC iterations

  • n_keep the number of MCMC samples kept for posterior inference.

  • n_burnin the number of burn-in iterations discarded before drawing posterior inference.

  • n_split the number of restricted Gibbs scans used to arrive at a launch state; see split_merge and restricted_gibbs

  • print_mod print intermediate model fitting information (every print_mod iterations)

  • constrained if TRUE, update the Q matrix subject to identifiability constraints; otherwise set to FALSE.

  • block_update_H update the rows of H as a block (if TRUE) or not (if NULL or FALSE; must be NULL/FALSE for slice_sampler, to come).

  • block_update_Q update the columns of Q as a block (if TRUE) or not (if NULL or FALSE; must be NULL/FALSE for slice_sampler, to come). When the columns are block-updated, no identifiability constraint is imposed on Q at any iteration.

  • ALL_IN_ONE_START if TRUE, start with all subjects in one cluster; if FALSE, start from a hierarchical agglomerative clustering (complete linkage) cut to produce floor(t_max/4) clusters. Consider the latter a warm start.

  • MORE_SPLIT controls how the launch state of the partition is reached: if TRUE, bias towards splits when reaching a random launch state; if FALSE, choose a pair (i, j) uniformly and then merge (if they belong to distinct clusters) or split (if they belong to the same cluster).
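
A minimal sketch of an mcmc_options list and the resulting call; the option names follow the list above, and the numeric values are illustrative (see the full worked example in Examples):

    mcmc_options_min <- list(
      n_total   = 200,   # total MCMC iterations
      n_keep    = 50,    # posterior samples kept
      n_burnin  = 100,   # burn-in iterations discarded
      n_split   = 5,     # restricted Gibbs scans to reach a launch state
      print_mod = 10,    # print progress every 10 iterations
      constrained      = TRUE,   # impose identifiability constraints on Q
      block_update_H   = TRUE,
      block_update_Q   = FALSE,
      ALL_IN_ONE_START = FALSE,  # warm start from hierarchical clustering
      MORE_SPLIT       = TRUE
    )
    # out <- sampler(dat, model_options, mcmc_options_min)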

Value

A list of posterior samples for the quantities of interest. Quantities are recorded only if they are not fixed a priori; the element names used in the Examples below include z_samp, z_sci_samp, t_samp, mylist_samp, theta_samp, psi_samp, Q_samp, Q_merge_samp, H_star_samp, and col_merged_H_star_samp (a brief inspection sketch follows).
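
A sketch of inspecting the output, assuming a completed run out <- sampler(...); the element names and apparent dimensions are taken from the Examples below:

    str(out$z_samp)                 # sampled cluster (pseudo-cluster) indicators
    str(out$z_sci_samp)             # sampled scientific-cluster indicators
    str(out$t_samp)                 # sampled number of pseudo-clusters
    str(out$theta_samp)             # sampled true positive rates (L by n_keep)
    str(out$psi_samp)               # sampled false positive rates (L by n_keep)
    str(out$Q_merge_samp)           # sampled merged Q matrices (M by L by n_keep)
    str(out$col_merged_H_star_samp) # sampled column-merged H* matrices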

Examples

## Not run: 
  rm(list=ls())
  library(rewind)
  library(matrixStats)
  library(ars)
  
  #optional packages:
  library(mcclust.ext) # http://wrap.warwick.ac.uk/71934/1/mcclust.ext_1.0.tar.gz
  
  # color palette for heatmaps:
  YlGnBu5   <- c('#ffffd9','#c7e9b4','#41b6c4','#225ea8','#081d58',"#092d94")
  hmcols    <- colorRampPalette(YlGnBu5)(256)
  
  # simulate data:
  L0 <- 100
  options_sim0  <- list(N = 200,  # sample size.
                        M = 3,    # true number of machines.
                        L = L0,   # number of antibody landmarks.
                        K = 8,    # number of true components.
                        theta = rep(0.9,L0), # true positive rates.
                        psi   = rep(0.1,L0), # false positive rates.
                        alpha1 = 1, # half of the people have the first machine.
                        frac = 0.2, # fraction of positive dimensions (L-2M) in Q.
                        #pop_frac = rep(1/K0,K0) # population prevalences.
                        pop_frac = c(rep(2,4),rep(1,4)) # population prevalences.
  )
  
  simu     <- simulate_data(options_sim0, SETSEED = TRUE)
  simu_dat <- simu$datmat
  image(simu_dat, col = hmcols)
  rle(simu$Z)
  
  #
  # visualize truth:
  #
  # plot_truth(simu,options_sim0)
  
  #
  # specifying options:
  #
  
  # model options:
  m_max0 <- 5
  model_options0 <- list(
    n   = nrow(simu_dat),
    t_max  = 40,
    m_max  = m_max0,
    b  = 1, # Dirichlet hyperparameter; called "b" here but also
    # referred to as "gamma" elsewhere in this package.
    #Q  = simu$Q,
    a_theta = replicate(L0, c(9,1)),
    a_psi   = replicate(L0, c(1,9)),
    #theta = options_sim0$theta,
    #psi   = options_sim0$psi,
    alpha   = 0.1,
    frac = 0.2,
    #p_both      = rep(0.5,3),#,c(0.5,0.5^2,0.5^3,0.5^4,0.5^5)
    #p0 = rep(0.5,m_max0), # <--- this seems to make a difference in convergence.
    log_pk = "function(k) {log(0.1) + (k-1)*log(0.9)}"# Geometric(0.1).
    #Prior for the number of components.
  )
  
  # pre-compute the log of coefficients in MFM:
  model_options0$log_v<-mfm_coefficients(eval(parse(text=model_options0$log_pk)),
                                         model_options0$b,
                                         model_options0$n,
                                         model_options0$t_max+1)
  # mcmc options:
  mcmc_options0 <- list(
    n_total = 200,
    n_keep  = 50,
    n_split = 5,
    print_mod = 10,
    constrained = TRUE, # <-- need to write a manual about when these options are okay.
    block_update_H = TRUE,
    block_update_Q = FALSE,
    ALL_IN_ONE_START = FALSE, # TRUE: put all subjects in one cluster;
    # FALSE: warm start from a hierarchical agglomerative clustering
    # (complete linkage) cut to produce floor(t_max/4) clusters.
    MORE_SPLIT = TRUE,
    print_Q    = TRUE,
    init_frac  = c(0.51,0.99)
  )
  
  # run posterior algorithm for simulated data:
  out <- sampler(simu_dat,model_options0,mcmc_options0)
  
  ###############################################################
  ## Posterior summaries:
  ###############################################################
  out0 <- out
  out <- postprocess_H_Q(out0)
  
  #Z_SAMP_FOR_PLOT <- out$z_samp  # <---- use pseudo-indicators for co-clustering.
  # tend to be more granular.
  Z_SAMP_FOR_PLOT <- out$z_sci_samp # <--- use scientific-cluster indicators.
  
  # posterior co-clustering probabilities (N by N):
  comat <- z2comat(Z_SAMP_FOR_PLOT)
  image(1:options_sim0$N,1:options_sim0$N, comat,
        xlab="Subjects",ylab="Subjects",col=hmcols,main="co-clustering")
  for (k in 1:options_sim0$K){
    abline(h=cumsum(rle(simu$Z)$lengths)[k]+0.5,lty=2)
    abline(v=cumsum(rle(simu$Z)$lengths)[k]+0.5,lty=2)
  }
  
  # Summarize partition C:
  ## Approach 0: The maximum a posteriori partition: issue - the space is huge
  ##             (Bell number). Not implemented.
  ## Approach 1: Minimizing the least squared error between the co-clustering indicators
  ##            and the posterior co-clustering probabilities; (Dahl 2006)
  ## Approach 2: Wade - estimate the best partition using posterior expected loss
  ##             by variation of information (VI) metric.
  nsamp_C <- dim(Z_SAMP_FOR_PLOT)[2]
  z_Em    <- rep(NA,nsamp_C)
  
  par(mfrow=c(2,3))
  ## Approach 1:
  for (iter in 1:nsamp_C){
    z_Em[iter] <- norm(z2comat(Z_SAMP_FOR_PLOT[,iter,drop=FALSE])-comat,"F")
  }
  ind_dahl <- which.min(z_Em)
  # plot 1:
  image(1:options_sim0$N,1:options_sim0$N,
        z2comat(Z_SAMP_FOR_PLOT[,ind_dahl,drop=FALSE]),col=hmcols,
        main="Dahl (2006) - least squares to hat{pi}_ij")
  for (k in 1:options_sim0$K){
    abline(h=cumsum(rle(simu$Z)$lengths)[k]+0.5,lty=2)
    abline(v=cumsum(rle(simu$Z)$lengths)[k]+0.5,lty=2)
  }
  
  ## Approach 2 - use the "mcclust.ext" package:
  # process the posterior samples of cluster indicators:
  psm    <- comp.psm(t(Z_SAMP_FOR_PLOT))
  # point estimate using all methods:
  bmf.VI <- minVI(psm,t(Z_SAMP_FOR_PLOT),method="all",include.greedy=TRUE)
  summary(bmf.VI)
  
  # plot 2:
  image(1:options_sim0$N,1:options_sim0$N,
        z2comat(matrix(bmf.VI$cl["avg",],ncol=1)),col=hmcols,
        main="Wade clustering")
  
  for (k in 1:options_sim0$K){
    abline(h=cumsum(rle(simu$Z)$lengths)[k]+0.5,lty=2)
    abline(v=cumsum(rle(simu$Z)$lengths)[k]+0.5,lty=2)
  }
  #heatmap(z2comat(bmf.VI$cl["avg",]),col=hmcols)
  
  # credible sets as defined in Wade and Ghahramani 2017.
  bmf.cb <- credibleball(bmf.VI$cl[1,],t(Z_SAMP_FOR_PLOT))
  
  # plot 3:
  image(1:options_sim0$N,1:options_sim0$N,
        z2comat(matrix(bmf.cb$c.horiz[1,],ncol=1)),col=hmcols,
        main="Wade credible ball - horizontal")
  
  for (k in 1:options_sim0$K){
    abline(h=cumsum(rle(simu$Z)$lengths)[k]+0.5,lty=2)
    abline(v=cumsum(rle(simu$Z)$lengths)[k]+0.5,lty=2)
  }
  #heatmap(z2comat(bmf.cb$c.horiz),col=hmcols)
  
  # plot 4:
  image(1:options_sim0$N,1:options_sim0$N,
        z2comat(matrix(bmf.cb$c.uppervert[1,],ncol=1)),col=hmcols,
        main="Wade credible ball - upper vertical")
  
  for (k in 1:options_sim0$K){
    abline(h=cumsum(rle(simu$Z)$lengths)[k]+0.5,lty=2)
    abline(v=cumsum(rle(simu$Z)$lengths)[k]+0.5,lty=2)
  }
  #heatmap(z2comat(bmf.cb$c.uppervert),col=hmcols)
  
  # plot 5:
  image(1:options_sim0$N,1:options_sim0$N,
        z2comat(matrix(bmf.cb$c.lowervert,ncol=1)),col=hmcols,
        main="Wade credible ball - lower vertical")
  
  for (k in 1:options_sim0$K){
    abline(h=cumsum(rle(simu$Z)$lengths)[k]+0.5,lty=2)
    abline(v=cumsum(rle(simu$Z)$lengths)[k]+0.5,lty=2)
  }
  #heatmap(z2comat(bmf.cb$c.lowervert),col=hmcols)
  
  
  #
  # summarize Q (first need to transpose it)
  #
  # Approach 1: compute the error |QQ' - E[QQ'|Y]|_Frobenius
  # Approach 2: compute the marginal co-activation probability
  #             P(\sum_m Q_ml >= 1, \sum_m Q_ml' >= 1) -
  #             This is invariant to the relabeling of latent states, or cluster labels.
  nsamp_Q <- dim(out$Q_merge_samp)[3]
  EQQ     <- matrix(0,nrow=dim(out$Q_merge_samp)[2],
                    ncol=dim(out$Q_merge_samp)[2]) # L by L.
  EQQ_binary <- EQQ
  # Approach 1:
  for (iter in 1:nsamp_Q){
    A   <- t(out$Q_merge_samp[,,iter])
    EQQ <- (A%*%t(A)+EQQ*(iter-1))/iter # invariant to the row reordering of Q.
    EQQ_binary <- 0+(A%*%t(A)>0)+EQQ_binary
  }
  
  # Approach 2:
  EQQ_percent <- EQQ_binary/nsamp_Q
  image(EQQ_percent,col=hmcols,main="co-activation patterns across L dimensions")
  
  # Approach 1:
  Q_Em <- rep(NA,nsamp_Q)
  for (iter in 1:nsamp_Q){
    A <- t(out$Q_merge_samp[,,iter])
    Q_Em[iter] <- norm(A%*%t(A) - EQQ,"F")
  }
  plot(Q_Em,type="o",main="||QQ'-E[QQ'|Y]||")
  
  # choose the indices minimizing the errors:
  ind_of_Q       <- which(Q_Em==min(Q_Em))
  
  #
  # visualize truth:
  #
  pdf(file.path("inst/example_figure/","bmf_truth.pdf"),width=12,height=6)
  plot_truth(simu,options_sim0)
  dev.off()
  
  #
  # visualize the individual latent states depending on whether Q is known or not.
  #
  if (!is.null(model_options0$Q)){ # <--- Q known.
    H_res     <- matrix(0,nrow=nrow(simu$datmat),ncol=sum(rowSums(model_options0$Q)!=0))
    H_pat_res <- matrix(0,nrow=nrow(simu$datmat),ncol=dim(out$H_star_merge_samp)[3])
    for (l in 1:dim(out$H_star_merge_samp)[3]){
      tmp_mylist <- out$mylist_samp[,l]
      tmp0 <- out$H_star_samp[tmp_mylist[match(out$z_samp[,l],tmp_mylist)],,l]
      #<------ could be simpler!!!
      tmp1 <- tmp0[,colSums(tmp0)!=0,drop=FALSE]
      tmp <- tmp1[,order_mat_byrow(model_options0$Q[rowSums(model_options0$Q)!=0,
                                                    ,drop=FALSE])$ord,drop=FALSE]
      H_pat_res[,l] <- bin2dec_vec(tmp,LOG=FALSE)
      H_res <- (tmp + H_res*(l-1))/l
    }
    apply(H_pat_res,1,table)
    image(f(H_res))
  } else{   # <---- Q unknown.
    #
    # This is the best estimated Q:
    #
    Q_merged <- out$Q_merge_samp[,,ind_of_Q[1]] # just picked one.
    NROW_Q_PLOT <- nrow(Q_merged) # sum(rowSums(Q_merged)!=0)
    Q_PLOT <- f(order_mat_byrow(Q_merged)$res) # t(Q_merged)
    #f(order_mat_byrow(Q_merged[rowSums(Q_merged)!=0,,drop=FALSE])$res)
    image(1:ncol(simu$datmat),1:NROW_Q_PLOT,
          Q_PLOT,
          main="Best Q (merged & ordered)",col=hmcols,
          xlab="Dimension (1:L)",
          ylab="Latent State (1:M)",yaxt="n",cex.lab=1.2)
    for (k in 1:NROW_Q_PLOT){
      abline(h=NROW_Q_PLOT-k+0.5,
             lty=2,col="grey",lwd=2)
    }
    
    #
    # Summarize H* and H*(Z) for individual predictions:
    #
    # If we compare it to H, then the clustering matters too, not
    # just H^*.
    #
    # Approach: just obtain the H samples from those iterations with the best Q
    #            obtained above.
    #
    H_res     <- matrix(0,nrow=nrow(simu$datmat),ncol=sum(rowSums(Q_merged)!=0))
    H_pat_res <- matrix(0,nrow=nrow(simu$datmat),ncol=length(ind_of_Q))
    # columns for the best indices.
    # here ind_of_Q are those that minimized the Q loss.
    for (l in seq_along(ind_of_Q)){
      tmp_mylist <- out$mylist_samp[,ind_of_Q[l]]
      tmp0 <- out$col_merged_H_star_samp[out$z_samp[,ind_of_Q[l]],,ind_of_Q[l]]
      tmp1 <- tmp0[,colSums(tmp0)!=0,drop=FALSE]
      tmp <- tmp1[,order_mat_byrow(Q_merged[rowSums(Q_merged)!=0,,
                                            drop=FALSE])$ord,drop=FALSE]
      H_pat_res[,l] <- bin2dec_vec(tmp,LOG=FALSE)
      H_res <- (tmp + H_res*(l-1))/l
    }
    apply(H_pat_res,1,table)
    pdf(file.path("inst/example_figure/",
                  "H_estimated_marginal_prob_vs_truth.pdf"),width=12,height=6)
    par(mfrow=c(1,2))
    image(f(H_res),col=hmcols,main="estimated marginal prob") # <--- get marg prob.
    image(f(simu$Eta[,order_mat_byrow(simu$Q)$ord]),col=hmcols,main="truth")
    dev.off()
    # issues: the order of the rows of Q at ind_of_Q might be different, so need to order them.
  }
  
  # posterior distribution over the number of pseudo-clusters T: <-- scientific clusters?
  plot(out$t_samp,type="l",ylab="T: #pseudo-clusters")
  
  ## individual predictions:
  pdf(file.path("inst/example_figure/","individual_pred_simulation.pdf"),height=15,width=12)
  par(mar=c(2,8,8,0),mfrow=c(2,1),oma=c(5,5,5,5))
  for (i in 1:nrow(simu_dat)){
    plot_individual_pred(apply(H_pat_res,1,table)[[i]]/sum(apply(H_pat_res,1,table)[[i]]),
                         1:model_options0$m_max,paste0("Obs ", i),asp=0.5)
  }
  dev.off()
  
  # check positive rate estimates:
  pdf(file.path("inst/example_figure/","check_PR_post_vs_prior.pdf"),width=12,height=6)
  par(mfrow=c(5,2))
  for (l in 1:L0){
    par(mar=c(1,1,1,2))
    baker::beta_plot(model_options0$a_psi[1,l],model_options0$a_psi[2,l])
    hist(out$psi_samp[l,],add=TRUE,freq=FALSE)
    legend("topright",legend = l)
    baker::beta_plot(model_options0$a_theta[1,l],model_options0$a_theta[2,l])
    hist(out$theta_samp[l,],add=TRUE,freq=FALSE)
  }
  dev.off()
  
  pdf(file.path("inst/example_figure/","posterior_sci_cluster_number.pdf"),
      width=10,height=6)
  scatterhist(1:ncol(out$z_sci_samp),apply(out$z_sci_samp,2,
                                           function(v){length(unique(v))}),
              "Index","tilde{T}: #scientific clusters")
  dev.off()
  
  # posterior distribution of active/effective dimensions (machines):
  pdf(file.path("inst/example_figure/","posterior_effective_machines_number.pdf"),
      width=10,height=6)
  plot(table(apply(out$col_merged_H_star_samp,c(3),
                   function(x) sum(colSums(x)>0)))/mcmc_options0$n_keep,
       xlab = "Effect Number of Machines",xlim=c(1,model_options0$m_max),
       ylab ="Posterior Probability",
       main="",lwd=4)
  dev.off()
  
  image(f(simu$Q),col=hmcols)
  
  image(f(simu$xi),col=hmcols)
  pred_xi <- 0 + ((out$H_star_samp[out$z_samp[,ind_dahl],,ind_dahl] %*%
                     apply(out$Q_samp[,,ind_dahl],c(1,2),mean)) > 0.5)
  image(f(pred_xi),col=hmcols)
  image(f(pred_xi- simu$xi),col=hmcols)

  
  

## End(Not run)
