This script is aimed at fixing the batch size or the number of features in each batch. We present a simulation study to address this issue. We plot the system time for the different batch sizes (taking into account the parallel overhead and the time taken to complete one topic model).
rm(list=ls()) n.out <- 500 omega_sim <- rbind( cbind( rep(1, n.out), rep(0, n.out)), cbind( rep(0, n.out), rep(1, n.out)), cbind( seq(0.6, 0.4, length.out = n.out), 1- seq(0.6, 0.4,length.out=n.out)) ) dim(omega_sim) K <- dim(omega_sim)[2] barplot(t(omega_sim), col = 2:(K+1), axisnames = F, space = 0, border = NA, main=paste("No. of clusters=", K), las=1, ylim = c(0,1), cex.axis=1.5,cex.main=1.4) freq <- rbind( c(0.1, 0.2, rep(0.70/998, 998)), c(rep(0.70/500,500), 0.2, rep(0.7/498,498), 0.1) ) str(freq) counts <- t( do.call(cbind, lapply(1:dim(omega_sim)[1], function(x) rmultinom(1,1000,prob=omega_sim[x,]%*%freq)))) dim(counts)
We form $20$ batches of $50$ genes each and then for each batch, we fit the topic model in parallel.
L_set <- c(2:50, 60,70,80,90, seq(100,500,10)); tic_toc <- array(0, length(L_set)) for(lnum in 1:length(L_set)){ chunk <- function(x,n) split(x, factor(sort(rank(x)%%n))) chunk_set <- chunk(1:dim(counts)[2],L_set[lnum]); out_time <- system.time(bag_topics <- parallel::mclapply(1:L_set[lnum], function(l) { counts_temp <- counts[,chunk_set[[L_set[lnum]]]]; index <- which(rowSums(counts_temp)==0); if(length(index)!=0){ counts_temp[index,]=1; } out <- maptpx::topics(counts_temp, K=2, tol=0.1); return(out) })) tic_toc[lnum] <- out_time[3]; cat("We assume batch size", L_set[lnum], "\n" ) }
We plot the time taken for computing the topic models for different number of batches.
tictoc <- readRDS("../rdas/tictoc_1000.rds"); plot(tictoc[1:40,1], tictoc[1:40,2], col="red", pch=20, lwd=1) plot(tictoc[,1], tictoc[,2], col="red", pch=20, lwd=1)
Note from the plot it seems that around $15$ batches would be optimal. For very small number of batches, the computational time for each topic model is relatively high. Also for larger number of batches, the overhead of parallel processing is pretty high. So, for $1000$ features, $15$ batches seems optimal choice.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.