Overview

This script is aimed at fixing the batch size or the number of features in each batch. We present a simulation study to address this issue. We plot the system time for the different batch sizes (taking into account the parallel overhead and the time taken to complete one topic model).

Data preparation

rm(list=ls())
n.out <- 500
omega_sim <- rbind( cbind( rep(1, n.out), rep(0, n.out)), 
                    cbind( rep(0, n.out), rep(1, n.out)),
                    cbind( seq(0.6, 0.4, length.out = n.out), 
                           1- seq(0.6, 0.4,length.out=n.out)) )
dim(omega_sim)

K <- dim(omega_sim)[2]
barplot(t(omega_sim), 
        col = 2:(K+1),
        axisnames = F, space = 0, 
        border = NA, 
        main=paste("No. of clusters=", K),
        las=1, ylim = c(0,1), cex.axis=1.5,cex.main=1.4)

freq <- rbind( c(0.1, 0.2, rep(0.70/998, 998)),
               c(rep(0.70/500,500), 0.2, rep(0.7/498,498), 0.1) )
str(freq)

counts <- t( do.call(cbind,
                     lapply(1:dim(omega_sim)[1], 
                            function(x) 
                                rmultinom(1,1000,prob=omega_sim[x,]%*%freq))))
dim(counts)

Bagging/ Batching for variable selection

We form $20$ batches of $50$ genes each and then for each batch, we fit the topic model in parallel.

L_set <- c(2:50, 60,70,80,90, seq(100,500,10));
tic_toc <- array(0, length(L_set))
for(lnum in 1:length(L_set)){
chunk <- function(x,n) split(x, factor(sort(rank(x)%%n)))
chunk_set <- chunk(1:dim(counts)[2],L_set[lnum]);

out_time <- system.time(bag_topics <- parallel::mclapply(1:L_set[lnum], function(l)
                           {
                                counts_temp <- counts[,chunk_set[[L_set[lnum]]]];
                                index <- which(rowSums(counts_temp)==0);
                                if(length(index)!=0){
                                  counts_temp[index,]=1;
                                }
                                out <- maptpx::topics(counts_temp,                                                                        K=2,
                                                      tol=0.1);
                                return(out)
                           }))
tic_toc[lnum] <- out_time[3];
cat("We assume batch size", L_set[lnum], "\n" )
}

Batches vs Time of computation plot

We plot the time taken for computing the topic models for different number of batches.

tictoc <- readRDS("../rdas/tictoc_1000.rds");
plot(tictoc[1:40,1], tictoc[1:40,2], col="red", pch=20, lwd=1)
plot(tictoc[,1], tictoc[,2], col="red", pch=20, lwd=1)

Note from the plot it seems that around $15$ batches would be optimal. For very small number of batches, the computational time for each topic model is relatively high. Also for larger number of batches, the overhead of parallel processing is pretty high. So, for $1000$ features, $15$ batches seems optimal choice.



kkdey/varselectpx documentation built on May 20, 2019, 10:42 a.m.