Introduction

When carrying out a pharmacogenomics study, the aim is to quantify the impact that genetic features have on the response to a compound. However, a large number of parameters could affect the ability of the experiment to do this, such as the number of cell lines, the frequency of the genetic feature, the size of its effect, and the amount of experimental noise.

In this vignette, a general approach is described which can be scaled up using batch processing to cover many parameter combinations and repeated simulations.

Set Up

Load libraries

Load the required libraries and set the random seed for reproducibility:

library(pgxsim)
library(tidyverse)
library(drc)
library(purrr)
set.seed(10001)

About the tidyverse

This vignette makes extensive use of the tidyverse suite of packages and concepts. For an introduction see this video on YouTube or read the Managing Many Models section of Hadley Wickham's R for Data Science book.

The anatomy of a simulation experiment

Define the simulations

The first step in a simulation experiment is to define all of the simulations to be carried out. Each simulation will have its own set of parameters, and these can be put into a data frame as follows.

Firstly, specify the parameters that will remain fixed:

fixed_df <- dplyr::data_frame(type='discrete', mu=0, lb=-4, ub=Inf, sd=1,     # passed to sim_cell_lines
                       minconc=0.001, maxconc=30, ndoses=10, nreps=3,         # passed to sim_dose_response
                       sd_prop=0.3, sd_add=0.15, n=50)                        # noise (sim_dose_response) and number of cell lines (sim_cell_lines)
fixed_df

Secondly, specify the parameters that will vary and combine using the tidyr::crossing function. In this case we are considering three values of effect size (beta) and two values of mutation frequency (prop):

varying_df <- tidyr::crossing(prop=c(0.1,0.3), beta=log10(c(2,5,10))) %>%
  dplyr::mutate(sim_group=row_number())
varying_df

Finally, combine the two data frames again using the tidyr::crossing function, define how many times each simulation is to be run, and add a unique simulation identifier. This gives a data frame with one row per simulation to be carried out, and columns defining the various parameters.

complete_df <- crossing(fixed_df, varying_df, sim_rep=c(1:3)) %>%
  dplyr::mutate(sim_unique_id=row_number())
complete_df

Simulate cell lines

Next we simulate the cell lines for each simulation we defined in the previous section. The purrr::pmap function can be used to map the parameter column values into the sim_cell_lines function:

sim_cl_data <- complete_df %>%
  dplyr::mutate(cl_data=purrr::pmap(.l=list(n=n, type=type, prop=prop,
                                            mu=mu, sd=sd, beta=beta, 
                                            lb=lb, ub=ub),
                                    .f=sim_cell_lines))
sim_cl_data

This gives a nested data frame with one row per simulation as before, so it is unnested to generate a data frame with one row per simulated cell line. Even with this relatively small experiment, 900 cell lines must be simulated (2 mutation frequencies × 3 effect sizes × 3 repeats × 50 cell lines per simulation)!

sim_cl_data <- sim_cl_data %>%
  tidyr::unnest() %>%
  dplyr::select(-cl_sd_prop, -cl_sd_add)
sim_cl_data %>% dplyr::select(sim_unique_id, prop, beta, cell_id, gene, pIC50)

Simulate dose response data

Next the dose response data is simulated for each cell line, using the same principle of mapping the sim_dose_response function to each row of the cell line data frame using purrr::pmap. This gives a data frame with one row per cell line, and with the dose response data as a data frame in a list-col:

sim_dr_data <- sim_cl_data %>%
  dplyr::mutate(dr_data = purrr::pmap(.l=list(pIC50=pIC50, 
                                              minconc=minconc, maxconc=maxconc, 
                                              ndoses=ndoses, nreps=nreps, 
                                              sd_prop=sd_prop, sd_add=sd_add),
                                      .f=sim_dose_response))
sim_dr_data  %>% dplyr::select(sim_unique_id, prop, beta, cell_id, gene, pIC50, dr_data)

Clean and prepare simulated data

As a final step, unneeded columns are removed and the data frame is nested by the unique simulation id to give one row per simulation again. In this double-nested data frame the cell line information is a data frame list-col, and within this data frame the dose response data is also a data frame list-col.

input_df <- sim_dr_data %>%
  dplyr::select(sim_unique_id, cell_id, gene, pIC50, dr_data) %>%
  dplyr::group_by(sim_unique_id) %>%
  tidyr::nest()

input_df 

The cell line data for the first simulation is extracted below:

input_df$data[[1]]

And the dose response data for the first cell line of the first simulation is:

input_df$data[[1]]$dr_data[[1]]

Applying methods to simulated data

In the overview vignette, four methods are introduced to extract the genetic effect from a dataset: lm_method, nls_lm_method, nlme_lm_method and nlme_gene_method.

Each method is wrapped in its own function which gives an output in a standard format which can then be combined. The methods can be called as follows:

lm_method

lm_method(input_df$data[[1]]) 

nls_lm_method

nls_lm_method(input_df$data[[1]])

nlme_lm_method

nlme_lm_method(input_df$data[[1]])

nlme_gene_method

nlme_gene_method(input_df$data[[1]])

Apply methods to multiple simulations

In the example below, the lm_method can be applied to each simulation in the input_df data frame using purrr::map:

lm_res <- input_df %>%
  dplyr::transmute(sim_unique_id, res=purrr::map(data, lm_method)) %>%
  tidyr::unnest()
lm_res

Apply multiple methods to multiple simulations

To apply all four methods to the data for each simulation, we simply use the same approach for the other method functions:

all_res <- input_df %>%
  dplyr::mutate(lm_res=purrr::map(data, lm_method),
                nlme_lm_res=purrr::map(data, nlme_lm_method),
                nls_lm_res=purrr::map(data, nls_lm_method),
                nlme_gene_res=purrr::map(data, nlme_gene_method)) %>%
  dplyr::select(-data)
all_res

Reformat simulation results

As the outputs are standardised they can be gathered and combined into a single unnested data frame:

all_res_combined <- all_res %>%
  tidyr::gather(func, val, -sim_unique_id) %>%
  tidyr::unnest() %>%
  dplyr::select(-func)
all_res_combined

The original simulation definitions are then merged in:

final_df <- all_res_combined %>%
  dplyr::inner_join(complete_df, by='sim_unique_id')

Visualise simulation results

Finally the results can be visualised, in this case comparing the p-value obtained for each proportion and effect size. However, since there are only 3 repeat simulations for each combination of parameters, it isn't possible to conclude much.

ggplot(final_df, aes(as.factor(round(beta,2)), -log10(test_pval), colour=method)) +
  geom_point(size=rel(3), shape=21, position = position_jitter(width=.2)) +
  facet_grid(prop~n) +
  theme_bw()

Batch Processing

Wrapper function

The analysis described in the last section is wrapped into the do_simulation_type2 function, which takes the data frame of simulation parameters and runs a complete simulation for each row. It is called as follows; a simplified sketch of the pipeline such a wrapper performs is shown after these examples:

#first simulation
do_simulation_type2(complete_df[1,])

#for the first 2 simulations
do_simulation_type2(complete_df[1:2,])
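
For reference, the sketch below illustrates the kind of pipeline such a wrapper performs, assembled from the steps in the previous sections. It is an illustrative outline only (the name my_do_simulation is hypothetical), not the actual implementation of do_simulation_type2:

my_do_simulation <- function(sim_df) {
  sim_df %>%
    # simulate cell lines for each simulation definition
    dplyr::mutate(cl_data=purrr::pmap(.l=list(n=n, type=type, prop=prop,
                                              mu=mu, sd=sd, beta=beta,
                                              lb=lb, ub=ub),
                                      .f=sim_cell_lines)) %>%
    tidyr::unnest() %>%
    # simulate dose response data for each cell line
    dplyr::mutate(dr_data=purrr::pmap(.l=list(pIC50=pIC50,
                                              minconc=minconc, maxconc=maxconc,
                                              ndoses=ndoses, nreps=nreps,
                                              sd_prop=sd_prop, sd_add=sd_add),
                                      .f=sim_dose_response)) %>%
    # nest to one row per simulation and apply each analysis method
    dplyr::select(sim_unique_id, cell_id, gene, pIC50, dr_data) %>%
    dplyr::group_by(sim_unique_id) %>%
    tidyr::nest() %>%
    dplyr::mutate(lm_res=purrr::map(data, lm_method),
                  nls_lm_res=purrr::map(data, nls_lm_method),
                  nlme_lm_res=purrr::map(data, nlme_lm_method),
                  nlme_gene_res=purrr::map(data, nlme_gene_method)) %>%
    dplyr::select(-data) %>%
    # combine the standardised outputs into a single data frame
    tidyr::gather(func, val, -sim_unique_id) %>%
    tidyr::unnest() %>%
    dplyr::select(-func)
}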

More simulations

As we saw before, we really need more repeats of each simulation to understand how each method is performing. The code below defines and runs 20 repeats per simulation:

large_sim_df <- crossing(fixed_df, varying_df, sim_rep=c(1:20)) %>%
  dplyr::mutate(sim_unique_id=row_number())

large_sim_results <- do_simulation_type2(large_sim_df)

Parallel processing

Introduction

The wrapper function makes it easy to define and run larger simulation experiments; however, the time taken to do this becomes limiting. There are many additional parameters to explore, and each simulation would have to be repeated hundreds or thousands of times for a power calculation. The simulation speed is dictated by the number of cell lines in each simulation and by the total number of simulations.
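
One way to gauge this before committing to a larger run is to time a single simulation and extrapolate. The snippet below is purely illustrative; actual timings will depend on the machine:

# time one simulation and extrapolate to the full experiment (illustrative only)
one_sim_time <- system.time(do_simulation_type2(complete_df[1,]))['elapsed']
one_sim_time * nrow(large_sim_df) / 60   # rough serial run time in minutes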

However, since each simulation is independent of the others, parallelisation is the obvious way forward. In this section, the batchtools package is used since this allows the same code to be run in a number of different computing environments, from multicore processing on a workstation or laptop, to 1000s of cores across a cluster or cloud environment.

Define simulations

The same principle as before is used to define the simulations; the difference is that an additional batch column is added which defines how the simulations are grouped together. In this case 16 batches are created to be split across however many cores are available. This number can be varied depending on the number of cores available and the number of simulations to be done.

parallel_sim_df <- crossing(fixed_df, varying_df, sim_rep=c(1:40)) %>%
  dplyr::mutate(sim_unique_id=row_number(),
                batch=sample(1:16, n(), replace = TRUE))
parallel_sim_df 

The subset_apply function is used to split the simulation definition data frame into batches and apply the analysis wrapper function.

subset_apply(k=1, df=parallel_sim_df, my_fun=do_simulation_type2)
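
Conceptually, subset_apply just filters the simulation definitions down to a single batch and passes them to the supplied function. A minimal sketch of that idea (the name my_subset_apply is hypothetical; this is not the package's actual implementation) is:

# illustrative sketch: run the analysis wrapper on the k-th batch only
my_subset_apply <- function(k, df, my_fun) {
  df %>%
    dplyr::filter(batch == k) %>%
    my_fun()
}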

Configure batchtools environment

First define where batchtools is to run. In the code below subdirectories of the temporary directory are created with randomised names - it's important to ensure that there are no special characters in the filepath.

library(batchtools)
bjwork_dir <- file.path(tempdir(), sample(LETTERS, 10) %>% paste(., collapse=''))
bjfiles_dir <- file.path(tempdir(), sample(LETTERS, 10) %>% paste(., collapse=''))
dir.create(bjwork_dir)  # only the work directory is created here; makeRegistry creates the file directory itself

Next the batchtools registry is created, being sure to define the packages that are needed:

reg <- makeRegistry(packages=c('pgxsim', 'tidyverse'), #packages to include
                    work.dir = bjwork_dir,
                    file.dir = bjfiles_dir,
                    make.default=FALSE)

Then configure the parallelisation environment. The code below configures batchtools to run on a macOS or Linux computer, or a single AWS instance. Alternative makeClusterFunctions options are available for the common job schedulers used in HPC environments.

reg$cluster.functions = makeClusterFunctionsMulticore(ncpus = parallel::detectCores())
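
For example, on a cluster running the Slurm scheduler the equivalent step might use makeClusterFunctionsSlurm with a site-specific template file (the template name below is just a placeholder):

# on a Slurm cluster, something like the following would be used instead
# reg$cluster.functions = makeClusterFunctionsSlurm(template = 'slurm.tmpl')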

Run the parallelised analysis

The jobs are created using the batchMap function - each batch id is mapped to the subset_apply function including the analysis wrapper function and simulation specification data frame as additional arguments:

batchMap(reg=reg, fun=subset_apply, k=1:max(parallel_sim_df$batch),
         more.args=list(df=parallel_sim_df, my_fun=do_simulation_type2))

The jobs are then submitted. In a cluster environment the resources argument defines how much resource each job requires, but here it can be left empty:

submitJobs(reg=reg, resources = list())
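
On a cluster, the names in the resources list must match those expected by the scheduler template; a typical (purely illustrative) example might be:

# example only - resource names depend on the scheduler template in use
# submitJobs(reg=reg, resources = list(walltime = 3600, memory = 2048, ncpus = 1))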

The jobs can be monitored until done:

getStatus(reg = reg)
done <- findDone(reg=reg)
job_tab <- getJobTable(reg=reg)
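
Alternatively, batchtools provides waitForJobs to block until all submitted jobs have terminated:

waitForJobs(reg=reg)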

Process the results and tidy up

Once the jobs are all finished the output from each can be retrieved and combined:

parallel_res <- reduceResultsList(reg=reg, ids=findDone(reg=reg))
parallel_res_df <- dplyr::bind_rows(parallel_res) %>%
  dplyr::inner_join(parallel_sim_df, by='sim_unique_id')

Finally, the registry is removed to clean up behind us:

removeRegistry(reg=reg, wait=0)

Visualise the results

data(parallel_res_df)
ggplot(parallel_res_df, aes(as.factor(round(beta,2)), -log10(test_pval), colour=method)) +
  geom_boxplot(outlier.size=0) +
  facet_grid(prop~n) +
  theme_bw()

Now we see more clearly what we would expect, that all methods perform better when the effect size is large and the proportion of cell lines with the genetic mutation is high. The nls_lm method seems to do less well than the mixed-effects based methods.

More about cloud computing and clusters

Amazon Web Services

In our work, we have used high-specification Amazon Web Services instance types to run larger simulations. At the time of writing, the c4.8xlarge instance type with 36 cores and 60 GB of memory offers the best balance of cost-effectiveness and compute capacity for the simulations described here. Larger, more powerful instances are available; for example, the x1.32xlarge instance has 128 cores and 3x the compute capacity but is 8x more expensive, since it is optimised for large in-memory applications and has almost 2 TB of RAM.

It is remarkably simple to provision an EC2 instance with R and RStudio Server installed. Instances can be launched based on community Amazon Machine Images (AMIs) which do 90% of the hard work; all that is then left is to install the pgxsim package and its dependencies. Once set up, the instance can be turned into an AMI for future use, or alternatively the instance can be stopped and started as required. Remember that billing is to the nearest whole hour though!
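
As an illustration, assuming the package is installed directly from its GitHub repository, the final setup step on a fresh instance might be something like:

# install pgxsim and its dependencies on a fresh instance (illustrative)
install.packages(c('tidyverse', 'drc', 'batchtools', 'devtools'))
devtools::install_github('chapmandu2/pgxsim')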

For more information on using Amazon Web Services, see the resources below:

HPC

The ultimate environment to work in is a cluster where potentially thousands of cores might be available. This requires a little more configuration depending on which job scheduler is installed, but batchtools makes this quite straightforward: the only things that need to be modified in the code in this vignette are the makeClusterFunctions... call used and the list of resources in the submitJobs call.

Conclusion

This vignette has described how to run large scale pharmacogenomic simulation studies in a parallelised computing environment.

Session Info

sessionInfo()

