Introduction

Dose response curves are ubiquitous in the biomedical sciences and drug discovery, and conventionally are used in preclinical drug discovery to compare the response of a biological system to a different compounds. Such assays have been extensively validated and optimised to produce highly accurate point estimates of IC50 which can then be correlated with chemical differences between the compounds.

In oncology drug discovery, efforts have also been made to profile compounds in diverse biological models of interest. Here the interest is not so much on explaining the difference in response between compounds, but the difference in response between models with different genetic characteristics. An example of this approach is the profiling of over 200 compounds in ~1000 cancer cell lines the Genomics of Drug Sensitivity in Cancer project carried out at the Wellcome Trust Sanger Centre.

The challenge in this setting is that achieving assay consistency between different cell lines is difficult, and variation in the data results in loss of signal in terms of the effect that genetic covariates have on response to compound. Reproducibility between different projects was raised as an issue in a 2013 Nature paper by Benjamin Haibe-Kains, and subsequently debated here and here.

The pgxsim package provides functions to simulate dose response data given a known genetic effect on drug response. It also provides functions to explore different methodologies to retrieve the genetic association. Together these provide the basis to carry out simulation studies to inform experimental and analysis design.

Set Up

Load libraries

Load required libraries etc:

library(pgxsim)
library(tidyverse)
library(drc)
library(purrr)
set.seed(10001)

About the tidyverse

This vignette makes extensive use of the tidyverse suite of packages and concepts. For an introduction see this video on YouTube or read the Managing Many Models section of Hadley Wickham's R For Data Science book..

Simulating data

Simulation of cell lines with mutation and dose response data

First simulate 100 cell lines with a mutation at 20% frequency using the sim_cell_lines function:

cl_data <- sim_cell_lines(n=100, type='discrete', prop=0.2)
head(cl_data)
cl_data %>% group_by(gene) %>% summarise(count=n())

Each cell line is scored 0 or 1 depending on whether it has the mutation or not. Don't worry about the cl_sd... columns, these provide the ability to make dose response data from some cell lines more noisy than others. Although we haven't provided any information on drug response, pIC50 values are also simulated.

Simulation of pIC50 data

To explore how the pIC50 values are simulated further, let's seperate out this part of the simulation using the sim_pIC50 function as follows:

sim_pIC50(mu=0, sd=0.1, beta=1, g=0, n=10, type='d')

Here we are providing an average value mu, a standard deviation sd, a beta value of 1 which is the size of the effect that we will eventually want to retrieve, and a value for g which determines whether or not the mutation is present.

To see how this works compare the following:

mean(sim_pIC50(mu=0, sd=0.1, beta=1, g=0, n=10, type='d'))
mean(sim_pIC50(mu=0, sd=0.1, beta=1, g=1, n=10, type='d'))

The difference between the first (wild type) and second (mutant) set of cell lines is approximately 1 as expected from the value of beta.

To include the values in the cell line data frame we can use the function as follows:

pIC50_data <- cl_data %>% 
  dplyr::mutate(pIC50 = sim_pIC50(mu=-1, sd=0.5, beta=1, g=gene, n=nrow(cl_data), type='d'))
pIC50_data

We can check the difference between cell lines:

pIC50_data %>% group_by(gene) %>% summarise(mean(pIC50))
ggplot(pIC50_data, aes(y=pIC50, x=as.factor(gene))) +
  geom_boxplot(aes(colour=as.factor(gene))) + theme_bw()
lm(pIC50 ~ as.factor(gene), data=pIC50_data)

Alternatively this can be done in one step using the sim_cell_lines function. The dataset will of course be slightly different as this is a new simulation!:

pIC50_data2 <- sim_cell_lines(n=100, type='d', prop=0.2, mu=-1, sd=0.5, beta=1)
pIC50_data2 %>% group_by(gene) %>% summarise(mean(pIC50))
lm(pIC50 ~ as.factor(gene), data=pIC50_data2)

To summarise, thus far we have simulated 100 cell lines with a mutation that exists at 20% frequency. We have simulated pIC50 values for these cell lines given that the mutation confers a (average) 10 fold difference on sensitivity to drug (10 fold is 1 on a log10 scale).

Simulation of cell lines with gene expression and dose response data

In addition to simulating cell lines with a discrete genetic feature such as a mutation, we can also simulate cell lines with a continuous genetic feature such as gene expression. In this case we simulate both the (standardised) genetic feature value and the pIC50 at the same time as a multivariate distribution, and specify a value of Pearson's Correlation Coefficient to indicate how correlated the two are:

pIC50_exp_data <- sim_cell_lines(n=500, type='c', mu=0, sd=0.1, prho=sqrt(0.6))
pIC50_exp_data

The relationship can be visualised below:

ggplot(pIC50_exp_data, aes(gene, pIC50)) + 
  geom_point() +
  geom_smooth(method='lm') +
  theme_bw()

We can extract the slope parameter and (squared) correlation coefficient by fitting a linear model:

lm(pIC50~gene, data=pIC50_exp_data) %>% summary()

In this case the linear regression coefficient is less immediately useful since the genetic feature data is standardised. We are interested in whether it is non-zero and what the estimate of the correlation coefficient is.

Simulation of dose response data

We now need to simulate the experimental data itself, namely the response seen when the cell line is treated with compound at different concentrations. This is done in the same way regardless of how the pIC50 values were generated. The sim_dose_response function can be used to do this as follows:

ex_dr1 <- sim_dose_response(pIC50=0, minconc=0.001, maxconc=30, 
                            ndoses=10, nreps=1, sd_prop=0, sd_add=0)
ex_dr1

Here we specify the pIC50 value, the lowest concentration tested minconc, the highest concentration tested maxconc, the number of doses ndoses, the number of replicates per dose nreps, and the additive and proportional variance sd_prop and sd_add. Since we specified zero variance, we get the perfect dose response curve with an IC50 of 1 (pIC50=0):

ggplot(ex_dr1, aes(x=conc, y=resp)) + geom_point() + geom_line() + scale_x_log10() + theme_bw()

By changing sd_add and sd_prop we can introduce experimental noise into the data:

ex_dr2 <- sim_dose_response(pIC50=0, minconc=0.001, maxconc=30, 
                            ndoses=10, nreps=3, sd_prop=0.1, sd_add=0.1)
ggplot(ex_dr2, aes(x=conc, y=resp)) + geom_smooth(se = FALSE, method='loess') + 
  geom_point() + scale_x_log10() + theme_bw()

Simulating data for multiple cell lines

Data for the first 10 cell lines we simulated earlier can be generated using the purrr::map function to generate a nested data frame as follows:

dr_data <- pIC50_data %>%
  dplyr::filter(cell_id <= 10) %>%
  dplyr::mutate(data=purrr::map(pIC50, ~sim_dose_response(pIC50=., 
                            minconc=0.001, maxconc=30, ndoses=10, 
                            nreps=3, sd_prop=0.1, sd_add=0.1)))
dr_data
dr_data$data[[1]]

This data frame can be unnested and the data plotted:

dr_data_unnested <- tidyr::unnest(dr_data)
dr_data_unnested
ggplot(dr_data_unnested, aes(x=conc, y=resp)) + 
  geom_point() +
  scale_x_log10() +
  facet_wrap(~cell_id) +
  theme_bw()

Simulating a full dataset

The functions that pgxsim provides are controlled by a variety of parameters, and we would like to explore the effect that these parameters have when trying to retrieve the genetic association. To start with we want to specify a data frame containing all of the combinations of parameters that we wish to explore. The tidyr::crossing function is useful to achieve this, below we set up four simulations with different values for the proportional and additive variance, but the same principle can be applied to vary other parameters.

param_df <- tidyr::crossing(minconc=0.001, maxconc=30, ndoses=10, 
                            nreps=3, sd_prop=c(0.001, 0.2), 
                            sd_add=c(0.001, 0.2)) %>%
    dplyr::mutate(sim_no=row_number())
param_df

Next this data frame is crossed with the pIC50 data frame from earlier to give four simulations for each cell line:

simulation_df <- tidyr::crossing(pIC50_data, param_df)
simulation_df

Finally we use the data in the data frame as parameters for the `sim_dose_response function:

simulation_df <-  simulation_df %>% dplyr::mutate(
  dr_data=purrr::pmap(.l=list(pIC50, minconc, maxconc, ndoses, nreps, sd_prop, sd_add),
                      .f=sim_dose_response)) %>%
  dplyr::select(sim_no, cell_id, gene, pIC50, sd_prop, sd_add, dr_data)
simulation_df %>% dplyr::select(sim_no, cell_id, dr_data)

The data for the first cell line is plotted below:

simulation_df %>%
  dplyr::filter(cell_id==1) %>%
  tidyr::unnest() %>%
  ggplot(aes(x=conc, y=resp)) + geom_point() + scale_x_log10() + facet_wrap(~sim_no)

Estimating pIC50 values from simulated data

Using drc to estimate IC50's one-at-a-time

The drc package contains useful functions to fit conventional Hill curves to dose response data. For example:

fit <- drc::drm(resp~conc, fct=LL.4(), data=ex_dr2) 
plot(fit)
log10(ED(fit, 50, display = FALSE))

This can be applied across multiple dose response datasets using purrr::map again:

drc_results <- simulation_df %>%
  dplyr::mutate(fit=purrr::map(dr_data, ~drc::drm(resp~conc, fct=LL.4(), data=.)),
                drc_pIC50=purrr::map_dbl(fit, ~log10(ED(., 50, display=FALSE)[1,'Estimate'])))
drc_results %>% dplyr::select(-dr_data, -fit)

Plotting the derived pIC50s against the original pIC50s shows how much the different amounts of experimental variability affects the accuracy of the curve fit:

ggplot(drc_results, aes(x=pIC50, y=drc_pIC50, colour=as.factor(gene)))+
  geom_point() +
  theme_bw() +
  facet_wrap(~sim_no) +
  geom_abline(slope=1, intercept=0, colour='blue')

Using non-linear least squares to estimate IC50's one-at-a-time

The minpack.lm package provides functionality to fit any non-linear model to data including Hill curves. For example:

nfit <- minpack.lm::nlsLM(resp~1-conc/(10^(ic50)+conc),
                    data=ex_dr2,
                    start = c(ic50=0))
nfit

Parameters can be extracted from the model using the broom::tidy function:

broom::tidy(nfit)

However this results estimates on a log-normal scale, so the convenience nls_extract function can be used instead to return estimates on a log10 scale.

nls_extract(nfit)

This can be applied across multiple dose response datasets using purrr::map again to apply the convenience nls_fit function which carried out the fit above:

nls_results <- simulation_df %>%
  dplyr::mutate(fit=purrr::map(dr_data, nls_fit))
nls_results

The pIC50 values obtained can be extracted by mapping the nls_extract function to the fit for each cell line:

nls_pIC50s <- nls_results %>%
  dplyr::mutate(res=purrr::map(fit, nls_extract)) %>%
  dplyr::select(-dr_data,-fit) %>%
  tidyr::unnest() 
nls_pIC50s

Finally the estimated vs actual values can be plotted:

ggplot(nls_pIC50s, aes(x=pIC50, y=nls_pIC50, colour=as.factor(gene))) +
  geom_point() +
  theme_bw() +
  facet_wrap(~sim_no) +
  geom_abline(slope=1, intercept=0, colour='blue')

Using mixed effects models to estimate IC50's simultaneously

Rather than estimating the IC50 values one by one, it is possible to estimate them in a single model fit using a non-linear mixed effects model (REF nlme). For more information on this approach see a paper by Vis et al 2016 (REF). Essentially the benefit here is that information on variability can be shared across cell lines which helps improve the accuracy of the fit.

Firstly, the nesting needs to be adjusted so that there is one row per simulation rather than one row per cell line. This means that the data frame in the data list column represents data for all cell lines in that simulation which is what the nlme_fit function requires:

nlme_data <- simulation_df %>%
  tidyr::unnest() %>%
  dplyr::group_by(sim_no, sd_prop, sd_add) %>%
  tidyr::nest()

nlme_data

The fit can then be done using the convenience function nlme_fit:

nlme_res <- nlme_data %>%
  dplyr::mutate(fit=purrr::map(data, nlme_fit))
nlme_res

The pIC50 values obtained can be extracted using the convenience function nlme_extract and merged with the original pIC50 values:

nlme_pIC50s <- nlme_res %>%
  dplyr::mutate(res=purrr::map(fit, nlme_extract)) %>%
  dplyr::select(-data,-fit) %>%
  tidyr::unnest() %>%
  dplyr::inner_join(pIC50_data, by='cell_id') %>%
  dplyr::select(-cl_sd_prop, -cl_sd_add)
nlme_pIC50s

Finally the estimated vs actual values can be plotted:

ggplot(nlme_pIC50s, aes(x=pIC50, y=nlme_pIC50, colour=as.factor(gene))) +
  geom_point() +
  theme_bw() +
  facet_wrap(~sim_no) +
  geom_abline(slope=1, intercept=0, colour='blue')

Estimating genetic effects

Using estimated IC50s in a linear regression

Original data

Earlier in the simulation we defined the genetic covariate beta to equal 1, so a linear regression comparing mutated and wild type gives an estimate of around 1:

lm(pIC50 ~ gene, data=pIC50_data)
ggplot(pIC50_data, aes(x=as.factor(gene), y=pIC50)) +
  geom_boxplot() +
  theme_bw()

drc pIC50 estimates

To calculate the value of the genetic covariate from the pIC50 values estimated by drc, we can nest the drc_results data frame by simulation and carry out the linear regression on each simulation. We then use broom::tidy to extract the genetic covariate estimate from the linear fit:

drc_lm <- drc_results %>%
  dplyr::select(-fit, -dr_data) %>%
  dplyr::group_by(sim_no, sd_prop, sd_add) %>%
  tidyr::nest() %>%
  dplyr::mutate(lm_fit=purrr::map(data, ~lm(drc_pIC50 ~ gene, .)),
                lm_res=purrr::map(lm_fit, broom::tidy)) %>%
  dplyr::select(-lm_fit, -data) %>%
  tidyr::unnest() %>%
  dplyr::filter(term=='gene')
drc_lm

We see that our estimate of the genetic covariate is less precise in the simulations where there is more noise in the data.

nls pIC50 estimates

We can carry out a similar analysis to that in the previous sections on the nls results:

nls_lm <- nls_pIC50s %>%
  dplyr::group_by(sim_no, sd_prop, sd_add) %>%
  tidyr::nest() %>%
  dplyr::mutate(lm_fit=purrr::map(data, ~lm(nls_pIC50 ~ gene, .)),
                lm_res=purrr::map(lm_fit, broom::tidy)) %>%
  dplyr::select(-lm_fit, -data) %>%
  tidyr::unnest() %>%
  dplyr::filter(term=='gene')
nls_lm

The nls followed by lm procedure does a good job of estimating the beta coefficient, although more simulations would be necessary to confirm that this is always true.

nlme pIC50 estimates

We can carry out a similar analysis to that in the previous sections on the nlme results:

nlme_lm <- nlme_pIC50s %>%
  dplyr::group_by(sim_no, sd_prop, sd_add) %>%
  tidyr::nest() %>%
  dplyr::mutate(lm_fit=purrr::map(data, ~lm(nlme_pIC50 ~ gene, .)),
                lm_res=purrr::map(lm_fit, broom::tidy)) %>%
  dplyr::select(-lm_fit, -data) %>%
  tidyr::unnest() %>%
  dplyr::filter(term=='gene')
nlme_lm

Once again estimates around the expected value are obtain, but multiple simulations with the same parameters would have to be carried out to get an estimate of how accurate the estimates are on average.

Including the genetic effect in the mixed effects model

In the previous section, recovering the genetic effect was a two stage process of first estimating pIC50 and then using this in a linear regression. However, this approach discards the variability associated with the pIC50 point estimates. This uncertainty can be retained by instead including the genetic covariate in the mixed effects model itself.

The function nlme_gene_fit is used to fit this model:

nlme_gene_res <- nlme_data %>%
  dplyr::mutate(fit=purrr::map(data, nlme_gene_fit))
nlme_gene_res

Since the genetic covariate is included in the model, the estimate can be retrieved directly using broom::tidy:

nlme_gene_res %>%
    dplyr::mutate(res=map(fit, broom::tidy, effects='fixed')) %>%
    dplyr::select(-data, -fit) %>%
    tidyr::unnest() %>%
    dplyr::filter(grepl('b', term)) 

However the estimate is on the log-normal scale, so the nlme_gene_extract function can be used to carry out the appropriate conversions:

nlme_gene_res %>%
    dplyr::mutate(res=map(fit, nlme_gene_extract)) %>%
    dplyr::select(-data, -fit) %>%
    tidyr::unnest()

Using this approach, the estimate of the genetic effect is near the value obtained with lm for all simulations, even where a lot of experimental variability is present.

Next Steps

This vignette explains the basic workings and concepts of the pgxsim package. See the Simulations vignette to understand how to use the package to carry out larger scale simulations.

Session Info

sessionInfo()


chapmandu2/pgxsim documentation built on May 6, 2019, 10:13 a.m.