3. Confidence Intervals
In YAPSA: Yet Another Package for Signature Analysis

library(BiocStyle)

library(YAPSA)
library(Biostrings)
library(BSgenome.Hsapiens.UCSC.hg19)
library(knitr)
opts_chunk$set(echo=TRUE)
opts_chunk$set(fig.show='asis')

Introduction: Confidence Intervals {#introduction}

In order to assess confidence in the setting of modeling in ordinary differential equations (ODEs), the concept of profile likelihood was introduced [@Raue_Bioinformatics2009]. In YAPSA, this concept was adapted to the computation of confidence intervals (CIs) for the exposures to mutational signatures (@Alex2013). To determine the CI for a computed single value in a high-dimensional vector, this value is perturbed and the remaining values of the vector are computed again, yielding an alternative data model with one degree of freedom less than the initial model. Then, log-likelihoods are computed from the distribution of the residuals of the initial and the alternative model and a likelihood ratio test is being computed.

In the context of mutational signatures, this corresponds to the determination of the CI for the exposure of one given mutational signature exposure. To this end, this exposure value is perturbed, i.e., $H_{uv}$, the exposure to signature $u$ in sample $v$, is changed by a small value $H_{uv} \rightarrow H_{uv} + \epsilon_{uv}$, and the exposures to the remaining signatures are computed again by non-negative least squares, yielding an alternative data model with one degree of freedom less than the initial model. Then, as described above, log-likelihoods are computed from the distribution of the residuals of the initial and the alternative model and a likelihood ratio test is being computed. This yields a p-value for the perturbation, which may need to be extrapolated by a Gauss-Newton method to yield 95% CIs.

Recap: compute signature exposures

In the following section, we briefly recapitulate the analysis of SNV mutational signatures on an example data set as performed in 1. Usage of YAPSA. We thus first load the example data stored in the package:

data(sigs)
data(cutoffs)
data("lymphomaNature2013_mutCat_df")
current_cutoff_vector <- cutoffCosmicValid_abs_df[6,]

We then perform a supervised analysis of SNV mutational signatures using signature-specific cutoffs:

lymphoma_COSMIC_listsList <-
  LCD_complex_cutoff_combined(
      in_mutation_catalogue_df = lymphomaNature2013_mutCat_df,
      in_cutoff_vector = current_cutoff_vector, 
      in_signatures_df = AlexCosmicValid_sig_df, 
      in_sig_ind_df = AlexCosmicValid_sigInd_df)

We assign subgroups to the different samples:

data(lymphoma_PID)
colnames(lymphoma_PID_df) <- "SUBGROUP"
lymphoma_PID_df$PID <- rownames(lymphoma_PID_df)
COSMIC_subgroups_df <- 
  make_subgroups_df(lymphoma_PID_df,
                    lymphoma_COSMIC_listsList$cohort$exposures)

And finally plot the obtained result:

cap <- "Exposures to SNV mutational signatures"

exposures_barplot(
  in_exposures_df = lymphoma_COSMIC_listsList$cohort$exposures,
  in_signatures_ind_df = lymphoma_COSMIC_listsList$cohort$out_sig_ind_df,
  in_subgroups_df = COSMIC_subgroups_df)

Compute the confidence intervals

In order to assess trustworthiness of the computed exposures, YAPSA provides the calculation of CIs. Analogously to CIs for SNV mutational signatures, the CIs for Indel mutational signatures are computed using the concept of profile likelihood. This is performed by the function variateExp().

cap <- "Confidence interval calculation for exposures to mutational 
        signatures"

complete_df <- variateExp(
  in_catalogue_df = lymphomaNature2013_mutCat_df,
  in_sig_df = lymphoma_COSMIC_listsList$cohort$signatures,
  in_exposures_df = lymphoma_COSMIC_listsList$cohort$exposures,
  in_sigLevel = 0.025, in_delta = 0.4)

Of note and as opposed to the output of the LCD function family, the result of the function variateExp() is a data frame in a long format, because for every combination of a signature and a sample, several values now have to be stored:

head(complete_df, 12)

Here, the column exposure contains the values which had been computed before. The columms relLower and relUpper contain the factors with which to multiply the exposures in order to get the lower and upper bounds of the 95% CIs. The absolute values of these lower and upper bounds are stored in the columns lower and upper.

There also is a custom function to plot exposures with confidence intervals:

plotExposuresConfidence(
  in_complete_df = complete_df, 
  in_subgroups_df = COSMIC_subgroups_df,
  in_sigInd_df = lymphoma_COSMIC_listsList$cohort$out_sig_ind_df)

This produces a figure similar to the display of exposures obtained above, but in contrast to this former way of displaying signature exposures by stacked barplots, here we chose a facet plot with the signatures as rows in order to be able to display the CIs, which are indicated as whiskers. We furthermore would like to emphasize that if a signature is not present in a sample, i.e., the exposure to that signature is 0, then the upper and lower bounds of the confidence interval are zero as well.

Of note, the functionality to compute 95% CIs for signature exposures is also available for the analysis of Indel mutational signatures, an example is provided in the corresponding vignette.