ROCker-Meth (Receiver Operating Characteristic curve analyzER of DNA Methylation data) tool consists of four main modules:

  1. computation of Area Under the Curve (AUC) values from Receiver operating characteristic (ROC) Curve analysis of methylation levels (i.e., beta values) in tumor versus normal samples (compute_AUC)

  2. segmentation of AUC values by a tailored Hidden Markov Model (HMM) (find_dmrs);

  3. estimation of the statistical significance of AUC segments by Wilcoxon-Mann-Whitney (WMW) test on beta values of tumor versus normal samples

  4. identification of sample specific DMRs by Z-score statistics (compute_z_scores).

The package include also two fuctions (write_dmr_output, write_z_scores) to easily produce BED and SEG files to visually inspect discovered DMR using a genome browser.

ROCker-Meth Tutorial

The following tutorial describes the use of ROCker-Meth to identify Differentially Methylated Sites (DMSs) with AUC statistics, Differentially Methylated Regions (DMRs) by a tailored Heterogeneous Hidden Markov Model based algorithm and samples supporting DMRs.

First, we need to install the ROCker-Meth R package, available at https://github.com/cgplab/ROCkerMeth

# install.packages("devtools")
devtools::install_github("cgplab/ROCkerMeth")

Once you have installed the package, you need to load it (with other useful packages)

library(Rockermeth)

Prepare the data

You need to properly prepare DNA methylation data. Required data include:

  1. Table of beta values (percentage) of tumor samples (samples in columns, CpG sites in rows).

  2. Table of beta values (percentage) of normal samples (samples in columns, CpG sites in rows).

  3. Data frame reporting chromosomal and genomic position of each CpG site.

As an example, we will use a subset of the Prostate Adenocarcinoma dataset from TCGA (PRAD). To to the fact that we will use only 20 TP and 10 NT samples, results are markedly different with respect to the original dataset. Subset data have been already prepared and available from the ROCkerMeth package.

head(PRAD_TP_subset[, 1:5])
head(PRAD_NT_subset[, 1:5])
head(illumina450k_hg19)

Identification of Differentially Methylated Sites (DMSs) by AUC

ROCker-Meth uses AUC statistics to identify DMSs. It implements a function (compute_AUC) able to perform parallel computing of the AUC for each site. This step takes a while. In case, you can increase ncores parameter to calculate AUCs faster).

PRAD_auc <- compute_AUC(tumor_table = PRAD_TP_subset, control_table = PRAD_NT_subset,
                        ncores = 2, min_samples_frac = 0)
head(PRAD_auc)

min_samples_frac refers to the fraction of samples required at a give site (beta value available: not NA). By setting it to 0, we will calculate AUC only for those sites having complete data (i.e. available beta value for all tumor and normal samples). Now, you can look at the distribution of AUCs and visually estimate the fraction of hypo (AUC < 0.2) and hyper (AUC > 0.8) DMSs.

auc_density <- density(PRAD_auc, bw = 0.2, from = 0, to = 1, na.rm = TRUE)
{plot(auc_density, main="Histogram of AUC", frame.plot = FALSE, col = "black", las = 1)
abline (v = c(0.2, 0.8), col = c("blue", "red"), lty = 2)
text(.1, max(auc_density$y*0.95), "hypo-methylation", cex = .8)
text(.9, max(auc_density$y*0.95), "hyper-methylation", cex = .8)}

You can print and write a table reporting DMS analysis and related annotation:

## calculate average beta value per site in both tumor (TP) and normal (NT) samples
avg_beta_tp <- apply(PRAD_TP_subset, 1, mean, na.rm = TRUE)
avg_beta_nt <- apply(PRAD_NT_subset, 1, mean, na.rm = TRUE)
## create a flag for differential methylation analysis
dms_flag <- rep(NA, length(PRAD_auc))
dms_flag[which(PRAD_auc > 0.8)] <- "hyper"
dms_flag[which(PRAD_auc < 0.2)] <- "hypo"
## create the table
dms_table <- cbind(illumina450k_hg19, avg_beta_nt, avg_beta_tp, PRAD_auc, dms_flag)
head(dms_table)
#write.table(dms_table, file = "Table_DMS.txt", sep = "\t", row.names = FALSE, quote = FALSE)

draw a plot like the following

cor_auc_betadiff <- cor(dms_table$PRAD_auc, (dms_table$avg_beta_tp-dms_table$avg_beta_nt), use = "na")
{smoothScatter(dms_table$PRAD_auc, (dms_table$avg_beta_tp-dms_table$avg_beta_nt), main="Beta difference (TP vs NT) vs AUC ",
              xlab = "AUC", ylab = "Beta difference (TP vs NT)", las = 1, xlim = c(0, 1),
              ylim = c(-1, 1))
text(0.05, .9, paste0("R=", format(cor_auc_betadiff, digits = 2) ))}

and order data by most significant differentially methylated sites

dms_table.srt <- dms_table[order(abs(dms_table$PRAD_auc-0.5), decreasing = TRUE), ]
head(dms_table.srt)

Identification of Differentially Methylated Regions (DMRs) by HMM

To identify Differentially Methylated Regions, ROCker-Meth uses a strategy based on a Heterogeneous Hidden Markov Model that segments AUC scores to classify sites in stretches of hyper-methylation, no differential methylation or hypo-methylation. To do that, we will use the find_dmrs function.

## run segmentation of AUC
PRAD_dmr_table <- find_dmrs(tumor_table = PRAD_TP_subset, control_table = PRAD_NT_subset,
                       auc_vector = PRAD_auc, reference_table = illumina450k_hg19, ncores=2)
head(PRAD_dmr_table)

You can draw a vulcano plot to identify significant DMRs (GSTP1 is highlighted)

{plot(PRAD_dmr_table$mean_beta_diff, -log10(PRAD_dmr_table$fdr), main= "Volcano Plot", log = "y", las = 1,
      frame.plot = FALSE, pch = 16, col = "grey",
      ylab = "-log10(FDR)", xlab = "Avg Beta difference (TP vs NT)", cex = .7)
gstp1 <- which(PRAD_dmr_table$chr == "11" & PRAD_dmr_table$start < 67351000 & PRAD_dmr_table$end > 67351000)
points(PRAD_dmr_table$mean_beta_diff[gstp1], -log10(PRAD_dmr_table$fdr)[gstp1], pch = 16, cex = 1, col = "black")
text(PRAD_dmr_table$mean_beta_diff[gstp1], -log10(PRAD_dmr_table$fdr)[gstp1], "GSTP1", pos = 2)
abline(h = -log10(0.05), lty = 2)}

Now you can perform single sample analysis by Z-score statistics (it takes a while)

PRAD_sample_score <- compute_z_scores(tumor_table = PRAD_TP_subset, control_table = PRAD_NT_subset,
                                      dmr_table = PRAD_dmr_table, reference_table = illumina450k_hg19,
                                      min_size = 6)
head(PRAD_sample_score$z_scores[, 1:5])

Finally, the table of DMRs and sample scores can be written in a bed or a seg file to be viewed with genomic browser, such as IGV.

write_dmr(PRAD_dmr_table, path = "~/PRAD.subset")
write_z_scores(PRAD_sample_score, path = "~/PRAD.subset")

ROCker-Meth with different parameter settings

In our parameter study, we observed that our method is very sensible to changes in F. F is the parameter used to split the total AUC signal variance in two parts: the variance of not differentially methylated state and the variance of differentially methylated states. Here we will see how ROCker-Meth results are affected by the parameter F from its standard value (0.4) to different values. Based on our study, we expect that lowering F will increase specificity at the cost of sensitivity. The opposite situation occurs for larger values of F.

## run ROCker-Meth in stringend mode, F = 0.2 (i.e., ratiosd)
PRAD_dmr_table_hi_specificity <- find_dmrs(tumor_table = PRAD_TP_subset, control_table = PRAD_NT_subset,
                                      auc_vector = PRAD_auc, reference_table = illumina450k_hg19,
                                      ratiosd = 0.2)
## run ROCker-Meth in high sensitivity mode, F = 0.6
PRAD_dmr_table_hi_sensitivity <- find_dmrs(tumor_table = PRAD_TP_subset, control_table = PRAD_NT_subset,
                                      auc_vector = PRAD_auc, reference_table = illumina450k_hg19,
                                      ratiosd = 0.6)
dat_n_seg <- c(dim(PRAD_dmr_table_hi_specificity)[1],
               dim(PRAD_dmr_table)[1],
               dim(PRAD_dmr_table_hi_sensitivity)[1])
names(dat_n_seg) <- c("hi spec (F=0.2)",
                      "standard (F=0.4)",
                      "hi sens (F=0.6)")
barplot(dat_n_seg, border = NA, las = 1, ylab = "number of segments", beside = TRUE)

We can nicely demostrate that modulating F affects sensibility/sensitivity trade-off by looking at the distribution of p-values of DMRs in the three settings.

## to make p-values comparable, we will focus on DMRs of <15 CpG sites in length

nsites <- 15
mybox <- list(-log10(PRAD_dmr_table_hi_specificity$p_value[which(PRAD_dmr_table_hi_specificity$nsites<nsites & PRAD_dmr_table_hi_specificity$state != 2)]),
              -log10(PRAD_dmr_table$p_value[which(PRAD_dmr_table$nsites<nsites & PRAD_dmr_table$state != 2)]),
              -log10(PRAD_dmr_table_hi_sensitivity$p_value[which(PRAD_dmr_table_hi_sensitivity$nsites<nsites & PRAD_dmr_table_hi_sensitivity$state != 2)]))
names(mybox) <- names(dat_n_seg)
{
    par (mar = c(7.5, 4, 4, 2))
    boxplot(mybox, varwidth = TRUE, las = 2, ylab = "-log10(P-value)", frame.plot = FALSE, pch = 19, cex = .6)
}

Identify candidate subtype specific DMRs

Here we will see a strategy that can be used to identify DMRs specific to molecular subtypes. Also in this case we will consider the PRAD dataset. characterized by 8 molecular subytpes (TCGA-PRAD Consortium, Cell 2015).

head(PRAD_molecular_subtypes)
table(PRAD_molecular_subtypes$Subtype)

Subtype specific DMRs can be identified as those showing different Z-score statistics across the different subtypes. First, we match methylation sample IDs with patient ID.

subtypes <- names(table(PRAD_molecular_subtypes$Subtype))
subtype_data <- rep(NA, ncol(PRAD_TP_subset))
for (subtype in subtypes) {
  ix <- which(substr(colnames(PRAD_TP_subset), 1, 12) %in% PRAD_molecular_subtypes$PATIENT_ID[which(PRAD_molecular_subtypes$Subtype == subtype)])
  subtype_data[ix] <- subtype
}
## subtypes in PRAD subset
table(subtype_data)

A reasonable strategy to identify candidates subtype specific DMRs is by using one-way ANOVA test.

## exclude DMRs with not estimated Z-score in all sample
## Attention: it can take a while
anova.res <- rep(NA, nrow(PRAD_sample_score$z_scores))
for (i in 1: length(anova.res)) {
    if (all(is.finite(PRAD_sample_score$z_scores[i, ]))) {
        anova.res[i] <- summary(aov(PRAD_sample_score$z_scores[i, ] ~ subtype_data))[[1]][1,5]
    }
}
## create histogram of ANOVA p-values
hist(-log10(anova.res), 100)

Finally, we can check if candidate subtype specific DMRs are actually specific of one (or more) subtype.

## how many DMRs with p < 1e-5
j <- which(anova.res < 1e-5)
anova.res[j]
## As an exmaple, plot the firsts
{
    par (mfrow = c(2, 2))
    for (i in 1: 4) {
        mymax <- round(max(abs(PRAD_sample_score$z_scores[j[i], ]))/10)*10
        boxplot(PRAD_sample_score$z_scores[j[i], ] ~ subtype_data, varwidth = TRUE, outline = FALSE, ylim = c(-mymax, mymax),
                xlab="",
                frame.plot = FALSE, las = 2, ylab = "RockerSS (Z-score)",
                main = rownames(PRAD_sample_score$z_scores[j, ])[i])
        stripchart(PRAD_sample_score$z_scores[j[i], ] ~ subtype_data, pch = 19, col = "grey60",
                   vertical = TRUE, method = "jitter", add = TRUE)
        grid()
        abline (h = 0, lty = 2)

    }
}

In these boxplots, we see that the first case and third case (left) show DMRs with pronounced hyper-methylation in ERG and SPOP subtypes, second and fourth cases (right) show DMRs that are clearly hypo-methylated in FOXA1 subtype and not in the other subtypes.

Additonal info, code and data

https://github.com/cgplab/ROCkerMeth

Citation

Pan-cancer characterization of differentially DNA methylated regions enabled by ROCker-Meth. Benelli et al., submitted.



cgplab/ROCkerMeth documentation built on March 27, 2022, 9:57 p.m.