In fynnwi/intratumormeth: Intratumor DNA Methylation Analysis

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

In this vignette, the package conumee [@hovestadt_conumee_2017] will be used to derive copy number profiles from the methylation intensity data. The resulting data will be further post-processed to facilitate downstream analysis.

library(intratumormeth)
# paths
outputDir <- "~/sfb824/packagepdgfra_output"
inputDir <- "~/sfb824/packagepdgfra_input"

Required data

In addition to sample methylation data, conumee requires methylation data of control samples with a flat genome. This reference data was kindly provided by Dr. Martin Sill from DKFZ. Furthermore, an annotation object has to be passed to conumee which tells the package which genomic regions should be evaluated. By means of the annotation object, a number of detail regions can be specified for which dedicated copy number values will be computed. For this study, all gene loci that are covered by at least 20 CpG probes on the EPIC array will be specified as detail regions.

Since different annotation and reference objects will be used for male and female samples, information about the patient sex is required. This is contained in patientsheet.

Load samplesheet and patientsheet:

samplesheet <- read.csv("~/sfb824/packagepdgfra_input/samplesheet.csv")
samplesheet %>% head()
patientsheet <- read.csv(file.path(inputDir, "patientsheet.csv"))
patientsheet %>% head()

Prepare a custom conumee annotation

The conumee annotation object can be generated by specifying start and stop coordinates of genomic regions for which a detailed output is desired. The start and stop coordinates of these detail regions will be extracted from the EPIC probes annotation.

library(minfi)
library(conumee)
library(IlluminaHumanMethylationEPICanno.ilm10b4.hg19)
annoEpic <- getAnnotation(IlluminaHumanMethylationEPICanno.ilm10b4.hg19) %>% 
  as.data.frame()

This codeblock groups the probes by their associated genes and extracts positions of first and last probe for a given gene.

genesEpic <- annoEpic %>% 
  dplyr::filter(UCSC_RefGene_Name != "") %>% 
  dplyr::mutate(gene = stringr::str_split(UCSC_RefGene_Name, ";")) %>% 
  tidyr::unnest(cols = c(gene)) %>% 
  dplyr::group_by(chr, gene) %>% 
  dplyr::summarise(from = min(pos),
                   to = max(pos),
                   n_probes = length(unique(Name)),
                   contexts = paste(unique(unlist(stringr::str_split(UCSC_RefGene_Group, ";"))), collapse = ";"),
                   .groups = "drop")
genesEpic %>% head()

How many probes are there per gene?

library(ggplot2)
ggplot(genesEpic, aes(x = n_probes)) + 
  geom_histogram(binwidth = 1) + 
  intratumormeth_theme()

Keep only genes with at least 20 probes coverage:

minProbesPerGene <- 20

detailGenes <- genesEpic %>% 
  dplyr::filter(n_probes >= minProbesPerGene) %>% 
  # rename potentially duplicate names by adding the chromosome as a suffix
  dplyr::mutate(gene = dplyr::case_when(gene %in% .$gene[duplicated(.$gene)] ~ paste0(gene, "_", chr),
                          TRUE ~ gene))

detailRegions <- GenomicRanges::GRanges(seqnames = detailGenes$chr,
                         ranges = IRanges::IRanges(start = detailGenes$from, end = detailGenes$to),
                         name = detailGenes$gene,
                         # thick specifies the plotting window for detail plots
                         thick = IRanges::IRanges(start = detailGenes$from - 5000, end = detailGenes$to + 5000),
                         seqinfo = GenomeInfoDb::Seqinfo(genome = "hg19"))
detailRegions

Overall, 12,724 genetic loci are included in this detail regions object.

Check if all regions from Martin Sill's annotation object are also contained:

# this object has name 'anno'
load("~/data/cnv_ref/CNanalysis4_conumee_ANNO.vh20150715.RData")

# show detail region names not contained in my detail regions
anno@detail[!anno@detail$name %in% detailRegions$name, ]

# add those genetic loci to my detail regions
detailRegions <- c(detailRegions, anno@detail[!anno@detail$name %in% detailRegions$name, ])
# check:
anno@detail[!anno@detail$name %in% detailRegions$name, ]

Five genomic regions were not contained, these were added manually.

Store this GRanges object for later use:

dir.create(file.path(outputDir, "copynumber"))
saveRDS(detailRegions, file.path(outputDir, "copynumber", "cnv_detailregions_granges.Rdata"))

Finally, the conumee annotation object can be generated taking our previously created detail regions object as input. Two versions are computed, one with and the other without sex chromosomes.

detailRegionsNoXy <- detailRegions[!(seqnames(detailRegions) %in% c("chrX", "chrY"))]

cnvAnnoEpicXY <- CNV.create_anno(array_type = "EPIC", chrXY = TRUE, detail_regions = detailRegions)
cnvAnnoEpic <- CNV.create_anno(array_type = "EPIC", chrXY = FALSE, detail_regions = detailRegionsNoXy)

cnvAnnoEpic

Save them:

saveRDS(cnvAnnoEpic, file.path(outputDir, "copynumber", "conumee_annotation_epic.Rdata"))
saveRDS(cnvAnnoEpicXY, file.path(outputDir, "copynumber", "conumee_annotation_epic_xy.Rdata"))

gc()

Run conumee

Load the annotation data created above:

annoF <- readRDS(file.path(inputDir, "conumee", "conumee_annotation_epic.Rdata"))
annoM <- readRDS(file.path(inputDir, "conumee", "conumee_annotation_epic_xy.Rdata"))
annoF

Load the reference data required by conumee:

refF <- readRDS(file.path(inputDir, "conumee", "conumee_ref_epic_F_Sill.Rdata"))
refM <- readRDS(file.path(inputDir, "conumee", "conumee_ref_epic_M_Sill.Rdata"))
refM

Now all inputs are prepared and we can run conumee:

run_conumee(outputDir, samplesheet, patientsheet, refF, refM, annoF, annoM)

This writes a CNV.analysis object for each sample into the specified output directory. Since these objects contain a lot of data I will extract the important information to avoid having to load large objects into memory later on.

Genomeplots

Depending on how many detail regions were specified in the conumee annotation (around 12000 in this case), the resulting plot can become unreadable since each detail region name is printed as a label. Therefore, it is possible to manually specify the detail regions that should be printed onto the plot:

detailRegions <- c("MDM4", "MYCN", "TET3", "GLI2", "IDH1", "EPHA3", "PIK3CA", 
"FGFR3/TACC3", "PDGFRA", "TET2", "TERT", "VEGFA", "MYB", "EGFR", 
"HGF", "CDK6", "MET", "KIAA1549/BRAF", "FGFR1/TACC1", "MYBL1", 
"MYC", "CDKN2A/B", "PTCH1", "TET1", "PTEN", "MGMT", "VEGFB", 
"CCND1", "CCND2", "CDK4", "MDM2", "RB1", "TP53", "NF1", "PPM1D", 
"C19MC", "SMARCB1", "NF2")

Generate a CNV genomeplot for every sample organized by patient:

cnv_genomeplots_per_patient(outputDir, samplesheet, detailRegions = detailRegions)

CNV Post-processing

extract detail region CNV
extract segments
homogenize segments
TODO: detect focal events where a detail region copy number is significantly different from its neighborhood

The above run_conumee() call generated a lot of data: 4.7 GB in this case. The is mostly because there is a lot of redundancy in the information that is stored for each sample. To accelerate further analysis and to avoid having to load 4.7 GB into memory, the function postprocess_conumee_files() extracts the essential information and saves it to the output directory.

postprocess_conumee_files(outputDir)

The generated data can now be read in:

segmentCnv <- readRDS(file.path(outputDir, "copynumber", "cnv_segments.Rdata"))
segmentCnv
detailCnv <- readRDS(file.path(outputDir, "copynumber", "cnv_detailregions.Rdata"))
detailCnv