In magrichard/dmprocrdata: A set of data dedicated to the dmprocr R package

knitr::opts_chunk$set(collapse=TRUE, comment = "#>", fig.width=9, fig.height=6, eval=TRUE, echo=FALSE, results="hide")

Introduction

dmprocr (differential methylation profiles clustering in R) is an open-access analysis software (R package). The purpose of this tutorial is to present a general method allowing to unbiasedly identify and characterize sets of genes that are epigenetically (de)regulated in cancer, using integrated bioinformatics and statistic studies of genomic, transcriptomic and epigenomic alterations.

Dataset

The dataset used for this example correspond to transcriptome, cnv and methylome data from the International Cancer Genome Consortium (TCGA) (534 patients with lung adenocarcinoma, PMID:25079552).

For chromosome 19 of each 534 patients, we collected the expression (RNAseq, HTSeq-Counts), gene copy-number (WGS, copy number segment) and methylation data (Infinium 27K et 450K) from the Genomic Commons Data portal.

The dataset is available as an R package called dmprocrdata (https://github.com/bcm-uga/dmprocrdata).

dmprocrdata contains the objects which are detailled below.

TO DO:
MR: Mettre les données dans le package dmrpocrdata.
FCh: done but no commit/push defore to be sure of the dataset.

data_trscr

data_trscr is a data matrix containing HTseq counts for each gene and each sample. Find more information on HTseq analysis pipeline on GDC website.

The rownames of the matrix define the gene_id (corresponding to the rownames of the gene_list BED file), the colnames define the sample ID (corresponding to the rownames of the exp_grp dataframe, i.e. dmprocr_ID)

data_trscr = dmprocrdata::data_trscr
is.matrix(data_trscr)
tail(data_trscr[, 1:5])
dim(data_trscr)

TO DO:
FCh: Some columns (samples) have names starting with number. It will be a poroblem. MR: samples now all start with 'patient_' FCh: in dmprocrdata data_XXX must be a matrix MR: data_XXX are now converted into matrices

data_cnv

data_cnv is a data matrix containing Copy Number Segment values for each gene and each sample. Find more information on Copy Number Segment analysis pipeline on GDC website.

Please note that $$CopyNumberSegment = log2 (CNV/2) $$

data_cnv = dmprocrdata::data_cnv
is.matrix(data_cnv)
head(data_cnv[,1:5])
dim(data_cnv)

TO DO:
FCh: Some columns (samples) have names starting with number. It will be a poroblem.

data_meth

data_meth is a data matrix containing methylation \beta values for each gene and each sample. Find more information on \beta values analysis pipeline on GDC website.

The rownames of the matrix define the probe_id (corresponding to the rownames of the pf_meth file), the colnames define the sample ID (corresponding to the rownames of the exp_grp dataframe, i.e. dmprocr_ID )

data_meth = dmprocrdata::data_meth
is.matrix(data_meth)
head(data_meth[,1:5])
dim(data_meth)

TO DO:
FCh: Some columns (samples) have names starting with number. It will be a poroblem. FCh: in dmprocrdata data_XXX must be a matrix

gene_list

gene_list is a BED file containing gene information.

Please provide a file with the following informations (conserve columns 1 to 7 names and pf_meth).

seqnames: the name of the chromosome.
start : the starting position of the feature in the chromosome.
end : the ending position of the feature in the chromosome.
gene_id : defines the name of the BED line, corresponds to the label of the feature
score : corresponds to a score associated with the feature, optionnal.
strand : defines the strand.
ref_genome : defines the reference genome (optional)

In this dataset, we used the GDC reference genome 'gencode.v22.annotation.gtf.gz' used in RNA-Seq alignment and by HTSeq, corresponding to GRCh38 (hg38).

gene_list = dmprocrdata::gene_list
#gene_list = gene_list[gene_list$gene_type == "protein_coding",]
#gene_list[,5] = gene_list$gene_name
#gene_list = gene_list[,1:7]
#colnames(gene_list) = NULL
head(gene_list)
dim(gene_list)
# rownames(gene_list) = gene_list[,4]
# gene_list = gene_list[1:10,]

TO DO:
FCh: gene_list - could be truncated to the 7 first columns ## MR -> done - with col5 = gene_symbol ## MR -> done but called 'gene_name', is it a problem? - limited to "protein_coding" ??? ## MR -> why? - names DO NOT HAVE TO BE conserved. ## MR -> I don't understand - col4 elements MUST be unique ## -> checked

WARNING: - concistency needs between gene_list[[1]] and pf_meth$Chromosome MR: I replaced '*' in pf_meth by 'chrM' => we need to check that this is OK

MR: pourquoi tu lances la commande rownames(gene_list) = gene_list[,4] , c'est déjà le cas non? ou bien il y a quelque chose qui m'échappe... MR: pourquoi tu enleves les noms de colonne?

pf_meth

pf_meth is a file containing methylation probes informations, generated by DNA methylation pipeline of the GDC, from Illumina Infinium Human Methylation 27 (HM27) and HumanMethylation450 (HM450) arrays.

Composite.Element.REF: the name of the probe.
Chromosome: the name of the chromosome.
Start : the starting position of the feature in the chromosome.
End : the ending position of the feature in the chromosome.
Gene_Symbol : corresponds to genes associated with this probe.

The row names of the file correspond to probes ID.

pf_meth = dmprocrdata::pf_meth
#pf_meth = pf_meth[order(pf_meth$Chromosome, pf_meth$Start), ]
#meth_data = meth_data[rownames(pf_meth),]
head(pf_meth)

TO DO:
MR: verifier quelles colonnes sont utilisées par Paul et les indiquer comme obligatoires dans cette section
FCh: "End" seems to be "Start" + 1 and "Gene_Symbol" embed many gs for a given probe. Id, Chromosome, Position, assembly should be enough. MR: I kept the 3 first columns and renamed them "probe_id", "seqnames", "position"

exp_grp

exp_grp is a dataframe containg patients information.

!/! Please provide a file with at least the following informations (conserve columns 1 to 6 names and order).

!/! Please make sure that you removed all sample replicates from your dataset

dmprocr_ID: the sample ID, a combinaison of patient_ID and sample type information, this corresponds to the unique entry of the dataframe.
patient_ID : the patient ID.
sample : the sample type, here 01 corresponds to tumor and 11 to healthy sample.
trscr : defines if transcription data are present for this sample, 1 corresponds to existing data (in the data_trscr data matrix), and 0 to missing data.
cnv : defines if copy number variation data are present for this sample, 1 corresponds to existing data (in the data_cnv data matrix), and 0 to missing data.
meth : defines if methylation data are present for this sample, 1 corresponds to existing data (in the data_cnv data matrix), and 0 to missing data.

Rownames of the dataframe corresponds to dmprocr_ID.

tissue_status : the tissue type, composed of two factors, "normal" and "patho" (for pathological)

exp_grp = dmprocrdata::exp_grp
head(exp_grp)
dim(exp_grp)

TO DO:
MR: PB, exp_grp est une liste et je ne sais pas comment changer cela

Missing data summary

grid_arrange_shared_legend <- function(..., nrow = 1, ncol = length(list(...)), position = c("bottom", "right")) {

  plots <- list(...)
  position <- match.arg(position)
  g <- ggplot2::ggplotGrob(plots[[1]] + ggplot2::theme(legend.position = position))$grobs
  legend <- g[[which(sapply(g, function(x) x$name) == "guide-box")]]
  lheight <- sum(legend$height)
  lwidth <- sum(legend$width)
  gl <- lapply(plots, function(x) x + ggplot2::theme(legend.position = "none"))
  gl <- c(gl, nrow = nrow, ncol = ncol)

  combined <- switch(position,
                     "bottom" = gridExtra::arrangeGrob(do.call(gridExtra::arrangeGrob, gl),
                                            legend,
                                            ncol = 1,
                                            heights = grid::unit.c(grid::unit(1, "npc") - lheight, lheight)),
                     "right" = gridExtra::arrangeGrob(do.call(gridExtra::arrangeGrob, gl),
                                           legend,
                                           ncol = 2,
                                           widths = grid::unit.c(grid::unit(1, "npc") - lwidth, lwidth)))
  grid::grid.newpage()
  grid::grid.draw(combined)
}

pt = ggplot2::ggplot(exp_grp, ggplot2::aes(sample, fill = as.factor(trscr))) + ggplot2::geom_bar() + ggplot2::ggtitle("trscr data") +
  ggplot2::labs(x = "sample type", y = "# patients", fill = "data") +
  ggplot2::scale_fill_manual(labels = c("missing", "existing"),  values = c("gray", "black"))
pc = ggplot2::ggplot(exp_grp, ggplot2::aes(sample, fill = as.factor(cnv))) + ggplot2::geom_bar() +  ggplot2::ggtitle("cnv data") +
  ggplot2::labs(x = "sample type", y = "# patients", fill = "data") +
  ggplot2::scale_fill_manual(labels = c("missing", "existing"),  values = c("gray", "black"))
pm = ggplot2::ggplot(exp_grp, ggplot2::aes(sample, fill = as.factor(meth))) + ggplot2::geom_bar() + ggplot2::ggtitle("meth data") +
  ggplot2::labs(x = "sample type", y = "# patients", fill = "data") +
  ggplot2::scale_fill_manual(labels = c("missing", "existing"),  values = c("gray", "black"))
grid_arrange_shared_legend(pt, pc, pm, ncol=3 )

## or an improved alternative:

layout(matrix(1:3,1), respect=TRUE)
barplot(table(exp_grp$trscr, exp_grp$sample)[2:1,], col=c("dimgrey", "grey"), main="transcriptomic data", ylab="#patients")
barplot(table(exp_grp$cnv, exp_grp$sample)[2:1,], col=c("dimgrey", "grey"), main="genomic data")
barplot(table(exp_grp$meth, exp_grp$sample)[2:1,], col=c("dimgrey", "grey"), main="methylome data")
legend("topright", c("existing", "missing"), col=c("dimgrey", "grey"), pch=15)

FCh: What does "11" and "01" means?

Method

dmprocr performs a 4-steps analysis:

Identification of differentially expressed region
Differential methylation analysis
Comparison and classification of differential methylation (DM) profiles
Enrichment analysis

This is a tutorial describing how to use dmprocr package.

TO DO: Les noms des fonctions ne sont pas homogènes. Je ne sais pas s il faut corriger cela...

TO DO: Les examples des fonctions de Paul (étapes 2 et 3) sont réalisés à partir de la fonction dmRandomDataset. Il faudra tout mettre à jour avec la fonction generate_fakestudy qui crée une veritable étude simulée avec les critères que l on avait définis.

library(dmprocr)

Identification of differentially expressed region

TO DO: Implementer 3.1 et 3.2 avec des données de puces (Flo?)

TO DO: Optimiser le calcul des étapes 3.1 et 3.2. DEseq est long, je sais pas si on peut faire autrement. Pour la regression linéaire, je pense qu il y a moyen de faire beaucoup plus vite par contre.

Normalisation of the expression data

Using DESeq2 method, we are normalizing expression data.

dir.create("tmp", showWarnings=FALSE)
if (!file.exists('tmp/dmprocrdata_norm_trscr.rds')){
norm_trscr = RNAseq_normalization(
  data_trscr = data_trscr         , 
  exp_grp = exp_grp               , 
  gene_list = gene_list           , 
  alpha = 0.05                    , 
  contrast = c("tissue_status","patho","normal")
)
saveRDS(norm_trscr, 'tmp/dmprocrdata_norm_trscr.rds')
}
norm_trscr = readRDS('tmp/dmprocrdata_norm_trscr.rds')

data_ntrscr = norm_trscr$data_ntrscr
dim(data_ntrscr)

TO DO: RNAseq_diffAnalysis could be fast - using paralle option of DESeq (to do) - using a other test that retuirn fc and pval. (to be discussed)

Global method

Using linear regression, we are selecting genes that display a significant variation of expression in tumour cells, without any change in DNA copy number. This step allows to select gene whose observed deregulation cannot be explained by structural variation and which are putative candidates for an epigenomic deregulation.

This method enables selection of candidate regions based on 3 criteria:

the z-score of the linear regression
the log2FoldChange of indivuals without cnv in the region
the log2FoldChange calculated by DEseq

This criteria are present in the dmprocr$result_list

Linear regression counts ~ cnv

if (!file.exists("tmp/dmprocrdata_result_lin_reg.rds")){
  result_lin_reg = RNAseq_cnv_reg (
    data_ntrscr = data_ntrscr    ,
    data_cnv = data_cnv                      , 
    exp_grp = exp_grp                        , 
    apply_func = epimedtools::monitored_apply,
    gene_list = gene_list
  )
  saveRDS(result_lin_reg, "tmp/dmprocrdata_result_lin_reg.rds")
}

result_lin_reg  = readRDS("tmp/dmprocrdata_result_lin_reg.rds")
head(result_lin_reg)

#plot the result
# padj_thresh = 0.001
# size_vec = result_lin_reg$padj < padj_thresh
# size_vec= as.vector(as.factor(size_vec))
# size_vec[size_vec == FALSE] = "0.5"
# size_vec[size_vec == TRUE] = "1.5"
# size_vec[is.na(size_vec)] = "0"
# z_score_thresh = 0.3
# col_vec = result_lin_reg$z_score > (-z_score_thresh) & result_lin_reg$z_score < z_score_thresh 
# ggplot2::ggplot(result_lin_reg, ggplot2::aes(y=-log10(padj), x=log2FoldChange , size = as.factor(size_vec), colour = col_vec )) + ggplot2::geom_point() + 
#  ggplot2::scale_colour_manual(labels = c("ns","0.3<zscore<0.001", "na"), values = c("grey","red", "lightgray")) +
#  ggplot2::geom_point() + ggplot2::geom_hline(yintercept = -log10(padj_thresh)) +
#   ggplot2::scale_size_manual(labels = c("ns", "NA","padj<0.001"), values = c(0,0, 2)) 
# 


# anadiff_res = dmprocrdata$result_list
# col = rep(1, nrow(anadiff_res))
# names(col) = rownames(anadiff_res)
# idx = rownames(result_lin_reg)[abs(result_lin_reg$z_score) < z_score_thresh]
# col[idx] = 2
# 
# layout(1, respect=TRUE)
# plot(density(result_lin_reg$z_score, na.rm=TRUE))
# plot(anadiff_res$log2FoldChange, -log10(anadiff_res$padj),
#   pch=16, col=adjustcolor(col, alpha.f=0.1),
#   xlab="l2fc", ylab="-log10(padj)", main=paste(nrow(anadiff_res), "genes")
# )
# points(anadiff_res[idx,]$log2FoldChange, -log10(anadiff_res[idx,]$padj), col=2)
# abline(h=-log10(padj_thresh), v=c(-log2(1.5), log2(1.5)), col=1, lty=2)
# legend("topright", pch=1, col=2, legend=paste0("abs(z_score)<", z_score_thresh, " (", length(idx), " genes)"))

DE analysis on no-cnv samples

if (!file.exists('tmp/dmprocrdata_result_lin_reg_no_cnv.rds')){
  result_lin_reg_no_cnv = noCNV_diffAnalysis (data_ntrscr = data_ntrscr,
                                       data_cnv = data_cnv, 
                                       exp_grp = exp_grp, 
                                       gene_list = result_lin_reg)
  saveRDS(result_lin_reg_no_cnv, 'tmp/dmprocrdata_result_lin_reg_no_cnv.rds')
}
result_lin_reg_no_cnv = readRDS('tmp/dmprocrdata_result_lin_reg_no_cnv.rds')

head(result_lin_reg_no_cnv)
dim(result_lin_reg_no_cnv)

#plot the result
# padj_thresh = 0.001
# z_score_thresh = 0.3
# noCNVlog2FC_thres = 2
# size_vec = result_lin_reg_no_cnv$padj < padj_thresh
# col_vec = result_lin_reg_no_cnv$z_score > (-z_score_thresh) & result_lin_reg_no_cnv$z_score < z_score_thresh 
# shape_vec = (result_lin_reg_no_cnv$noCNVlog2FC > noCNVlog2FC_thres | result_lin_reg_no_cnv$noCNVlog2FC < (-noCNVlog2FC_thres) ) 
# size_vec= as.vector(as.factor(size_vec))
# size_vec[size_vec == FALSE] = "1"
# size_vec[size_vec == TRUE] = "2"
# size_vec[is.na(size_vec)] = "0"
# ggplot2::ggplot(data=result_lin_reg_no_cnv, ggplot2::aes(y=-log10(padj), x=log2FoldChange, colour=col_vec, shape = shape_vec, size = as.factor(size_vec))) + ggplot2::geom_point() + 
#   ggplot2::scale_shape_manual(labels = c("ns","noCNVlog2FC<-2 & noCNVlog2FC>2", "NA"), values = c(3,18, 18)) + 
#   ggplot2::scale_colour_manual(labels = c("ns","0.3<zscore<0.001", "na"), values = c("grey","red", "lightgray")) +
#   ggplot2::geom_point() + ggplot2::geom_hline(yintercept = -log10(padj_thresh)) +
#   ggplot2::scale_size_manual(labels = c("ns", "NA","padj<0.001"), values = c(0,0, 2))

DE with DESeq of candidate genes

Select candidate to perfom a fast analysis

candidates = select_candidates(gene_list = result_lin_reg_no_cnv,
                               z_score_thresh = 0.3, 
                               noCNVlog2FC_thresh= 2)
dim(candidates)
head(candidates)

We perform DESeq2 analysis only on a subset of candidate genes. To be more accurate in the normalization process, with use the size factors calculated on the complete dataset in part 2.3

norm_matrix = matrix(data = rep(norm_trscr$size_factor, dim(candidates)[1]), dim(candidates)[1], length(norm_trscr$size_factor),byrow = TRUE)

dir.create("tmp", showWarnings=FALSE)
if (!file.exists('tmp/dmprocrdata.rds')){
dmprocrdata = RNAseq_diffAnalysis(
  data_trscr = data_trscr         , 
  exp_grp = exp_grp               , 
  gene_list = candidates           , 
  alpha = 0.05                    , 
  contrast = c("tissue_status","patho","normal"),
  normalization_factor = norm_matrix
)
saveRDS(dmprocrdata, 'tmp/dmprocrdata.rds')
}
dmprocrdata = readRDS('tmp/dmprocrdata.rds')



#plot the result.
padj_thresh = 0.001
size_vec = dmprocrdata$result_list$padj < padj_thresh
size_vec= as.vector(as.factor(size_vec))
size_vec[size_vec == FALSE] = "0.5"
size_vec[size_vec == TRUE] = "1.5"
size_vec[is.na(size_vec)] = "0"
ggplot2::ggplot(dmprocrdata$result_list, ggplot2::aes(y=-log10(padj), x=log2FoldChange , size = as.factor(size_vec) )) + ggplot2::geom_point() + ggplot2::geom_hline(yintercept = -log10(padj_thresh)) +
ggplot2::scale_size_manual(labels = c("ns", "NA","padj<0.001"), values = c(0,0, 2)) 


# fad_dmprocrdata = fad_RNAseq_diffAnalysis(
#   data_trscr = data_trscr                                         ,
#   exp_grp = exp_grp                                               ,
#   gene_list = gene_list                                           ,
#   apply_func = epimedtools::monitored_apply                       ,
#   contrast = c("sample","01","11")
# )

# layout(matrix(1:3, 1), respect=TRUE)
#
# plot(fad_dmprocrdata$result_list$log2FoldChange, dmprocrdata$result_list$log2FoldChange)
# abline(h=c(-log2(1.5), log2(1.5)), v=c(-log2(1.5), log2(1.5)), col=2)
#
# plot(-log10(fad_dmprocrdata$result_list$padj), -log10(dmprocrdata$result_list$padj), pch=".")
# abline(h=-log10(0.05), v=-log10(0.05), col=2)
# abline(h=-log10(padj_thresh), v=-log10(padj_thresh), col=2)

anadiff_res = dmprocrdata$result_list
plot(anadiff_res$log2FoldChange, -log10(anadiff_res$padj),
  pch=16, col=adjustcolor(1, alpha.f=0.1),
  xlab="l2fc", ylab="-log10(padj)", main=paste(nrow(anadiff_res), "genes")
)
abline(h=-log10(padj_thresh), v=c(-log2(1.5), log2(1.5)), col=1, lty=2)

Generates a candidate table

candidates = select_candidates(gene_list = dmprocrdata$result_list,
                               padj_thresh = 0.001, 
                               z_score_thresh = 0.3, 
                               noCNVlog2FC_thres = 2)
dim(candidates)

TO DO: FCh: Plot a Venn diagram of each filter (padj_thresh, z_score_thresh, noCNVlog2FC_thres)

Personalized method

This method enables selection of candidate regions based on 2 criteria:

TO DO: developper la méthode. J aimerais génerer une table du type data_trscr avec 1 pour diffenretiellement exprimé et 0 pour les autres. L analyse est a effectuer sur les données normalisées par DEseq

Definition of DE samples

This method is based on penda R package.

Selection of DE samples with no CNVs

Differential methylation analysis

For each candidate gene, we establish a profile of differential methylation based on the interpolation of methylation values between all the probes in the neighborhood of the TSS of that gene (VanderKraats, PMID: 23748561).

First we define the list of candidates and the corresponding BED file.

#gene_list_candidates = candidates[1:100,]
#random_idx = round(runif(dim(candidates)[1], min=1, max =dim(gene_list)[1]))
#random_idx = round(runif(100, min=1, max =dim(gene_list)[1])) #take 100 genes randomly
idx_to_remove = rownames(gene_list) %in% rownames(candidates)
gene_list_no_candidates = gene_list[!idx_to_remove, ]
dim(gene_list_no_candidates)
random_genes = sample(rownames(gene_list_no_candidates), 100)

test_names = c(rownames(candidates), random_genes)

#length(which(rownames(gene_list[random_idx,]) %in% rownames(gene_list[random_idx,])))
label_vec = c(rep("cand", dim(candidates)[1]), rep("rand", length(random_genes)))
gene_list_candidates = gene_list[test_names, ]
gene_list_candidates$label = label_vec
dim(gene_list_candidates)
head(gene_list_candidates)

TO DO: Homogénéiser la typo du code (à partir d ici j ai repris le code de paul en changeant la typo mais ce n est pas suffisant)

Prepare a differential methylation table

First we define a list of control methylation samples ("11") and tumoral methylation samples ("01").

TO DO: on pourrait utiliser ici le meme concept que ce que j'ai fait dans les fonctions précédentes, c'est à dire spécifier un contrast=c("sample", "01", "11") au lieu de vecteur de reférence idx_meth_tumor et ControRef...)

#idx_meth_tumor = rownames(exp_grp)[exp_grp$meth == 1 & exp_grp$sample == "01"]
#idx_meth_ctrl = rownames(exp_grp)[exp_grp$meth == 1 & exp_grp$sample == "11"]

The we generate the differential methylation table.

TO DO: Il faut modifier la fonction pour pouvoir ne générer la table de methylation différentielle que pour les sondes correspondant aux gènes candidats. Ici la table est calculée pour toutes les sondes de la platform

diff_meth_data = compute_pop_dm_table( gene_list = gene_list,
                                       exp_grp = exp_grp,
                                       data_meth = data_meth,
                                       contrast = c("tissue_status","patho","normal") ,
                                       pf_meth = pf_meth,
                                       pf_pos_colname = "position",
                                       pf_chr_colname = "seqnames",
                                       updwn_str = 5000,         
                                       slide = 500, 
                                       apply_func = epimedtools::monitored_apply)

dim(diff_meth_data)

RQ: la fonction de Paul renvoie une liste contenant data, platform et exp_grp. Je ne pense pas que cela soit nécessaire. Il me semblerait pertinent de modifier le code pour ne renvoyer qu une table de methylation differentielle. Mais il faudra verifier que cela fonctionne dans le reste des fonctions

3.2.2 Generate the corresponding differential methylation profiles

Then we generate the differential methylatin profiles of interest, by specifying the minimum number of probes to consider a gene as informative (nbProbes), and the bining window(slide).

TO DO: Attention, la fonction getalldmProfile va chercher en dur la colonne geneSymbol du bedfile, alors que celle ci n existe pas. Il faudra modifier cela selon si on decide d en faire une variable dynamique dans tout le package, ou de l appeler gene_id (comme dans le fichier gene_list de dmprocrdata) et de l imposer dans le structure des données

#Identifies probes for each gene of interest
pf_pos_colname = "position"
pf_chr_colname = "seqnames"
updwn_str = 5000
slide = 500
filename = 'tmp/profiles.rds'
probe_idx = epimedtools::monitored_apply(gene_list_candidates, 1, get_probe_names, pf_meth = pf_meth, pf_chr_colname = pf_chr_colname, pf_pos_colname = pf_pos_colname, up_str=updwn_str+slide, dwn_str=updwn_str+slide)

#Filter genes with sufficient ammount of probes
nb_probes = 7
idx = sapply(probe_idx, length) > nb_probes


#Compute the profile on genes of interest
tmp_candidates = gene_list_candidates[idx, ] #bed_file of genes of interest
probe_idx = probe_idx[idx]
probe_idx = unique(unlist(probe_idx))
tmp_pf_meth = pf_meth[probe_idx,] #methylation platform corresponding to tmp_candidates bed file

#gene_list_candidates = gene_list_candidates[idx, ]
#probe_idx = probe_idx[idx]
#probe_idx = unique(unlist(probe_idx))
#names(probe_idx) = NULL

#meth_platform = pf_meth[probe_idx,]
#meth_platform = meth_platform[order(meth_platform[[pf_chr_colname]], meth_platform[[pf_pos_colname]]),]
#data_meth = data_meth[rownames(data_meth),]


pf_pos_colname = "position"
pf_chr_colname = "seqnames"
updwn_str = 5000           
wig_size = 20       
slide = 500          
mask_wide = 200  
filename = 'tmp/profiles.rds'

if (!file.exists(filename)){

  profiles = epimedtools::monitored_apply(mod=1,
    tmp_candidates, 1, compute_gene_meth_profile, 
    exp_grp = exp_grp,
    type_of_analysis = "pop", #comment in performing individual analysis
    #type_of_analysis = "indiv", #comment in performing population analysis
    contrast=c("sample","01","11"), #comment in performing individual analysis
    #indiv_filtering_matrix = DE_matrix, #comment in performing population analysis
    data_meth = diff_meth_data                          , 
    pf_meth = tmp_pf_meth                       , 
    pf_pos_colname=pf_pos_colname                     , 
    pf_chr_colname=pf_chr_colname                     , 
    mask_wide=mask_wide                               ,
    updwn_str=updwn_str,
    slide=slide, 
    wig_size=wig_size
  )
  saveRDS(profiles, filename)
}
profiles = readRDS(filename)

Here is a differential methylation profile example:

plot_meth_profile(profiles[[1]])

Comparison and classification of differential methylation (DM) profiles

We then define a specific metric that quantifies the distance between each pair of genes. Based on this metric, we perform hierarchical clustering on the distance matrix. This step allows to define groups of genes with similar DM profiles.

Calculation of the distance between each profile

dist_mat_orig = dmDistance(profiles = profiles)
geneNames    <- getProfileListID(dmprofileList = profiles)
head(dist_mat_orig)
dim(dist_mat_orig)
dist_mat = dist_mat_orig

sum(is.na(dist_mat))
dist_mat[is.na(dist_mat)] = mean(dist_mat, na.rm=TRUE)
fit = cmdscale(dist_mat, eig = TRUE, k = 2)

x = fit$points[, 1]
y = fit$points[, 2]

col_vec = as.factor(gene_list_candidates[names(geneNames), ]$label)

plot(x, y, col=col_vec, pch=18)
legend("topright", legend = c("candidates","ctrl"), text.col=c("black", "red"))

# list_dist_mat = dmDistance_translocate(dmprofileList = profiles, slide = 500)

Hierarchical clustering of the disctance matrix

TO DO: Il y a un probleme avec la fonction clustdmProfile, la selection manuelle des clusters ne fonctionne qu a partir de 20 genes. Il faut regler ce bug, et voir si la seletion manuelle des cluster est bien ce qu il y a de mieux adapté. Sinon ce serait bien que cette fonction renvoie la gene_list initiale aggrémentée d une colonne cluster avec le numero des clusters associés.

# geneNames    <- getProfileListID(dmprofileList = profiles)
#clust_result <- clustdmProfile(mat = dist_mat, geneLabels = geneNames)

3.4. Enrichment analysis

On each group of genes identified by the clustering having a significant demethylation, we perform standard enrichment analysis to estimate the functionality of each group. Then, we infer a typical epigenomic profile for each group in normal tissues to relate regional epigenome properties in normal tissue to the risk to be an epigenetically regulated domains in lung cancer.

magrichard/dmprocrdata documentation built on May 24, 2019, 5:07 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

magrichard/dmprocrdata
A set of data dedicated to the dmprocr R package

In magrichard/dmprocrdata: A set of data dedicated to the dmprocr R package

Introduction

Dataset

data_trscr

data_cnv

data_meth

gene_list

pf_meth

exp_grp

Missing data summary

Method

Identification of differentially expressed region

Normalisation of the expression data

Global method

Linear regression counts ~ cnv

DE analysis on no-cnv samples

DE with DESeq of candidate genes

Generates a candidate table

Personalized method

Definition of DE samples

Selection of DE samples with no CNVs

Differential methylation analysis

Prepare a differential methylation table

3.2.2 Generate the corresponding differential methylation profiles

Comparison and classification of differential methylation (DM) profiles

Calculation of the distance between each profile

Hierarchical clustering of the disctance matrix

3.4. Enrichment analysis

R Package Documentation

Browse R Packages

We want your feedback!

magrichard/dmprocrdata A set of data dedicated to the dmprocr R package

In magrichard/dmprocrdata: A set of data dedicated to the dmprocr R package

Introduction

Dataset

data_trscr

data_cnv

data_meth

gene_list

pf_meth

exp_grp

Missing data summary

Method

Identification of differentially expressed region

Normalisation of the expression data

Global method

Linear regression counts ~ cnv

DE analysis on no-cnv samples

DE with DESeq of candidate genes

Generates a candidate table

Personalized method

Definition of DE samples

Selection of DE samples with no CNVs

Differential methylation analysis

Prepare a differential methylation table

3.2.2 Generate the corresponding differential methylation profiles

Comparison and classification of differential methylation (DM) profiles

Calculation of the distance between each profile

Hierarchical clustering of the disctance matrix

3.4. Enrichment analysis

R Package Documentation

Browse R Packages

We want your feedback!

magrichard/dmprocrdata
A set of data dedicated to the dmprocr R package