knitr::opts_chunk$set(echo = TRUE)

Based on DNA methylation values, the vignette shows how to estimate cell-type composition and cell type-specific methylation profiles accounting for confounder variables such as age and sex.

Data

We provide access to a simulated matrix D of a complex tissue DNA methylation values. The data frame exp_grp contains the corresponding experimental data for each patient. Here the data frame contains 22 different variables that include sex and age.

D = readRDS(url("https://zenodo.org/record/3247635/files/D.rds"))
dim(D)
head(D)[1:5, 1:5]
exp_grp = readRDS(url("https://zenodo.org/record/3247635/files/exp_grp.rds"))
dim(exp_grp)
head(exp_grp)[1:5, 1:5]

Step 1: removing correlated probes

In addition to cell-type composition, DNA methylation can vary because of other confounder variables, such as age, sex, and batch effects. We use the function CF_detection to remove probes correlated with confounding factors. By default, the function assumes that all variables contained in the argument exp_grp are potential confounders. Using a linear regression for each of the variables, the function removes probes significantly associated with confounders (false discovery threshold defined by the threshold argument).

D_CF = medepir::CF_detection(D, exp_grp, threshold = 0.15, ncores = 2)

dim(D_CF)

print(paste0("Number of correlated probes removed : ", nrow(D) - nrow(D_CF)))

Step 2: choosing the number of cell types K

To choose the number $K$ of cell types, we used the function plot_k, which performs a Principal Component Analysis and plot the eigenvalues of the probes matrix in descending order.

medepir::plot_k(D_CF)  

To select the number of principal components, we use Cattell rule that recommends to keep principal components that correspond to eigenvalues to the left of the straight line. Here, Cattell rule suggests to keep 4 principal components which corresponds to 5 cell types (the number of cell types is equal to the number of principal components plus one).

Step 3: feature selection

This step is optional. The function feature_selection select probes with the largest variance (5000 probes by default). By removing probes that do not vary, deconvolution routines can run much faster.

D_FS = medepir::feature_selection(D_CF)

dim(D_FS)

Step 4: running deconvolution methods

We propose three methods of deconvolution, of three packages: RFE to run RefFreeEWAS::RefFreeCellMix, MDC to run MeDeCom::runMeDeCom and Edec to run EDec::run_edec_stage_1.

Here we show the results obtained with RefFreeEWAS.

results_RFE = medepir::RFE(D_FS, nbcell = 5)

Results obtained with EDec and MeDeCom can be obtained by uncommenting the following code.

#results_MDC = medepir::MDC(D_FS, nbcell = 5, lambdas = c(0, 10^(-5:-1)))

#infloci = read.table(url("https://zenodo.org/record/3247635/files/inf_loci.txt"),header = FALSE, sep = "\t")
#infloci = as.vector(infloci$V1)
#results_Edec = medepir::Edec(D, nbcell = 5, infloci = infloci)

Step 5: vizualizing results

We can vizualize the matrix of cell-type proportions using a stacked bar plot.

proportions<-(results_RFE$A)
colors <- c("#A7A7A7","#2E9FDF", "#00AFBB", "#E7B800", "#FC4E07")
barplot(proportions,names.arg=1:20,col=colors)


bcm-uga/medepir documentation built on Oct. 11, 2020, 2 a.m.