data-raw/README.md

title: "Example data for MLML2R package" output: html_document: keep_md: yes toc: true number_sections: true

Getting publicly available data

We will use the dataset from Field et al. (2015), which consists of eight DNA samples from the same DNA source treated with oxBS-BS and hybridized to the Infinium 450K array.

The steps shown in this section follows the vignette from minfi package.

We start with the steps to get the raw data from the GEO repository. The dataset from Field et al. (2015) is available at GEO accession GSE63179.

The sample was divided into four BS and four oxBS replicates.

Platform used: GPL16304 Illumina HumanMethylation450 BeadChip [UBC enhanced annotation v1.0]

Samples:

This example has the following dependencies:

library(minfi)
library(GEOquery)

Use the following commands to install these packages in R:

source("http://www.bioconductor.org/biocLite.R")
biocLite(c("minfi", "GEOquery"))
## https://ftp.ncbi.nlm.nih.gov/geo/series/GSE63nnn/GSE63179/suppl/
## OK
## Warning in download.file(file.path(url, i), destfile =
## file.path(storedir, : URL https://ftp.ncbi.nlm.nih.gov/geo/series/GSE63nnn/
## GSE63179/suppl///geo/series/GSE63nnn/GSE63179/: cannot open destfile '/
## Users/imac/Documents/GitHub/MLML2R/data-raw/GSE63179//geo/series/GSE63nnn/
## GSE63179/', reason 'No such file or directory'
## Warning in download.file(file.path(url, i), destfile =
## file.path(storedir, : download had nonzero exit status
## [1] "GSM1543269_9373551079_R01C01_Grn.idat.gz"
## [2] "GSM1543269_9373551079_R01C01_Red.idat.gz"
## [3] "GSM1543270_9373551079_R01C02_Grn.idat.gz"
## [4] "GSM1543270_9373551079_R01C02_Red.idat.gz"
## [5] "GSM1543271_9373551079_R02C01_Grn.idat.gz"
## [6] "GSM1543271_9373551079_R02C01_Red.idat.gz"
getGEOSuppFiles("GSE63179")
untar("GSE63179/GSE63179_RAW.tar", exdir = "GSE63179/idat")
head(list.files("GSE63179/idat", pattern = "idat"))

Decompress the compressed IDAT files:

idatFiles <- list.files("GSE63179/idat", pattern = "idat.gz$", full = TRUE)
sapply(idatFiles, gunzip, overwrite = TRUE)
## GSE63179/idat/GSM1543269_9373551079_R01C01_Grn.idat.gz 
##                                                8095291 
## GSE63179/idat/GSM1543269_9373551079_R01C01_Red.idat.gz 
##                                                8095291 
## GSE63179/idat/GSM1543270_9373551079_R01C02_Grn.idat.gz 
##                                                8095286 
## GSE63179/idat/GSM1543270_9373551079_R01C02_Red.idat.gz 
##                                                8095286 
## GSE63179/idat/GSM1543271_9373551079_R02C01_Grn.idat.gz 
##                                                8095284 
## GSE63179/idat/GSM1543271_9373551079_R02C01_Red.idat.gz 
##                                                8095284 
## GSE63179/idat/GSM1543272_9373551079_R02C02_Grn.idat.gz 
##                                                8095285 
## GSE63179/idat/GSM1543272_9373551079_R02C02_Red.idat.gz 
##                                                8095285 
## GSE63179/idat/GSM1543273_9373551079_R03C01_Grn.idat.gz 
##                                                8095286 
## GSE63179/idat/GSM1543273_9373551079_R03C01_Red.idat.gz 
##                                                8095286 
## GSE63179/idat/GSM1543274_9373551079_R04C01_Grn.idat.gz 
##                                                8095287 
## GSE63179/idat/GSM1543274_9373551079_R04C01_Red.idat.gz 
##                                                8095287 
## GSE63179/idat/GSM1543275_9373551079_R05C01_Grn.idat.gz 
##                                                8095290 
## GSE63179/idat/GSM1543275_9373551079_R05C01_Red.idat.gz 
##                                                8095290 
## GSE63179/idat/GSM1543276_9373551079_R06C01_Grn.idat.gz 
##                                                8095287 
## GSE63179/idat/GSM1543276_9373551079_R06C01_Red.idat.gz 
##                                                8095287

Now we read the IDAT files in the directory:

rgSet <- read.metharray.exp("GSE63179/idat")
rgSet
## class: RGChannelSet 
## dim: 622399 8 
## metadata(0):
## assays(2): Green Red
## rownames(622399): 10600313 10600322 ... 74810490 74810492
## rowData names(0):
## colnames(8): GSM1543269_9373551079_R01C01
##   GSM1543270_9373551079_R01C02 ... GSM1543275_9373551079_R05C01
##   GSM1543276_9373551079_R06C01
## colData names(0):
## Annotation
##   array: IlluminaHumanMethylation450k
##   annotation: ilmn12.hg19
pData(rgSet)
## DataFrame with 8 rows and 0 columns
sampleNames(rgSet)
## [1] "GSM1543269_9373551079_R01C01" "GSM1543270_9373551079_R01C02"
## [3] "GSM1543271_9373551079_R02C01" "GSM1543272_9373551079_R02C02"
## [5] "GSM1543273_9373551079_R03C01" "GSM1543274_9373551079_R04C01"
## [7] "GSM1543275_9373551079_R05C01" "GSM1543276_9373551079_R06C01"

The file names consists of a GEO identifier (the GSM part) followed by a standard IDAT naming convention with a 10 digit number which is an array identifier followed by an identifier of the form R01C01. This is because each array actually allows for the hybridization of 12 samples in a 6x2 arrangement. The 9373551079_R01C01 means row 1 and column 1 on chip 9373551079.

We need to identify the samples from different methods: BS-conversion, oxBS-conversion.

geoMat <- getGEO("GSE63179")
pD.all <- pData(geoMat[[1]])
pD <- pD.all[, c("title", "geo_accession", "characteristics_ch1.1", "characteristics_ch1.2","characteristics_ch1.3")]
pD
##                   title geo_accession characteristics_ch1.1
## GSM1543269   brain-BS-1    GSM1543269          gender: Male
## GSM1543270 brain-oxBS-3    GSM1543270          gender: Male
## GSM1543271   brain-BS-2    GSM1543271          gender: Male
## GSM1543272 brain-oxBS-4    GSM1543272          gender: Male
## GSM1543273   brain-BS-3    GSM1543273          gender: Male
## GSM1543274   brain-BS-4    GSM1543274          gender: Male
## GSM1543275 brain-oxBS-1    GSM1543275          gender: Male
## GSM1543276 brain-oxBS-2    GSM1543276          gender: Male
##            characteristics_ch1.2 characteristics_ch1.3
## GSM1543269               age: 25    bisulfite_proc: BS
## GSM1543270               age: 25  bisulfite_proc: oxBS
## GSM1543271               age: 25    bisulfite_proc: BS
## GSM1543272               age: 25  bisulfite_proc: oxBS
## GSM1543273               age: 25    bisulfite_proc: BS
## GSM1543274               age: 25    bisulfite_proc: BS
## GSM1543275               age: 25  bisulfite_proc: oxBS
## GSM1543276               age: 25  bisulfite_proc: oxBS
names(pD)[c(3,4,5)] <- c("gender", "age","method")
pD$gender <- sub("^gender: ", "", pD$gender)
pD$age <- sub("^age: ", "", pD$age)
pD$method <- sub("^bisulfite_proc: ","",pD$method)

We now need to merge this pheno data into the methylation data. The following are commands to make sure we have the same row identifier in both datasets before merging.

sampleNames(rgSet) <- sapply(sampleNames(rgSet),function(x) strsplit(x,"_")[[1]][1])
rownames(pD) <- pD$geo_accession
pD <- pD[sampleNames(rgSet),]
pData(rgSet) <- as(pD,"DataFrame")
rgSet
## class: RGChannelSet 
## dim: 622399 8 
## metadata(0):
## assays(2): Green Red
## rownames(622399): 10600313 10600322 ... 74810490 74810492
## rowData names(0):
## colnames(8): GSM1543269 GSM1543270 ... GSM1543275 GSM1543276
## colData names(5): title geo_accession gender age method
## Annotation
##   array: IlluminaHumanMethylation450k
##   annotation: ilmn12.hg19

Preprocessing

We refer the reader to the minfi package tutorials for more preprocessing options.

We need to install the required package bellow:

source("https://bioconductor.org/biocLite.R")
biocLite("IlluminaHumanMethylation450kmanifest")

First, we removed probes with detection p-value <0.01 in any of the 8 arrays. The function detectionP identifies failed positions defined as both the methylated and unmethylated channel reporting background signal levels.

detP <- detectionP(rgSet)
failed <- detP >0.01
## Keep probes which failed in at most maxFail arrays (0 = the probe passed in all arrays)
maxFail<- 0
keep_probes <- rowSums(failed) <= maxFail

We kept $83\%$ of the probes according to this criterion.

The rgSet object is a class called RGChannelSet which represents two color data with a green and a red channel. We will use, as input in the MLML funcion, a MethylSet, which contains the methylated and unmethylated signals. The most basic way to construct a MethylSet is to using the function preprocessRaw which uses the array design to match up the different probes and color channels to construct the methylated and unmethylated signals. Here we will use the preprocessNoob function, which does the preprocessing and returns a MethylSet.

Arrays were then normalized using the Noob/ssNoob preprocessing method for Infinium methylation microarrays.

From a MethylSet it is easy to compute Beta values, defined as:

Beta = Meth / (Meth + Unmeth + c)

The c constant is chosen to avoid dividing with small values. Illumina uses a default of c=100. The function getBeta from minfi package can be used to obtain the Beta values.

MSet.noob<- preprocessNoob(rgSet[keep_probes,])
## [preprocessNoob] Applying R/G ratio flip to fix dye bias...

Prepare de input data:

MethylatedBS <- getMeth(MSet.noob)[,c(1,3,5,6)]
UnMethylatedBS <- getUnmeth(MSet.noob)[,c(1,3,5,6)]

MethylatedOxBS <- getMeth(MSet.noob)[,c(7,8,2,4)]
UnMethylatedOxBS <- getUnmeth(MSet.noob)[,c(7,8,2,4)]

This was included in the package as example:

set.seed(112017)

a <- sample(1:dim(MethylatedBS)[1],100,replace=FALSE)

MethylatedBS <- MethylatedBS[a,1:2]
UnMethylatedBS <- UnMethylatedBS[a,1:2]
MethylatedOxBS <- MethylatedOxBS[a,1:2]
UnMethylatedOxBS <- UnMethylatedOxBS[a,1:2]

colnames(MethylatedBS) <- c("sample1","sample2")
colnames(UnMethylatedBS) <- c("sample1","sample2")
colnames(MethylatedOxBS) <- c("sample1","sample2")
colnames(UnMethylatedOxBS) <- c("sample1","sample2")

N_BS <- round(MethylatedBS+UnMethylatedBS)

N_OxBS <- round(MethylatedOxBS+UnMethylatedOxBS)

N_TAB <- pmax(N_BS,N_OxBS)

p_m=.3
p_h=0.2
p_u=.5

true_parameters_sim <- data.frame(p_m=.3,p_h=.2,p_u=.5)
save(true_parameters_sim,file="true_parameters_sim.rda")

set.seed(2017)
MethylatedBS_sim <- apply(N_BS, c(1,2), function(x) rbinom(n=1, size=x, prob=(p_m+p_h)))

UnMethylatedBS_sim <- N_BS - MethylatedBS_sim

MethylatedOxBS_sim <- apply(N_OxBS, c(1,2), function(x) rbinom(n=1, size=x, prob=p_m))

UnMethylatedOxBS_sim <- N_OxBS - MethylatedOxBS_sim

MethylatedTAB_sim <- apply(N_TAB, c(1,2), function(x) rbinom(n=1, size=x, prob=p_h))

UnMethylatedTAB_sim <- N_TAB - MethylatedTAB_sim


save(MethylatedBS_sim,file="MethylatedBS_sim.rda")
save(UnMethylatedBS_sim,file="UnMethylatedBS_sim.rda")
save(MethylatedOxBS_sim,file="MethylatedOxBS_sim.rda")
save(UnMethylatedOxBS_sim,file="UnMethylatedOxBS_sim.rda")
save(MethylatedTAB_sim,file="MethylatedTAB_sim.rda")
save(UnMethylatedTAB_sim,file="UnMethylatedTAB_sim.rda")

Second Example Dataset

MethylatedBS <- getMeth(MSet.noob)[,c(1,3,5,6)]
UnMethylatedBS <- getUnmeth(MSet.noob)[,c(1,3,5,6)]

MethylatedOxBS <- getMeth(MSet.noob)[,c(7,8,2,4)]
UnMethylatedOxBS <- getUnmeth(MSet.noob)[,c(7,8,2,4)]

colnames(MethylatedBS) <- c("sample1","sample2","sample3","sample4")
colnames(UnMethylatedBS) <- c("sample1","sample2","sample3","sample4")
colnames(MethylatedOxBS) <- c("sample1","sample2","sample3","sample4")
colnames(UnMethylatedOxBS) <- c("sample1","sample2","sample3","sample4")

CpG <- rownames(MethylatedBS)

N_BS <- round(MethylatedBS+UnMethylatedBS)

N_OxBS <- round(MethylatedOxBS+UnMethylatedOxBS)

N_TAB <- pmax(N_BS,N_OxBS)

# Use estimates from dataset as base for the true parameters in the simulation
library(MLML2R)
## 
## Attaching package: 'MLML2R'
## The following objects are masked _by_ '.GlobalEnv':
## 
##     MethylatedBS_sim, MethylatedOxBS_sim, MethylatedTAB_sim,
##     UnMethylatedBS_sim, UnMethylatedOxBS_sim, UnMethylatedTAB_sim
results_exact <- MLML(T = MethylatedBS , U = UnMethylatedBS, L = UnMethylatedOxBS, M = MethylatedOxBS)


set.seed(2017)

temp1 <- data.frame(n=as.vector(N_BS),p_m=c(results_exact$mC[,1],results_exact$mC[,1],results_exact$mC[,1],results_exact$mC[,1]),p_h=c(results_exact$hmC[,1],results_exact$hmC[,1],results_exact$hmC[,1],results_exact$hmC[,1]))

MethylatedBS_temp <- c()
for (i in 1:dim(temp1)[1])
{
  MethylatedBS_temp[i] <- rbinom(n=1, size=temp1$n[i], prob=(temp1$p_m[i]+temp1$p_h[i]))
}

UnMethylatedBS_sim2 <- matrix(N_BS - MethylatedBS_temp,ncol=4)
MethylatedBS_sim2 <- matrix(MethylatedBS_temp,ncol=4)

temp1 <- data.frame(n=as.vector(N_OxBS),p_m=c(results_exact$mC[,1],results_exact$mC[,1],results_exact$mC[,1],results_exact$mC[,1]),p_h=c(results_exact$hmC[,1],results_exact$hmC[,1],results_exact$hmC[,1],results_exact$hmC[,1]))

MethylatedOxBS_temp <- c()
for (i in 1:dim(temp1)[1])
{
  MethylatedOxBS_temp[i] <- rbinom(n=1, size=temp1$n[i], prob=temp1$p_m[i])
}

UnMethylatedOxBS_sim2 <- matrix(N_OxBS - MethylatedOxBS_temp,ncol=4)
MethylatedOxBS_sim2 <- matrix(MethylatedOxBS_temp,ncol=4)


temp1 <- data.frame(n=as.vector(N_TAB),p_m=c(results_exact$mC[,1],results_exact$mC[,1],results_exact$mC[,1],results_exact$mC[,1]),p_h=c(results_exact$hmC[,1],results_exact$hmC[,1],results_exact$hmC[,1],results_exact$hmC[,1]))

MethylatedTAB_temp <- c()
for (i in 1:dim(temp1)[1])
{
  MethylatedTAB_temp[i] <- rbinom(n=1, size=temp1$n[i], prob=temp1$p_h[i])
}


UnMethylatedTAB_sim2 <- matrix(N_TAB - MethylatedTAB_temp,ncol=4)
MethylatedTAB_sim2 <- matrix(MethylatedTAB_temp,ncol=4)


set.seed(112017)

a <- sample(1:dim(MethylatedBS)[1],1000,replace=FALSE)

MethylatedBS_sim2 <- MethylatedBS_sim2[a,]
UnMethylatedBS_sim2 <- UnMethylatedBS_sim2[a,]
MethylatedOxBS_sim2 <- MethylatedOxBS_sim2[a,]
UnMethylatedOxBS_sim2 <- UnMethylatedOxBS_sim2[a,]
MethylatedTAB_sim2 <- MethylatedTAB_sim2[a,]
UnMethylatedTAB_sim2 <- UnMethylatedTAB_sim2[a,]

colnames(MethylatedBS_sim2) <- c("sample1","sample2","sample3","sample4")
colnames(UnMethylatedBS_sim2) <- c("sample1","sample2","sample3","sample4")
colnames(MethylatedOxBS_sim2) <- c("sample1","sample2","sample3","sample4")
colnames(UnMethylatedOxBS_sim2) <- c("sample1","sample2","sample3","sample4")

rownames(MethylatedBS_sim2) <- CpG[a]
rownames(UnMethylatedBS_sim2) <- CpG[a]
rownames(MethylatedOxBS_sim2) <- CpG[a]
rownames(UnMethylatedOxBS_sim2) <- CpG[a]
rownames(MethylatedTAB_sim2) <- CpG[a]
rownames(UnMethylatedTAB_sim2) <- CpG[a]


save(MethylatedBS_sim2,file="MethylatedBS_sim2.rda")
save(UnMethylatedBS_sim2,file="UnMethylatedBS_sim2.rda")
save(MethylatedOxBS_sim2,file="MethylatedOxBS_sim2.rda")
save(UnMethylatedOxBS_sim2,file="UnMethylatedOxBS_sim2.rda")
save(MethylatedTAB_sim2,file="MethylatedTAB_sim2.rda")
save(UnMethylatedTAB_sim2,file="UnMethylatedTAB_sim2.rda")


true_parameters_sim2 <- data.frame(p_m=results_exact$mC[,1],p_h=results_exact$hmC[,1])
true_parameters_sim2$p_u <- 1-true_parameters_sim2$p_m-true_parameters_sim2$p_h
true_parameters_sim2 <- true_parameters_sim2[a,]
save(true_parameters_sim2,file="true_parameters_sim2.rda")


samarafk/MLML2R documentation built on Oct. 19, 2019, 6:04 p.m.