1 Introduction

This package provides a inferential analysis method for detecting differentially expressed CpG sites in MeDIP-seq data. It uses statistical framework and EM algorithm, to identify differentially expressed CpG sites. The methods on this package are described in the article Methylation-level Inferences and Detection of Differential Methylation with Medip-seq Data by Yan Zhou, Jiadi Zhu, Mingtao Zhao, Baoxue Zhang, Chunfu Jiang and Xiyan Yang (2018, pending publication).

SIMD method is developed for jointly analyzing MeDIP-seq and MRE-seq data, which are derived from methylated DNA immunoprecipitation (MeDIP) experiments followed by sequencing (MeDIP-seq) and methyl-sensitive restriction enzymes experiments for unmethylated CpGs (MRE-seq). We have implemented the SIMD method via a set of R functions with the computational intensive parts written in C. We make a R package named SIMD, which is the abbreviation of Statistical Inferences with MeDIP-seq Data, and give a tutorial for the package. The method consist three steps.

Step 1: Data Pre-processing;

Step 2: Calculating p-values of each CpG site by the EM algorithm;

Step 3: Select Significants.

We use a part of a real dataset to illustrate the usage of the SIMD package. The programs can run under the Linux system and windows 10 system. The R versions should larger than 3.5.0.

2 Preparations

Before installing SIMD package, the user must install three other R pack- ages, which can be done using the following commands:

install.packages("BiocManager")
BiocManager::install(c('edgeR', 'statmod','methylMnM'))

library(edgeR)
library(statmod)
library(methylMnM)

Next, to install the SIMD package into your R environment, start R and enter:

BiocManager::install("SIMD")

Then, the SIMD package is ready to load.

library(SIMD)

3 Data format

In order to reproduce the presented SIMD workflow, the package includes the example data sets, which is a part of a real data sets, including:

all_CpGsite_bin_chr18,
three_mre_cpg,
EM2_H1ESB1_MeDIP_sigleCpG,
EM2_H1ESB2_MeDIP_sigleCpG,
H1ESB1_MRE_sigleCpG,
and H1ESB2_MRE_sigleCpG, which is included in file example_data.RDdata in the data subdirectory of the SIMD package.

The files contain genomic regions from chromosome 18 only, as covered by short reads obtained from a MeDIP and MRE experiment of human H1ESB1 cells.

All output files are in a .bed format. The input file contain the following columns.

For MRE-seq data, we need "+" and "-" strand information in mapping process, which is general located at the sixth column of the input file. Each row represents a mapped read. These information can be extracted from the output file(s) of common mapping tools.

4 Data Pre-processing

The SIMD program requires users to provide the genome-wide MeDIP-seq and MRE-seq reads and the reference genome. The pre-processing step involves calculation of MeDIP-seq and MRE-seq count and CpG and MRE-specific count in each CpG site. The mainly steps please refer the data pre-processing steps of R package MethylMnM.

In this manual, we use example data located in the data subdirectory of the SIMD package, and the example data is processed by MethylMnM package in advance. The path is:

datafile <- system.file("extdata", package="methylMnM")
filepath <- datafile[1]

The CpG count, MRE-CpG count, MeDIP-seq count and MRE-seq count of each site are stored at:

dirwrite <- paste(setwd(getwd()), "/", sep="")

Then we compute the number of actual short reads in each CpG site by the function of EMalgorithm(), which use EM algorithm to infer the actual reads by the observation fragments. The results are saved in file writefile and reportfile. We give an example as follows, the data is from data subdirectory of the SIMD package:

data(example_data)
allcpgfile <- EM_H1ESB1_MeDIP_sigleCpG
readshort <- paste(filepath, "/H1ESB1_MeDIP_18.extended.txt", sep="")
writefile <- paste(dirwrite, "EM2_H1ESB1_MeDIP_sigleCpG.bed", sep="")
reportfile <- paste(dirwrite, "EM2_H1ESB1_MeDIP_sigleCpG_report.bed", sep="")
EMalgorithm(cpgsitefile=readshort, allcpgfile=allcpgfile, category="1",
            writefile=writefile, reportfile=reportfile)

5 p-values of SIMD test

In order to detect different methylated CpG sites, we should calculate p-value of each site. The below codes are calculate p-value of each site with function EMtest(). The input files which include "datafile", "cpgfile" and "mrecpgfile" are should have been generated by Step 1. The output file "writefile" will owneleven columns, that is, "chr", "chrSt","chrEnd", "Medip1","Medip2", "MRE1", "MRE2", "cg","mrecg","pvalue","Ts". We also output a reportfile which will include parameters "s1/s2"; "s3/s4"; "N1"; "N2"; "N3"; "N4"; "c1"; "c2" and "Spend time".

data(example_data)
data1 <- EM2_H1ESB1_MeDIP_sigleCpG
data2 <- EM2_H1ESB2_MeDIP_sigleCpG
data3 <- H1ESB1_MRE_sigleCpG
data4 <- H1ESB2_MRE_sigleCpG
datafile <- cbind(data1,data2,data3,data4)
allcpg <- all_CpGsite_bin_chr18
mrecpg <- three_mre_cpg
dirwrite <-paste(setwd(getwd()), "/", sep="")

writefile <- paste(dirwrite, "pval_EM_H1ESB1_H1ESB21.bed", sep="")
reportfile <- paste(dirwrite, "report_pvalH1ESB1_H1ESB21.bed", sep="")

EMtest(datafile=datafile, chrstring=NULL, cpgfile=allcpg,mrecpgfile=mrecpg, 
       writefile=writefile, reportfile=reportfile,mreratio=3/7, psd=2, 
       mkadded=1, f=1)

6 Select Significants

After getting p-values, we choose differentially expressed methylated CpG sites with p-values. The input file is the file which have been generated by Step 2. Then we assume the sites to be differentially expressed methylated CpGs when p-values less than pre-setting cutoffs, such as $10^{-3}$ or $10^{-5}$.



FocusPaka/SIMD documentation built on May 29, 2019, 10:57 a.m.