# In belleau/methInheritSim: Simulating Whole-Genome Inherited Bisulphite Sequencing Data

BiocStyle::markdown()
library(knitr)


Package: r Rpackage("methInheritSim")
Authors: r packageDescription("methInheritSim")[["Author"]]

# Case study

## The simulated dataset

A dataset containing methylation data (6 cases and 6 controls) has been generated using the r Rpackage("methInheritSim") package using a real dataset from Rat experiment (the real dataset is not public yet, so we used a simulation based on it). The data have been formated, using
the r Rpackage("methylkit") package, into a methylBase object (using the r Rpackage("methylkit") functions: filterByCoverage, normalizeCoverage and unite).

## Load  read DMS dataset (not in this case but normaly)
data(samplesForChrSynthetic)

## Print the first three rows of the object


## The simulation

The simulation is run using the runSim function. The outputDir parameter fixes the directory where the results are stored.

## Directory where the files related to the simulation will be saved
temp_dir <- "test_runSim"

## Run the simulation
runSim(methData = samplesForChrSynthetic,  # The dataset use for generate
# the synthetic chr.
nbSynCHR = 1,       # The number of synthetic chromosome
nbSimulation = 2,   # The number of simulation for each parameter
nbBlock = 10, nbCpG = 20, # The number of site in the
# synthetic chr is nbBLock * nbCpG
nbGeneration = 3,    # At least 2 generations must be present
vNbSample = c(3, 6), # The number of controls (= number of cases) in
# each simulation
vpDiff = c(0.9),   # Mean proportion of samples with
# differentially methylated values
vpDiffsd = c(0.1), # Standard deviation associated to vpDiff
vDiff = c(0.8),    # The shift of the mean of the C/T ratio in
# the differentially methylated sites
vInheritance = c(0.5),  # The proportion of cases that inherit
# differentially methylated sites
propInherite = 0.3,     # The proportion of diffementially methylated
# regions that are inherited
rateDiff = 0.3,    # The mean frequency of the differentially
# methylated regions
minRate = 0.2,     # The minimum rate for differentially
# methylated sites
propHetero = 0.5,  # The reduction of vDiff for the following
# generations
keepDiff = FALSE,  # When FALSE, the differentially methylated
# sites are the same in all simulations
outputDir = temp_dir, # Directory where files are saved
fileID = "S1",
runAnalysis = TRUE,
nbCores = 1,
vSeed = 32)        # Fix seed to unable reproductive results

# The files generated
dir(temp_dir)

if (dir.exists(temp_dir)) {
unlink(temp_dir, recursive = TRUE, force = FALSE)
}


# Files generated by the simulation

Three types of files are generated by default:

1. Synthetic chromosome in GRanges format ("syntheticChr" prefix) 
2. Simulation information in GRanges format ("simData" prefix) 
3. Information about DMS state ("stateDiff" prefix)

## Synthetic chromosome in GRanges format ("syntheticChr" prefix)

The first type of files contains information about the synthetic chromosome. This information is stored as a GRanges that contains the CpG (or methylated sites).The GRanges has four metadata inherited from the real dataset:

• chrOri, the chromosome from the real dataset
• startOri, the position of the site in the real dataset
• meanCTRL, the mean of the C/T proportion of the control in the real dataset
• varCTRL, the variance of the C/T proportion of the control in the real dataset

The file name is composed of those elements, separated by "_":

1. The string "syntheticChr"
2. The code of the simulation (the fileID parameter, ex: "S1"")
3. The chromosome number
4. The file extension ".rds"

An example of a valid file name: syntheticChr_S1_1.rds

## The synthetic chromosome

## In GRanges format, only Cpg present


## Simulation information in GRanges format ("simData" prefix)

The second type of files contains information about the simulation stored in a GRanges format. The GRanges object has four metadata related to real dataset:

• meanDiff, the mean of the C/T proportion for the shifted distribution
• meanCTRL.meanCTRL, the mean of the C/T proportion for the control distribution
• partitionCase, the number of cases simulated with the shifted distribution
• partitionCtrl, the number of cases simulated with the control distribution

Plus a metadata for each sample (case or control):

• case.V[number] or ctrl.V[number] the simulated proportion of C/T

The file name is composed of those elements, separated by "_":

1. The string "simData"
2. The code of the simulation (the fileID parameter, ex: "S1")
3. The chromosome number (between 1 and the value of the nbSynCHR)
4. The number of controls as specified by the vNBSample parameter
5. The value of the vpDiff parameter
6. The value of the vDiff parameter
7. The value of the vInheritance parameter
8. The ID of the simulation (between 1 and the value of the nbSimulation)
9. The file extension ".rds"

An example of a valid file name: simData_S1_1_3_0.9_0.8_0.5_1.rds

#### The simulation dataset

#### Information for the first generation F1

#### Information for the second generation F2


## Information about DMS state ("stateDiff" prefix)

The third type of files contains a list with 2 entries. The first entry is called stateDiff and contains a vector of integer (0 and 1) with a length corresponding the length of stateInfo object. The statDiff object indicates, using a 1, the positions where the CpG sites are differentially methylated. The second entry is called statInherite and contains a vector of integer (0 and 1) with a length corresponding the length of stateInfo. The statInherite indicates, using a 1, the positions where the CpG values are inherited.

The file name is composed of those elements, separated by "_":

1. The string "stateDiff"
2. The code of the simulation (the fileID parameter, ex: "S1")
3. The chromosome number (between 1 and the value of the nbSynCHR)
4. The number of controls as specified by the vNBSample parameter
5. The value of the vpDiff parameter
6. The value of the vDiff parameter
7. The value of the vInheritance parameter
8. The ID of the simulation (between 1 and the value of the nbSimulation)
9. The file extension ".rds"

An example of a valid file name: stateDiff_S1_1_3_0.9_0.8_0.5_1.rds

#### The DMS state information

#### In stateDiff, the position of DMS is indicated by 1
#### in stateInherite, the position of inherited DMS is indicated by 1


## Files related to saveGRanges parameter

When saveGRanges parameter is TRUE, the package saves two extra types of files:

1. Raw methylation data for all samples in GRanges format ("methylGR" prefix)
2. Information about controls and cases ("treatment" prefix)

### Raw methylation data for all samples in GRanges format ("methylGR" prefix)

The first type of files is generated for each simulation and contains a list of GRangesList. The length of the list corresponds to the number of generations (as specified by the nbGeneration paramater). The generations are stored in order (first entry = first generation, second entry = second generation, etc..). All samples related to one generations are stored in a GRangesList object. The GRangesList object contains a list of GRanges. Each GRanges stores the raw methylation data of one sample.

There is one file per simulation. The file name is composed of those elements, separated by "_":

1. The string "methylGR"
2. The code of the simulation (the fileID parameter, ex: "S1")
3. The chromosome number (between 1 and the value of the nbSynCHR)
4. The number of controls as specified by the vNBSample parameter
5. The value of the vpDiff parameter
6. The value of the vDiff parameter
7. The value of the vInheritance parameter
8. The ID of the simulation (between 1 and the value of the nbSimulation)
9. The file extension ".rds"

An example of a valid file name: methylGR_S1_1_3_0.9_0.8_0.5_1.rds

#### The raw methylation data in GRanges

#### The third sample of the first generation

#### The fourth sample of the third generation


### Information about controls and cases ("treatment" prefix)

The second type of files contains a numeric vector denoting controls and cases (controls = 0 and cases = 1). One file is generated for each entry in the vNbSample vector parameter.

The file name is composed of those elements, separated by "_":

1. The string "treatment"
2. The code of the simulation (the fileID parameter, ex: "S1")
3. The chromosome number (between 1 and the value of the nbSynCHR)
4. The number of controls as specified by the vNBSample parameter
5. The file extension ".rds"

An example of a valid file name: treatment_S1_1_3.rds

#### The information about controls and cases

#### 0 = control, 1 = case, length = number of samples


## Files related to saveMethylKit parameter

When saveMethylKit is TRUE, one extra file is saved for each generation:

1. Raw methylation data in methylRaw format ("methylObj" prefix)

### Raw methylation data in methylRaw format ("methylObj" prefix)

The file contains the raw methylation information from the simulated dataset formated into S4 methylRaw objects using r Biocpkg("methylKit") package. All samples related to the same generation are contained in a S4 methylRawList object that is present inside a list. The length of the list corresponds to the number of generations. The generations are stored in order (first entry = first generation, second entry = second generation, etc..). The S4 methylRawList object contains two Slots:

1. treatment: A numeric vector denoting controls and cases.
2. .Data: A list of methylRaw objects. Each object stores the raw methylation data of one sample.

There is one file per simulation. The file name is composed of those elements, separated by "_":

1. The string "methylObj"
2. The code of the simulation (the fileID parameter, ex: "S1")
3. The chromosome number (between 1 and the value of the nbSynCHR)
4. The number of controls as specified by the vNBSample parameter
5. The value of the vpDiff parameter
6. The value of the vDiff parameter
7. The value of the vInheritance parameter
8. The ID of the simulation (between 1 and the value of the nbSimulation)
9. The file extension ".rds"

An example of a valid file name: methylObj_S1_1_3_0.9_0.8_0.5_1.rds

## The raw methylation data

#### The third sample of the first generation

#### The fourth sample of the third generation


## Files related to runAnalysis parameter

When runAnalysis is TRUE, two extra files are saved for each simulation:

1. Methylation events present in multiple samples in methylBase format ("meth" prefix)
2. Differential methylation statistics in methylDiff format ("methDiff" prefix)

### Methylation events present in multiple samples in methylBase format ("meth" prefix)

The first file contains the simulated dataset formated with the r Biocpkg("methylKit") package into a S4 methylBase object. The transformation is made using the r Biocpkg("methylKit") functions: filterByCoverage(), normalizeCoverage() and unite(). Each simulation has it own file. Only sites having minimum reads alignment in all samples are present in the file.

The file name is composed of those elements, separated by "_":

1. The string "meth"
2. The code of the simulation (the fileID parameter, ex: "S1")
3. The chromosome number (between 1 and the value of the nbSynCHR)
4. The number of controls as specified by the vNBSample parameter
5. The value of the vpDiff parameter
6. The value of the vDiff parameter
7. The value of the vInheritance parameter
8. The ID of the simulation (between 1 and the value of the nbSimulation)
9. The file extension ".rds"

An example of a valid file name: meth_S1_1_3_0.9_0.8_0.5_1.rds

#### The methylation events present in multiple samples

#### Information for all samples in the first generation


### Differential methylation statistics in methylDiff format ("methDiff" prefix)

The second file contains the result of the differential methylation calculation done on the simulated dataset. Each generation of the dataset is analysed separately using the calculateDiffMeth() function of the r Biocpkg("methylKit") package. A S4 methylDiff object is created for each generation and is stored in the file inside a list (first entry = first generation, second entry = second generation, etc...).

The file name is composed of those elements, separated by "_":

1. The string "methDiff"
2. The code of the simulation (the fileID parameter, ex: "S1")
3. The chromosome number (between 1 and the value of the nbSynCHR)
4. The number of controls as specified by the vNBSample parameter
5. The value of the vpDiff parameter
6. The value of the vDiff parameter
7. The value of the vInheritance parameter
8. The ID of the simulation (between 1 and the value of the nbSimulation)
9. The file extension ".rds"

An example of a valid file name: methDiff_S1_1_3_0.9_0.8_0.5_1.rds

#### The differential methylation statistics

The r Rpackage("methInheritSim") package generates simulated multigenerational DMS datasets. Several simulator parameters can be derived from real dataset provided by the user in order to replicate realistic case-control scenarios.
The results of a simulation could be analysed, using the r Rpackage("methylInheritance") package, to evaluate the power and the false discovery rate of an experiment design.