knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

```{css, echo=FALSE} div, code {overflow-x:auto !important}

# Introduction: What is ShinyÉPICo for?

shinyÉPICo is a web application based on [shiny](https://github.com/rstudio/shiny) and created to accelerate and to streamline the study of Illumina Infinium DNA methylation arrays, including the 450k and EPIC. For this purpose, shinyÉPICo uses, among others, the widely used minfi [@Aryee2014] and limma [@Ritchie2015] packages for the array normalization and the differentially methylated positions (DMPs) calculation, respectively. Moreover, shinyÉPICo also allows calculation of differentially methylated regions calculation (DMRs), based on the mCSEA package [@Martorell-Marugan2019].

It is conceived as a 'graphical pipeline' since, throughout the analysis, you can observe the result of the different options with multiple charts, and interactively modify decisions in current or prior steps. In addition, the application automates calculations that would otherwise take much more time and effort.

shinyÉPICo is a fully graphical application and the calculations performed in R are done entirely on the server-side. Therefore, the application can be used both on the local computer and through a web server, to which devices, such as computers, smartphones or tablets could be connected, without high RAM or CPU requirements and only a web browser as requirement.

The steps followed by the application includes the data importing, starting with iDAT files (Raw Illumina DNA methylation datasets), array normalization, quality control, and data exploring, and DMP and DMR calculation. Each step has multiple options and charts that we will describe in this manual.

## What do I need in order to use ShinyÉPICo? 

ShinyÉPICo can run in GNU/Linux, Windows, or macOS. The package dependencies are automatically installed with the package. 

Since the application allows to follow interactively all the analysis process, many objects have to be stored in RAM memory. Therefore, the application can be memory demanding, especially when trying to analyze a large number of samples at the same time. We recommend at least 12GB of RAM for smooth use of the application, but depending on the number of samples analyzed and whether they are EPIC or 450k, the needs may be lower or higher.

## How can I install ShinyÉPICo?


The package will be available soon from Bioconductor, where it can be installed following standard instructions:

```r
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("shinyepico")

Currently, you can install the application through GitHub using the 'remotes' R package:

install.packages("remotes")
library("remotes")
install_github("omorante/shinyepico", upgrade="always", dependencies = TRUE)

Dependencies and implementation

ShinyÉPICo implements internal functions to follow the array normalization, and the DMP/DMR calculations, including, for example, the differential of beta values calculation between the groups. For the main features of the application, the package dependencies are the following:

The complete list of the application dependencies are:

writeLines(unlist(strsplit(packageDescription("shinyepico")[["Imports"]],", ")))

For DMR calculation, additional suggested packages are required. You can use the application without installing this package, but you will not able to calculate DMRs. Moreover, for the array normalization, you will need to install annotation and manifest packages, for 450k or EPIC.

writeLines(unlist(strsplit(packageDescription("shinyepico")[["Suggests"]],", ")))

And, to run the application:

library("shinyepico")
set.seed(123)
run_shinyepico()

The function run_shinyepico has 4 parameters that can be modified:

As can be seen in the example, we used the set.seed function before starting ShinyÉPICo. We recommend always setting up the same seed, which can be any number, to avoid uncertainties and ensure that the results obtained are reproducible. Particularly, DMRs calculation, that rely on the mCSEA package, utilizes permutations to estimate the p.value. For that reason, some uncertainty is expected, and results can be a bit different every time you run the application. Using a seed avoids this problem, as you will obtain always the same result with the same seed.

ShinyÉPICo workflow

shinyÉPICO workflow is divided into tabs that summarize the different steps to follow:

While it is possible to go back to previous steps and change options at any point in the process, some buttons are disabled until the steps required to perform them are completed. For example, you will not be able to generate a heatmap if you do not complete the normalization of the arrays and the calculation of the DMPs. This design is intended to avoid errors and guide the user through the analysis steps.

In the next sections, we will discuss the different steps, the options, and chart interpretation. Moreover, we will use a public DNA methylation array dataset to show an example of application use.

Using ShinyÉPICO: an explanation of the options and an example with real data

In the following example, iDATs from a study using Illumina EPIC DNA methylation arrays will be used to illustrate the steps [@Li2020]. Although shinyÉPICo can be used with multiple groups and large cohorts, and in that case, iteration and automation are more useful, for this manual we have decided to use a minimal example that can be easily reproduced on almost any computer. Specifically, monocyte and monocyte-derived macrophages (produced with M-CSF) has been selected, only 6 samples (3 monocyte samples and 3 macrophage samples from 3 different healthy donors).

The .zip file of this dataset can be found in the Github repository

In this example, the seed 123 has been set up using the function set.seed() before run_shinyepico().

Data Import and Sample Selection

The first step in the shinyÉPICo workflow is to prepare the data in the proper format. iDAT files should be compressed in a .zip file. The name of the files should follow the standard convention: XXXXXXXXXXXX_YYYYYY_ZZZ.idat being XXXXXXXXXXXX the Sentrix_ID, YYYYYY the Sentrix_Position, and ZZZ Grn or Red (corresponding, respectively, to the Red and Green signal file).

Moreover, a CSV (comma-separated) file with the annotation of the experiment should be included. It is mandatory to include the Sentrix_ID and Sentrix_Position columns that allow the software to find their respective iDAT files. Moreover, other columns should be added to reflect the different variables (e.g. sample name, health/disease, treatment/control, age, sex, hybridization day, etc.). Usually, iDATs and sample sheet have these features by default, and no additional work is required.

In this regard, 3 parameters are required in the Input tab to continue the analysis:

An aspect to take into account is that shinyÉPICo autodetects the variable types (numerical or categorical) depending on whether or not they can be coerced to numbers. For this purpose, variables are coerced to a numeric vector, and when the generated NAs are less than 75% of the total, the variable is set as numeric, and, otherwise, as categorical. Moreover, not informative categorical variables are excluded (e.g., a variable with the same value in all the samples, or with not equal values). Therefore, numbers should not be used for categorical variables in the sample sheet. Numerical variables with some NAs but less than 75% can be used in the exploratory analysis but not as covariables of the linear model, since Limma needs a design matrix without missing values.

In the next table, you can see the sample sheet of the example dataset:

knitr::kable(read.csv("./Sample_Sheet.csv"))

It includes the Sentrix_ID and Sentrix_Position mandatory columns, and also other columns relevant for the experiment such as Sample_name, Sample_group, and donor.

After selecting the proper column in each form field, and the samples to be included (in this example, all), it is possible to proceed with the next step, pressing the "Continue" button. When the application is performing an operation expected to take time, a progress bar is displayed at the bottom right of the window to inform the user that the application is busy.

Internally, shinyÉPICO uses the "read.metharray.sheet" and "read.metharray.exp" functions to read the sample sheet and load the DNA methylation data, respectively.

After loading the methylation data of the selected samples, the Normalization tab is automatically shown, where you can see an overview of the data quality and perform the array normalization.

Quality control charts

First, the quality control tab shows two useful charts to identify bad samples.

On the one hand, the QC Signal plot shows the median methylated (mMed) and unmethylated (uMed) of each sample array. When mMed or uMed of a sample is less than 10, it is considered "Suboptimal". Although this cutoff is arbitrary and it is the user who decides whether a sample is valid or not, a signal much lower than this threshold or very different from most samples indicates that there has been a problem with it. Depending on whether the signal is very far from this threshold and depending on the rest of the results, these types of samples should be excluded from the analysis. To exclude samples, you can go back to the Input tab to change the samples selected and load them again.

On the other hand, the Bisulfite conversion plot is calculated using the information of the bisulfite conversion II control probes of the 450k/EPIC arrays. When the bisulfite conversion reaction is successful, the probe of the control position will have intensity in the Red channel, whereas if the sample has unconverted DNA, the probe will have a high signal in the Green channel.

For each sample, we calculate all the Red/Green ratios for each control position, and the minimum ratio is shown in the chart. When the ratio of a sample is lower than 1.5, we flag it as "Failed".

Array Normalization

A correct array normalization is essential for the results of the subsequent steps in the workflow. shinyÉPICo can use all the normalization methods available in the minfi package:

For details about the methods, the minfi documentation and the respective publications are good references. Although depending on the experimental design and the type of changes expected a normalization method can be expected to be better than another, usually the best method should be determined empirically among the valid options.

Additionally, shinyÉPICo offers three more options in the Normalization tab:

Moreover, Using the minfi recommended value, the genomic positions of the final obtained RGChannelSet object (Raw Data) are filtered by the detection p.value (using the minfi::detectionP function) in the normalized data, removing all the positions with an average p.value higher of 0.01.

shinyÉPICo simplify the testing of different methods and options. When a normalization method is selected, clicking the "Select" button, the different charts are automatically generated.

In this example, we are going to select the Quantile normalization. We show different examples of the type of charts generated by the software.

Density plot

This chart shows the distribution of the beta values of every sample. A bimodal distribution is expected, with two peaks centered around 0 (unmethylated) and 1 (methylated) beta values. When more peaks are shown in the middle, or when the pattern of different samples is very different, it indicates problems in some step of the sample preparation or hybridization. After normalization, we expect an improved alignment of the sample patterns, with little discrepancies between them.

Raw:

Density plot of raw beta values{width=90%}

Quantile:

Density plot of quantile normalized beta values{width=90%}

Boxplot

This chart shows the box-and-whisker plot of beta values in each sample. The median and interquartile range of the samples should be similar, and more homogeneous after the normalization.

Raw:

Boxplot of raw beta values{width=90%}

Quantile:

Boxplot of quantile normalized beta values{width=90%}

SNPs heatmap

This heatmap utilizes information of a special subset of single nucleotide polymorphism (SNP) probes in the 450k and EPIC arrays. For the representation, we use the beta values of these probes (extracted using the function minfi::getSnpBeta()). The heatmap shows a column clustering, that should correspond with the donor information. For example, in this experiment, with 3 different treatments and 3 donors, we can see a clear clustering by donor.

Heatmap of the SNP probes beta values

Sex prediction

Depending on the median X and Y chromosomes intensities, we show the sex prediction generated with the function minfi::getSex(). Note that, even if the Drop X/Y Chr. option is selected, this prediction is done before that step, and this graph will not be altered.

In this example, the three donors are males, as can be observed in the chart.

Median X chromosome intensity vs Median Y chromosome intensity{width=65%}

Exploratory analysis: PCA and Correlations

In order to get a first overview of the data before the DMP/DMR calculation, we provide two more charts.

First, we show the principal component analysis (PCA). Selecting specific principal components to plot in the x and y axis is possible, as well as changing the color variable. Moreover, a table showing the percentage of variance explained in each principal component is also depicted.

Principal Component 1 vs Principal Component 2

Secondly, we also provide a heatmap of the correlations between principal components and variables. As we explained above, variable types (categorical or numerical) are guessed, and not informative variables are excluded.

In the correlations tab, a detailed table of the variables is shown, including the type (categorical variables are shown as "factor", numerical variables as "numeric", and discarded variables as "discarded").

Pearson correlation is applied to correlate principal components with numerical variables. For categorical variables, linear models (principal component ~ categorical variable) are generated and R-squared statistics are shown in the representation. Alternatively, the p.values associated with these statistical approaches can also be plotted in a heatmap.

Both PCA and correlations plots can be also useful to find possible variables affecting methylation data, in order to select covariates and interactions for DMP/DMR calculation.

Correlation value:

Correlation values of filtered variables with principal components P value:

Correlation values of filtered variables with principal components

Differentially Methylated Positions (DMP) calculation

After normalizing the methylation data, the next tab, "DMPs", will become enabled. The process of DMP calculation is divided into two parts: model generation and contrasts calculation.

Model generation

In order to calculate DMPs between groups of the variable of interest, shinyÉPICo uses the limma package. First, shinyÉPICo fits a linear model using the M values, with the limma::lmFit function (see the limma user guide for more details.)

Beta values are used in all the charts of the application, because of its easier biological interpretation: Beta values range from 0 (totally unmethylated) to 1 (totally methylated).

However, for the statistical analysis with Limma, shinyÉPICo utilizes the M values (calculated as the logit of Beta values). M values range from -Infinite to Infinite.

Beta values are not as suitable to use with limma, because they have severe heteroscedasticity for highly methylated or unmethylated positions. For that reason, M values (homoscedastic data) are used for the limma model, whereas Beta values are used for representation [@Du2010].

Several options can be set in this step:

When the linear model is generated, a diagnosis chart is plotted. It represents the square root of the standard deviation versus the average log expression of each position. We would expect that, as seen in the example, there is a flat relationship between both variables.

Mean vs Variance relationship of the linear model

Moreover, the design matrix used to generate the model is also shown:

knitr::kable(data.frame(MAC = c(1,0,1,0,0,1), MO = c(0,1,0,1,1,0), DonorB = c(0,1,0,0,0,1), DonorC=c(0,0,1,0,1,0)))

Contrasts calculation

ShinyÉPICo autodetects all the possible comparisons between the groups found in the variable of interest. The number of pairwise combinations between the different groups is calculated as follows: n! / (2! (n ― 2)!), where n is the number of groups.

Then, the application iterates over the contrasts and calculates the tables with the statistics produced by limma, using the functions limma::contrasts.fit, limma::eBayes and limma::topTable. Two more options can be set in this section:

Moreover, shinyÉPICo also calculates the differential of betas between groups, for each position. This information will be used in the next steps.

Heatmap customization

shinyÉPICo provides plenty of options to filter the statistics produced by limma and to generate a custom heatmap.

Filtering options

With these options, a summary table with the DMPs found in each contrast is shown. In this example:

knitr::kable(data.frame(contrast = "MAC-MO", Hypermethylated=2275, Hypomethylated=21, total=2296))

The differential of beta is always calculated as the subtraction of the average value from the first group to the second group (in this example, MOavg - MACavg) and the positions are assigned to the "Hypermethylated" group if the differential is greater than 0, and to the "Hypomethylated" group if the differential is lower than 0.

Group options

Data options

Clustering options

Additional options

Since heatmap plotting can be very time- and memory-consuming, we have fixed a limitation of 12.000 positions (rows) in the heatmap. If you try to plot more positions, you will see a message with this information. However, if you need a bigger heatmap, you can download the information to generate it in the Export tab, as we will discuss below

DMP heatmap generated with default options{width=70%}

DMPs Annotation

In the next tab of the DMPs section, DMPs annotation, the information about the genes associated with the DMPs is provided in a table. Moreover, the boxplot of a selected position can be plotted.

DMP associated with the A2M gene promoter, found in the MAC vs MO contrast. It shows demethylation in MAC.{width=80%}

The information on the table can be downloaded in CSV or XLSX (Excel) formats, using the buttons above it. However, only rows shown can be downloaded. It is necessary to display all the information in order to download it.

Differentially Methylated Regions (DMR) calculation

In addition to DMP calculation, shinyÉPICO also supports DMR calculation. A DMR is defined as a genomic region, including several positions, with differential methylation between groups. Instead of checking individual changes in genomic positions, DMR calculation involves the comparison of groups of genomic positions, which can be clusterized in several ways depending on the specific algorithm used.

DMR calculation of shinyÉPICO is based on the package mCSEA which searches DMRs in predefined regions: promoters, gene bodies and CpG islands.

mCSEA uses an algorithm based on the gene set enrichment analysis [@Subramanian2005]. Instead of genes, mCSEA uses a sorted list of genomic positions compared between two conditions.

Therefore, we use the eBayes result of limma of each contrast, sorted by the t-statistic, as the input of mCSEA. This implies that the parameters introduced in the limma model, including variables and covariates, are also taken into account for the calculation of DMRs, without the need of additional calculations.

mCSEA options

In the DMRs tab, some options can be set for the calculation generated by the mCSEA package:

Heatmap customization

In the DMR section, shinyÉPICo provides the same options to filter the data and customize the data as in the DMP section.

Note that the default differential of beta threshold for DMRs is 0, in contrast with DMPs (0.2). This is justified considering that DMRs are calculated using the information of the methylation data of several positions, and that small changes may also be relevant if they are consistent. With these options, a summary table with the DMRs found in each contrast is shown. In this example:

knitr::kable(data.frame(contrast = rep("MAC-MO",3), Hypermethylated=c(64,160,45), Hypomethylated=c(78,111,62),total=c(142,271,107)))

In order to calculate the differential of beta for DMRs, all the positions in DMRs are considered, and the average difference is used. The differential of beta is always calculated as the subtraction of the average value from the first group to the second group (in this example, MOavg - MACavg) and the positions are assigned to the "Hypermethylated" group if the differential is greater than 0, and to the "Hypomethylated" group if the differential is lower than 0.

All the options of the DMP heatmap are also in the DMR heatmap. The explanation of the options can be found in the DMPs section.

DMR heatmap generated with default options{width=70%}

Single DMR plot

In the Single DMR plot tab, a table with the DMRs of a selected contrast and region is shown. A DMR can be selected to be plotted in its genomic context.

Genomic plot of the A2M promoter DMR. It shows demethylation in MAC

Exporting results

In the Export tab, shinyÉPICo allows the user to download the results of the analysis. Initially, all the download buttons are disabled, and they only become enabled when the information to generate the objects is available.

R Objects

Up to 9 R Objects can be download, selecting them in the form. They are generated in RDS format, using the saveRDS function. Therefore, they can be easily read in R using the function readRDS. We are going to describe the objects:

Filtered bed files

DMPs or DMRs genomic positions can be downloaded in BED format. The list can be generated with the hypomethylated and hypermethylated lists of DMPs/DMRs in each contrast, or with the heatmap clustering (if the Row Colors option is enabled). Additionally, a list with the background (all the genomic positions or genomic regions tested) is also generated.

Moreover, the genome can also be selected (hg19 or hg38 genome). Since 450k and EPIC arrays are annotated in hg19, when hg38 is selected, shinyÉPICo uses the function rtracklayer::liftOver to convert the genomic coordinates. It should be note that, as a result of this process, some positions may not be converted, and will not appear in the final lists.

This option is convenient to produce some typical analysis before the DMP/DMR calculation, such as Gene Ontology Over-representation (with, for example, GREAT website) or motif enrichment analysis (with, for example, the HOMER application).

Workflow Report

An HTML file with all the options selected during the analysis, and most tables and charts produced, can be generated and downloaded. This report is intended as a reference that indicates all the details of the analysis that can be consulted at any time and reproduced again in shinyÉPICo if necessary.

Custom R Script

This option creates a custom R script, including the parameters selected in the analysis, the main functions that shinyÉPICo has used to produced the results and the pipeline followed, from the data importing to the DMP/DMR calculation, DMP/DMR filtering and heatmap(s) generation. The user only has to select the proper folder with the unzipped data used in the analysis (sample sheet and iDAT files), and running the script will produce the same objects as shinyÉPICo.

Heatmap(s)

Both DMP and DMR heatmaps can be downloaded in PDF format. This high quality vector format allows the user to modify heatmaps using software such as Inkscape or Adobe Illustrator.

Session info

sessionInfo()

References



omorante/shinyepico documentation built on May 11, 2023, 7:21 p.m.