DNA methylation is a well-studied epigenetic mark on the genome. This modification is used in epigenome-wide association studies promising improvement to our understanding of many diseases, profiled by DNA methylation arrays -the Illumina Infinium 450K and EPIC(850K) methylation arrays are the most common ones- with high-quality results and large sample sizes, coupled with decreasing costs and increasing coverage of DNA methylation microarrays. There are various software tools (minfi, methylumi, wateRmelon, bigmelon,..) and statistical methods to analyse pipelines and to process microarray data based on the R and Bioconductor. Each package contains many tools for a more in-depth analysis. MelonProbes is one of those packages specialised for filtering bad and low-quality probes for potential problems. MelonProbes is being developed using S4 object-oriented programming which enables user-friendly, flexible and well implemented tools. MelonProbes aims to filter probes by their bead counts, beta values, detection p-values and other parameters.
DNA methylation is defined as the addition of methyl group (CH~3~) to a CG dinucleotide of the DNA (Jones, 2012) and widely studied under epigenomic wide associated studies due to its importance on development and diseases. The most frequent type of DNA methylation in humans is 5-methylcytosine (5mC) which affects 70-80% of Cytosines followed by Guanines (CpGs) in the genome (Ehrlich et al., 1982). It can be analysed in several ways but the Illumina HumanMethylation450 array and MethylationEPIC array dominate the epigenetic community. HumanMethylation450 and MethylationEPIC arrays contain more than 450,000 and 850,000 probes, respectively. EPIC array covers more than 90% of 450K array content and has a comprehensive genome-wide coverage (Pidsley et al., 2016).
These arrays use two different types of chemical assays; Infinium I and Infinium II. The Infinium I and II probes are 50 base pair in length which detect methylation of bisulfite treated DNA with different mechanisms. Infinium I array uses paired probe technology designed for each CpG, one for methylated and the other one for an unmethylated sequence. On the other hand, the Infinium type II array technology consists of only one probe with red or green fluorescence-labelled single base extension pairing with cytosine or thymidine of the hybridized genomic DNA molecule. The Infinium I array approach is a single-channel microarray. However, the Infinium II assay has a two-colour readout. The difference in chemistries of Infinium type I and II probes need to be taken into consideration during data processing. Therefore, it is strongly recommended to filter out probes which are cross-reactive probes or common single nucleotide polymorphisms (SNPs) containing probes or displaying high average intensities during data preprocessing.
The percentage of 8.6-25 of the Infinium HumanMethylation450 probes have been classified as non-specific or cross-reactive (Price et al., 2013; Zhang et al., 2012). These probes are problematic because they are likely to co-hybridize with non-specific CpG sites and tend to result in incorrect methylation measurements.
Some (4.3%) of Infinium HumanMethylation450K probes overlap common single nucleotide polymorphisms on the genome. Genetic polymorphism of an underlining sequence can affect signal intensities of arrays (Chen et al., 2013). Arrays are likely to represent a genotype of a sample rather than the actual DNA methylation level of targeted CpG sites. Therefore, to filter out probes containing common SNPs is important for array analysis.
Both cross-reactive and SNPs containing probes can cause potential invalid and spurious results. However, there are other probes that can cause problems. Some probes have high-intensity of β-values (defined as a ratio of methylated signal over total signal of methylated+unmethylated+100) which is close to 0.5. These probes have a tendency to result inconsistent DNA methylation levels when compared with other approaches (Dedeurwaerder et al., 2014). It is suggested that probes representing high intensities from any parameter should be filtered out before downstream analysis. To do that, we have used Pidsley’s (Pidsley et al., 2016) cross-reactive and SNP overlapping probe lists to filter problematic probes based on their β-values, detection p-values and bead counts. MelonProbes is designed as user-friendly and easy to manipulate.
The input data to this package can be RGChannelSet, RGChannelsetExtended or Methylumiset class, or β-values.
1.Installation
For installation of the melonProbes package, the devtools package should be installed. You can install both devtools and melonProbes packages by using the following commands:
if(!requireNamespace("devtools", quietly = TRUE)) install.packages("devtools") devtools::install_github("gizemaliskan/melonProbes")
2.Depends
melonProbes package depends on minfi, methylumi,wateRmelon, Biobase, and methods packages.
3.Trying it out
library(melonProbes)
It is the only function of the melonProbes package. It has a function to test Illumina HumanMethylation BeadChip microarray probes for potential issues. The testProbes function can works with RGChannelSet, RGChannelsetExtendedand Methylumiset classes, and β-values.
#Generic Function testProbes(betas, manifest = c("450k", "EPIC"), beadcounts = NULL, detection = NULL, nb = 0.2, np = 0.2, nvar = 0.5, ot, nbCount = 3, nbThresh = 0.05, pvCount = 0.05, pvThresh = 0.01, nvarThresh = 0.05)
library(minfiData) data("RGsetEx") #RGChannelSet class dataset from minfiData test_RG <- testProbes(RGsetEx)
No beadCounts available 144 of 4998 have a detection p > 0.05 in more than 1% of samples. 131 of 2917 are ranked in the bottom 5th percentile of Type I probes. 131 of 6951 are ranked in the bottom 5th percentile of Type II probes
head(test_RG)
DetectionP Variation cg00050873 FALSE FALSE cg00212031 FALSE FALSE cg00213748 FALSE FALSE cg00214611 FALSE FALSE cg00455876 FALSE FALSE cg01707559 FALSE FALSE
data("melon") #Sample methylumiSet class dataset from wateRmelon package betas <- betas(melon) #extracting beta values from melon object test_betas <- testProbes(betas, ot = ot)
0 of 18 are ranked in the bottom 5th percentile of Type I probes. 0 of 61 are ranked in the bottom 5th percentile of Type II probes. #how many type I and type II show low variation
head(test_betas)
Variation cg00000029 FALSE cg00000108 FALSE cg00000109 FALSE cg00000165 FALSE cg00000236 FALSE cg00000289 FALSE
```{R, eval= FALSE} library(minfiData) library(melonProbes)
toydat <- RGsetEx.sub
class(toydat) <- "RGChannelSetExtended"
nbeads <- matrix(as.integer(rnorm(1938,mean=5, sd=1)), 1938,6)
nbeads[nbeads<0] <-0
assays(toydat, withDimnames= FALSE)$NBeads <- nbeads
test_ToyData <- testProbes(toydat)
7 of 25 have a beadcount <= 3 in more than 5% of samples.
148 of 250 have a detection p > 0.05 in more than 1% of samples.
0 of 0 are ranked in the bottom 5th percentile of Type I probes.
0 of 0 are ranked in the bottom 5th percentile of Type II probes.
```r
head(test_ToyData)
DetectionP Variation cg00050873 FALSE FALSE cg00212031 FALSE FALSE cg00213748 FALSE FALSE cg00214611 FALSE FALSE cg00455876 FALSE FALSE cg01707559 FALSE FALSE
library(melonProbes) library(wateRmelon) data("melon") #sample methylumiSet class dataset from wateRmelon package test_melon <- testProbes(melon)
No beadCounts available 0 of 18 are ranked in the bottom 5th percentile of Type I probes. 0 of 61 are ranked in the bottom 5th percentile of Type II probes.
test_melon
Object Information:
MethyLumiSet (storageMode: lockedEnvironment)
assayData: 6 features, 12 samples
element names: Avg_NBEADS_A, Avg_NBEADS_B, BEAD_STDERR_A, BEAD_STDERR_B, betas, Intensity, methylated, pvals, unmethylated
protocolData: none
phenoData
sampleNames: 6057825008_R01C01 6057825008_R01C02 ... 6057825008_R06C02 (12 total)
varLabels: sampleID label sex
varMetadata: labelDescription
featureData
featureNames: cg00000029 cg00000108 ... cg00000289 (6 total)
fvarLabels: TargetID ProbeID_A ... Variation (39 total)
fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
Annotation:
Major Operation History:
submitted finished command
1 2012-10-17 14:23:16 2012-10-17 14:23:20 methylumiR(filename = "fr2.txt")
2 2012-10-17 17:11:19 2012-10-17 17:11:20 Subset of 46 samples.
3 2012-10-17 17:11:48 2012-10-17 17:11:48 Subset of 12 samples.
4 2020-09-02 09:49:49 2020-09-02 09:49:49 Filtered by testProbes method (melonProbes)
5 2020-09-02 09:50:01 2020-09-02 09:50:03 Subset of 6 features.
Attention We strongly recommend melonProbes users modify the testProbes function parameters according to their data's nature for a more comprehensive and precise analysis.
Changing the function behaviour The testProbes function has highly flexible and easily adaptable nature. It naturally allows multiple behaviours under one function. In the following examples, the testProbes function shows different behaviours.
test <- testProbes(toydat)
7 of 25 have a beadcount <= 3 in more than 5% of samples. 148 of 250 have a detection p > 0.05 in more than 1% of samples. 0 of 0 are ranked in the bottom 5th percentile of Type I probes. 0 of 0 are ranked in the bottom 5th percentile of Type II probes.
test <- testProbes(toydat, np=0, nb=0, nvar=0) #test all probes in dataset, not only ones we consider to be bad
83 of 553 have a beadcount <= 3 in more than 5% of samples. 258 of 590 have a detection p > 0.05 in more than 1% of samples. 15 of 262 are ranked in the bottom 5th percentile of Type I probes. 15 of 153 are ranked in the bottom 5th percentile of Type II probes.
test <- testProbes(toydat, nbCount=0, nbThresh=0, pvCount=0, nvarThresh=0) #flag probes selected by nb, np, and nvar
0 of 25 have a beadcount <= 0 in more than 0% of samples. 179 of 250 have a detection p > 0 in more than 1% of samples. 0 of 0 are ranked in the bottom 0th percentile of Type I probes. 0 of 0 are ranked in the bottom 0th percentile of Type II probes.
Betas It is the beta values containing matrix for generic function but it can be methylumiSet and raw data objects from minfi either RGChannelSet or RGChannelSetExtended.
Manifest Specify which array is used
Bead counts Microarrays contain thousands of beads and each bead is coated with multiple copies of an oligonucleotide probe which DNA sequences can hybridize. Increasing the number of beads in arrays also results in more reliable signals from that probe. Therefore, in practice, any signal lower than a certain threshold or if a signal has lower proportion compared to the total number of samples, it means that probe has failed. Default parameter is NULL which contains bead counts. Bead counts values are used in calculation of nb.
Detection p-values represent an error of signal obtained from a probe to the background signal. Bead counts and detection p-values in the testProbes function are set to NULL. These values are going to be used in calculations of np parameters.
nb is the percentage of dataset a probe has failed in bead counts. The threshold for nb is 0.2 meaning that probes failed in more than 20% of datasets will be filtered out.
np is the percentage of dataset a probe has failed in detection p-value. Probes failed in more than 20% of datasets in terms of detection p-values will be filtered out.
nvar sample variance is used to filter out low variance having probes. Normally, a variation of probes in the CpG site between samples is expected. Certain CpG sites have high variation in some diseases and different tissue types and low variation on these sites need to be explored. Nvar parameter is adjusted to 0.5 which means probes showing a low variation for more than 50% of the time are removed. The message stated “# of # are ranked in the bottom 5th percentile of type I/II probes.” clarifies how many type I and type II probes were determined within the 5%.
ot is a character vector representing the types of probes as I or II.
nbCount refers to physical bead counts to test probes determined as lower than 3.
nbThresh is a proportion of the sample. The TestProbes function filters probes when the nbCount is less than 3 (< 3) and nbThresh > 5% of the samples.
pvCount is the detection p-value to test probes. Default is 0.05 (5%).
pvThresh is the detection p-value threshold, corresponding to > 0.01 value. Probes are failed when they show > 0.05 detection p-value in > 1% of samples. For the detection p-value containing objects, the testProbes function shows a message of "# of # have a detection p > 0.05 in more than 1% of samples." to inform the investigators about how many probes failed in the detection p-value.
nvarThresh is the density of variation a probe can be in.
The testProbes function produces a logical list of analysis representing which probes failed in and the list is stored differently in different object classes. TRUEs are for failed probes.
Beta values: when the input data is beta values, the return object is a logical list of probes.
RGChannelSetExtended: the return is the list of both testProbes and detection P value analysis.
MethylumiSet: the return is methylumiSet class object. Analysis is stored under featureData@data$Variation.
RGChannelSet: if input data class is RGChannelSet, the return object is a logical list of testProbes and detection P value analysis.
The testProbes function does not filter out failed probes automatically. The users can remove bad or problematic probes by subsetting the object. Here, there is an example of removing problematic probes:
data("melon") #sample dataset from wateRmelon package test <- testProbes(melon) out <- test[which(test@featureData@data$Variation==FALSE),]
The melonProbes package does not complete its developing process. Few improvements will be covered soon. First, returns of the testProbes function will be developed and new return options will be coded. Second, probes will be checked whether they are in HW equilibrium or not. Finally, the testProbes function will be adapted to the bigmelon package for large datasets.
Chen, Y. A., Lemire, M., Choufani, S., Butcher, D. T., Grafodatskaya, D., Zanke, B. W., Gallinger, S., Hudson, T. J. and Weksberg, R. (2013) Discovery of cross-reactive probes and polymorphic CpGs in the Illumina Infinium HumanMethylation450 microarray. Epigenetics, 8, 203-9.
Ehrlich, M., Gama-Sosa, M. A., Huang, L. H., Midgett, R. M., Kuo, K. C., McCune, R. A. and Gehrke, C. (1982) Amount and distribution of 5-methylcytosine in human DNA from different types of tissues of cells. Nucleic Acids Res, 10, 2709-21.
Jones, P. A. (2012) Functions of DNA methylation: islands, start sites, gene bodies and beyond. Nat Rev Genet, 13, 484-92.Chen, Y. A., Lemire, M., Choufani, S., Butcher, D. T., Grafodatskaya, D., Zanke, B. W., Gallinger, S., Hudson, T. J. and Weksberg, R. (2013) Discovery of cross-reactive probes and polymorphic CpGs in the Illumina Infinium HumanMethylation450 microarray. Epigenetics, 8, 203-9.
Dedeurwaerder, S., Defrance, M., Bizet, M., Calonne, E., Bontempi, G. and Fuks, F. (2014) A comprehensive overview of Infinium HumanMethylation450 data processing. Brief Bioinform, 15, 929-41.
Pidsley, R., Zotenko, E., Peters, T. J., Lawrence, M. G., Risbridger, G. P., Molloy, P., Van Djik, S., Muhlhausler, B., Stirzaker, C. and Clark, S. J. (2016) Critical evaluation of the Illumina MethylationEPIC BeadChip microarray for whole-genome DNA methylation profiling. Genome Biol, 17, 208.
Price, M. E., Cotton, A. M., Lam, L. L., Farré, P., Emberly, E., Brown, C. J., Robinson, W. P. and Kobor, M. S. (2013) Additional annotation enhances potential for biologically-relevant analysis of the Illumina Infinium HumanMethylation450 BeadChip array. Epigenetics Chromatin, 6, 4.
Zhang, X., Mu, W. and Zhang, W. (2012) On the analysis of the illumina 450k array data: probes ambiguously mapped to the human genome. Front Genet, 3, 73.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.