knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Introduction

Data Processing and QC

  1. Data loading

  2. Checking Internal Controls

  3. Checking Probe Quality

  4. Data normalization (BMIQ)

  5. Probe filtering

Data prediction

Example

Introduction

UniD is short for Unified Diagnostic Platform of gliomas, which is designed to predict other genomic information using Illumina Infinium DNA methylation microarray data. This package includes two section of function: data processing QC and genomic information prediction.

In the data processing and quality control section:

In the genomic information prediction section: DNA methylation data generated from both HumanMethylation 450k and HumanMethylation EPIC platform can be used for prediction. With DNA methylation data, UniD.pred() can predict:

Back to Top

Data processing and QC

1. Data loading

Raw .idat files should be stored in one folder with a .csv file which includes the information of sample, their corresponding sentrix ID and sentrix position. This sample sheet should be saved in specific format with , as the delimiter. For example (as shown in textEdit):

[Header],,,,,,
Investigator Name, Jie Yang,,,,,
Project Name, Example,,,,,
Experiment Name, test,,,,,
Date,01-01-1990,,,,,
[Data],,,,,,
Sample_Name,Sample_Well,Sample_Plate,Sample_Group,Pool_ID,Sentrix_ID,Sentrix_Position
sample1,,,,,1001001001,R01C01
sample2,,,,,1001001001,R02C01
sample3,,,,,1001001001,R03C01
sample4,,,,,1001001001,R04C01

A list with all necessary information will be created using UniD.load(). For example:

loading <- UniD.load(dataDir = "~/Desktop/IDAT/", outDir = "~/Desktop/output/",
arrayType = "450k", write = T)

Back to Top

2. Checking Internal Controls

Different categories of internal control were hold within each beadchip. Based on the criteria provided in Beadchip Control Reporter (BACR), each following categories will be checked and some of them have more than one probes:

However, it is not recommended to excludes sample which failed any categories. However, they could obviously be used to check the experiment quality especially when something goes wrong. A table with all the sample QC for those categories will be returned with function UniD.intctl(), as shown in example:

samQC <- UniD.intctl(loading = loading, dataDir = "~/Desktop/IDAT/",
outDir = "~/Desktop/output/", arrayType = "450k", sampleType = "FFPE",
write = T)

Back to Top

3. Checking Probe Quality

Probes will be checked for detection P-value detP and beadcount bc in a sample-wise manner which means each sample are independent when being evaluated.

The detection P-value is calculated by comparing each probes to the general backgroun probes. Probes failed when they have a detP > 0.05 which means they are not significant detectable comparing to the background signal. The probe will be set as missing value. The beadcount is the number of bead for for probes. If the bc < 3, that specific probe is failed and will be set as missing value for following analysis.

Beta.raw will be generated with the function UniD.dataqc with columns as samples and rows as probes. Missing values were included and data was represented using Beta value.

Beta.raw <- UniD.dataqc(loading = loading, outDir = "~/Desktop/output/",
detP.cut = 0.05, bc.cut = 3, arrayType = "450k", write = T)

Back to Top

4. Data normalization (BMIQ)

Illumina HumanMethylation 450k and HumanMethylation EPIC platform have both Type I and Type II probes on the microarray while HumanMethylation 27k. Due to the chemical differences between the two types of probes exist in the beadchip, Type I and Type II probes have different value distribution. The difference should be OK if all samples were run on the same platform. However, if we want to compare the different samples between HM450k and HM27k, data should be normalized before any comparison (especially because many samples from TCGA were run on both HM27k and HM450k).

Many normalization method is available right now, including Beta-mixture quantile normalization (BMIQ), subset-quantile with array normalizaton SWAN, peak-based normalization PBC, functional normalization, and so on. Because this package is focus on the genomic information prediction, therefore, we only provide one normalization method BMIQ here, the gene expression subtype prediction need the specific BMIQ normalized data as input.

This normalization step is recommended to be processed before the probe filtering (UniD.probefilter). Because the normalization depends on the beta value distribution of all Type I and Type II probes exist. If the probe filtering processed first, a lot of probes may be filtered out and the normalization process will be easily biased.

Beta.BMIQ <- UniD.BMIQ(Beta.raw, outDir = "~/Desktop/output/",
detP.cut = 0.05, bc.cut = 3, arrayType = "450k", write = T)

Back to Top

5. Probe filtering

After we generated Beta.raw and Beta.BMIQ, we could use the UniD.probefilter to do the probe filtering. Probes can be optionally filtered out if they belonging to the following categories. Each filter is controled by one argument in the function of UniD.probefilter().

Beta clean <- UniD.probefilter(Beta.raw, outDir = "~/Desktop/output/",
filterXY = T, filterSNPHit = T, filterMultiHit = T, filterNonCG = F,
filterNonEpic = T, arrayType = "450k", filterSample = T, filterSample.cut
= 0.1, filterProbe = F, write = T)

After UniD.probefilter, the Beta.clean is ready for use as data frame.

Back to Top

Data prediction

All prediction models were compiled within one function UniD.pred(). Each argument control the prediction for each genomic information as shown below:

However, what need to be emphasize here is we have two different input data sets. For Pred.ExpressSubtype, we need to use the Beta.BMIQ which is the beta value normalized by BMIQ method. For all other predicton model, Beta.clean or Beta.raw is OK.

Back to Top

Examples

The example data set generated from EPIC platform is downloaded from the Illumina website, the demo data set https://support.illumina.com/array/array_kits/infinium-methylationepic-beadchip-kit/downloads.html. If processing the 450k data, the general data processing workfow is the same as EPIC data set, but need to change the arrayType to 450k and may set the filterNonEpic as TRUE.

0. One step function

For usage convience, all the functions can be called one by one with one function. In this function, all arguments will use the default values. Use with careful.

pred <- UniD(dataDir = "~/Desktop/test_450k/", outDir = "~/Desktop/test_450k/", arrayType = "450k", write = T, sampleType = "other")

1. Data loading

loading <- UniD.load(dataDir = "./inst/extdata/",
                     outDir = "./data/", arrayType = "EPIC",
                     write = T)
#R log
====Data Loading Start====
Package version of UniD is: 0.0.0.9000
Loading data from "./inst/extdata/"
[read.metharray.sheet] Found the following CSV files:
[1] "./inst/extdata//sample_sheet_EPIC.csv"
loading(rgSet, Mset, detP) saved as: ./data//loading.Rdata
====Data Loading Finsihed====

2. Checking Internal Controls

samQC <- UniD.intctl(loading = loading, dataDir = "./inst/extdata/",
                     outDir = "./data/", arrayType = "EPIC",
                     sampleType = "other", write = T)
#R log
====Internal Control Checking Start====
Failed sample in Internal Control:

Restoration_green :
staining_green : 
staining_red : 
extension_green : 
extension_red : 
hybridization_green_HM : 
hybridization_green_ML : 
targetRemoval_green1 : 
targetRemoval_green2 : 
bisulfite_conversion_I_green1 : 
bisulfite_conversion_I_green2 : 
bisulfite_conversion_I_red1 : 
bisulfite_conversion_I_red2 : 
bisulfite_conversion_II_1 : 
bisulfite_conversion_II_2 : 
specificity_I_green : 
specificity_I_red : 
specificity_II_1 : 
specificity_II_2 : 
nonpolymorphic_green : 
nonpolymorphic_red :

Internal Controls data save as: ./data//samQ.Rdata
Internal Controls checking results saved as: ./data//samQC_Internal.Control.csv
====Internal Control Checking Finished====

3. Checking Probe Quality

Beta.raw <- UniD.dataqc(loading = loading, outDir = "./data/",
                        detP.cut = 0.05, bc.cut = 3, arrayType = "EPIC", write = T)
#R log
====Sample-wise Probe Quality Assessment Start====
[1] "The failed fraction per sample (failed detP and bc may overlap): "
        Fail.Frac.detP Fail.Frac.beadcount Fail.Frac.NA
Sample1    0.001067099         0.001690054  0.002669478
Sample2    0.001085557         0.001621991  0.002609490
Sample3    0.000711784         0.001390113  0.002070749
The raw Beta value were saved as: ./data//UniD_Beta.raw.Rdata
The fraction of failed probes per sample saved as: ./data//Failed_probe_Fraction.csv
====Sample-wise Probe Quality Assessment Finished====

4. Data normalization (BMIQ)

Beta.BMIQ <- UniD.BMIQ(Beta.raw = Beta.raw, arrayType = "EPIC",
                       outDir = "./data/", write = T)
#R log
====BMIQ Normalization Start====
[1] "BMIQ on sample: Sample1"
Sample1 has missing value: 2314
[1] "Fitting EM beta mixture to type1 probes"
[1] Inf
[1] 0.004261074
[1] 0.005484627
[1] 0.005832957
[1] 0.005638393
[1] "Done"
[1] "Fitting EM beta mixture to type2 probes"
[1] Inf
...
...
[1] "Done"
[1] "Start normalising type 2 probes"
[1] "Finished for sample Sample3"
BMIQ normalized Beta value saved as: ./data//UniD_Beta.BMIQ.Rdata
====BMIQ Normalization Finished====

5. Probe filtering

Beta.clean <- UniD.probefilter(Beta.raw = Beta.raw, outDir = "./data/",
                               filterXY = T, filterSNPHit = T, filterMultiHit = T,
                               filterNonCG = F, filterNonEpic = F, arrayType = "EPIC",
                               filterSample = T, filterSample.cut = 0.1, filterProbe = F,
                               write = T)
#R log
====Probe Filter Start====
Delete 19681 probes on ChrX/Y for EPIC Platform.
Delete 78720 probes affected by SNP for EPIC Platform. (Zhou W, Nucleic Acids Research, 2017)
Delete 63 probes mapped to multiple site for EPIC Platform. (Nordlund J, Genome Biology, 2013)
[1] "The failed fraction per sample (after all probe filters): "
        Fail.Frac.NA
Sample1  0.002178632
Sample2  0.002108354
Sample3  0.001668463

Final data matrix is: 3 samples * 768372 probes.
====Probe Filter Finish====

6. Data prediction

Pred <- UniD.pred(inputdata = Beta.raw, inputdata.BMIQ = Beta.BMIQ, InputValueType = "B",
                  Pred.IDH = T, Pred.1p19q = T, Pred.ATRX = T, Pred.TERTp = T,
                  Pred.ExpressSubtype = T, Pred.MGMTExpress = F,
                  outDir = "./data/")
#R log
====Biomarker Prediction Start====
#Note: inputdata and inputdata.BMIQ must with columns as samples
          and rows as probes
#Note: inputdata.BMIQ must be Beta value format

##Predict Chromosome 1p19q co-deletion Start##
Change B value to M value
No missing value for predictor.
#Predict Chromosome 1p19q co-deletion Finish##

##Predict IDH mutation status Start##
Change B value to M value
KNN used to impute 1 missing value
Imputation Done (Check missing value fraction for each sample)
##Predict IDH mutation status Finish##

##Predict ATRX mutation status Start##
Change B value to M value
KNN used to impute 4 missing value
Imputation Done (Check missing value fraction for each sample)
##Predict ATRX mutation status Finish##

##Predict TERT promoter mutation status Start##
Change B value to M value
KNN used to impute 11 missing value
Imputation Done (Check missing value fraction for each sample)
##Predict TERT promoter mutation status Finish##

##Predict TCGA Gene Expression subtype Start##
Change B value to M value
No missing value for predictor.
The following `from` values were not present in `x`: 1, 3
##Predict TCGA Gene Expression subtype Finish##

Predicted result was saved as: ./data//UniD_Biomarker.Pred.csv
====Biomarker Prediction Finish====

Back to Top



JieYang031/UniD documentation built on May 5, 2021, 5:16 p.m.