# Introduction

The MethPed classifier of pediatric brain tumors presented in the MethPed article [1] can be accessed through the MethPed package. This package includes the probes that were used for building the classification algorithm, the MethPed classifier and a published data set that we have used as an example.

# Necessary packages and installation guide

The MethPed package can be installed through Bioconductor

if (!requireNamespace("BiocManager", quietly=TRUE))
install.packages("BiocManager")
BiocManager::install("MethPed")


The MethPed classifier depends on the randomForest package for random forest algorithm. If necessary this can be installed from CRAN.

install.packages("randomForest")


If there are missing values present in the data, the impute package can be installed for missing data imputation. This is however optional and the user can use any algorithm for missing value imputation.

if (!requireNamespace("BiocManager", quietly=TRUE))
install.packages("BiocManager")
BiocManager::install("impute")


# Input data type and format for the MethPed classifier

## Input data type

Data for running MethPed is generated by the Illumina Infinium HumanMethylation 450 BeadChip arrays. MethPed assumes that the input data are beta values ($\beta$). Beta values ($\beta$) are the estimate of methylation level using the ratio of intensities between methylated and unmethylated alleles.

$$\beta~=~\frac{Methylated~allele~intensity~(M)}{(Unmethylated~allele~intensity~(U)~+~Methylated~allele~intensity~(M)~+~offset)}$$

The "offset" is chosen to avoid dividing with small values. Illumina uses a default of 100. $\beta$ are between 0 and 1 with 0 being unmethylated and 1 fully methylated.

## Input data format

MethPed assumes that the input data are in 3 specific data classes 1) ExpressionSet class, 2) matrix class and 3) data.frame class.

### ExpressionSet class

The ExpressionSet class is designed to combine several different sources of information into a single convenient structure. This type of class can be manipulated (e.g., subsetted, copied) conveniently, and is the input or output from many Bioconductor functions. Input data in ExpressionSet class is as

## Data Structure

## Data Structure
library(MethPed)
data(MethPed_sample)
MethPed_sample_v1<- MethPed_sample

head(MethPed_sample_v1)

## Data class

class(MethPed_sample_v1)


### matrix class

In the matrix class formatted data, all the values of the matrix are beta values ($\beta$). In the matrix, the probes are found in the rows and the samples in the columns. Input data in “matrix” class is as

library(Biobase)

## Data Structure

MethPed_sample_v2<-Biobase::exprs(MethPed_sample)

## Data class

class(MethPed_sample_v2)


### data.frame class

In the data.frame class formatted data, one column of the dataset are the probes and there is no order specificity of this column. This column can be anywhere in the dataset. The rest of the columns will be the beta values ($\beta$) and each column represents an individual sample. The header of the data frame is the name of the corresponding column (Probe column name i.e. “TargetID” and sample column name i.e. “Tumor.A”, “Tumor.B”). Input data in data.frame class is as

## Data Structure

MethPed_sample_v3 <- data.frame(MethPed_sample_v2)
MethPed_sample_v3$TargetID <- rownames(MethPed_sample_v3) rownames(MethPed_sample_v3) <-NULL head(MethPed_sample_v3)  ## Data class  class(MethPed_sample_v3)  # Workflow of MethPed classifier The MethPed classifier uses the Random Forest (RF) algorithm to classify unknown pediatric brain tumor samples into sub-types. The classification proceeds with the selection of the beta values ($\beta$) needed for the classification. The computational process proceeded in two stages. The first stage commences with a reduction of the probe pool or building the training probe dataset for classification. We have named this dataset as "predictors". The second stage is to apply the RF algorithm to classify the probe data of interest based on the training probe dataset (predictors). For the construction of the training probe pool (predictors), methylation data generated by the Illumina Infinium HumanMethylation 450 BeadChip arrays were downloaded from the Gene Expression Omnibus (GEO). Four hundred seventy-two cases were available, representing several brain tumor diagnoses (DIPG, glioblastoma, ETMR, medulloblastoma, ependymoma, pilocytic astrocytoma) and their further subgroups. Corresponding GEO accession number are given in the following table Table: Datasets used to build the MethPed classifier Diagnosis | GEO accession ----------------------|--------------- DIPG | GSE50022 Glioblastoma | GSE55712, GSE36278 ETMR | GSE52556 Medulloblastoma | GSE54880 Ependymoma | GSE45353 Pilocytic astrocytoma | GSE44684 The data sets were merged and probes that did not appear in all data sets were filtered away. In addition, about 190,000 CpGs were removed due to SNPs, repeats and multiple mapping sites. The final data set contained 206,823 unique probes and nine tumor classes including the medulloblastoma subgroups. K–neighbor imputation was used for missing probe data. After that, a large number of regression analyses were performed to select the 100 probes per tumor class that had the highest predictive power (AUC values). Based on the identified 900 methylation sites, the nine pediatric brain tumor types could be accurately classified using the multiclass classification algorithm MethPed [1]. The complete list of 900 probes can be observed as # List of 900 probes in predictors library(MethPed) data(MethPed_900probes) head(MethPed_900probes)  # A working example of MethPed ## Dataset without missing probe values An example dataset without missing probe beta values ($\beta$), is provided with the MethPed package in ExpressionSet class. This dataset consists of 2 patients and 468,821 probes. At first we can take a look at the data: # Loading package library(MethPed) # Loading data set data(MethPed_sample) head(MethPed_sample) class(MethPed_sample)  ### Check missing value in data Additionally, it is good to check if there are missing probe values. This will identify if any values are missing from each column/sample in the probe data set. For this operation we can use the checkNA command form the MethPed package. Running this function might require some time for large data sets. ## Check for missing value missingIndex <- checkNA(MethPed_sample) missingIndex  In this example, there are no missing values (probe names or beta values) in the data set. Now we are set to run the MethPed algorithm. ### Apply the MethPed classifier The function MethPed has the argument prob which can take two values TRUE or FALSE. The value TRUE will return the conditional probability of a sample belonging to a specific group. FALSE will return the binary classification (0 or 1) of a sample belonging to a specific group (based on maximum conditional probability). By default MethPed will consider TRUE for prob argument. During running the MethPed algorithm, additional information on data and progress of data analysis are seen. The analysis will take longer time depending on the amount of data. set.seed(1000)  # Run the MethPed classifier myClassification <- MethPed(MethPed_sample)  • First message will show the name of the probe column in the data (Default: “TargetID”). • Second message will show "OK" if at least one probe is common between input dataset and predictors (900 probes are used in the predictor [1]. See Workflow of MethPed classifier section for details). • Third message will report about missing value in the data. • Fourth message will list how many probes and samples are in the input data. • Fifth will show the progress of RF analysis. • Sixth will show how many probes are missing in the input data compared with the 900 probes in predictors. The output of the algorithm is partitioned in 6 parts. The full output can be seen by typing myClassification (Variable name defined by user) myClassification  The first part contains the name of the probe column, the second part contains the number of probes and the third part the total number of samples in the main data. Each information can be extracted as # First part myClassification$target_id
# Second part
myClassification$probes # Third part myClassification$sample


The fourth part contains the names of the probes that were included in the original classifier but are missing from the data at hand (input). This can be accessed through the command:

# Fourth part
myClassification$probes_missing  In the example, one probe that was included in the original MethPed classifier, is missing from the present data set. If a large number of probes are missing, this could potentially lead to a drop in the predictive accuracy of the model. However, as can be seen by the fifth part of the result: # Fifth part myClassification$oob_err.rate


the out-of-bag error is 1.95%; only slightly higher than the 1.7% in the original MethPed classifier. For details see the MethPed article [1].

The last output contains the predictions. We chose not to assign tumors to one or another group but to give probabilities of belonging to each group.

# Sixth part
myClassification$predictions  In the example Tumor.A preserves the highest prediction probability, which is 0.796 for subtype MB_Gr3. This should be interpreted as a conditional probability. The probability that Tumor.A belongs to the subtype MB_Gr3 for the given data is 0.796. Similarly, Tumor.B preserves the highest prediction probability, which is 0.771 for the subtype PiloAstro. ### Output summary Alternatively, the summary function can be used to get the result from the MethPed output. summary(myClassification)  If only the maximum conditional probability of samples is preferred, this can be obtained by supplying the FALSE value to the argument prob. set.seed(1000)  myClassification_max <- MethPed(MethPed_sample,prob=FALSE)  myClassification_max <- MethPed(MethPed_sample,prob=FALSE) summary(myClassification_max)  summary(myClassification_max)  ### Output visualization The output of the classification can also be visualized in a bar plot par(mai = c(1, 1, 1, 2), xpd=TRUE) mat<-t(myClassification$predictions)
mycols <- c("green",rainbow(nrow(mat),start=0,end=1)[nrow(mat):1],"red")
barplot(mat,  col = mycols,
beside=FALSE,axisnames=TRUE,
ylim=c(0,1),xlab= "Sample",ylab="Probability")
legend( ncol(mat)+0.5,1,
legend = rownames(mat),fill = mycols,xpd=TRUE, cex = 0.6)


or the plot command to used to get the same bar plot

myClassification <- MethPed(MethPed_sample,prob=TRUE)
plot(myClassification)

plot(myClassification)


If only the maximum conditional probability of samples is preferred, this can be obtained by supplying the FALSE value to the argument prob.

myClassification_max <- MethPed(MethPed_sample,prob=FALSE)
plot(myClassification_max)

plot(myClassification_max)


### Count missing probes

The name of the probes that were included in the original MethPed classifier (900 probes) but are missing from the data at hand, can be accessed by the command probeMis.

probeMis(myClassification)


## Dataset with missing probe values

In the previous section MethPed classification on a dataset without missing values were shown. In this section, MethPed classification on dataset with missing values are shown.

### Missing beta value in the data set

The dataset “MethPed_sample” provided with the package is a non-missing valued dataset. As an example we will now generate a data set with missing probe values.

# Loading dataset
data(MethPed_sample)
class(MethPed_sample)


As the provided dataset is in the ExpressionSet class, we have to extract the matrix of beta values for further process

library(Biobase)

# Loading library
library(Biobase)
# exprs function is loaded from the package Biobase
MethPed_sample_matrix<-Biobase::exprs(MethPed_sample)

MethPed_sample_matrix<-Biobase::exprs(MethPed_sample)

class(MethPed_sample_matrix)


We can check for missing value in the extracted matrix of beta values

checkNA(MethPed_sample_matrix)


There are no missing values in the extracted matrix of beta values. Now we will remove some beta values (Replace with NA) from this matrix to make a missing valued dataset.

MethPed_sample_missing<-MethPed_sample_matrix
MethPed_sample_missing[c(1,10,200),2]<-NA
MethPed_sample_missing[c(4,600,500,1000),1]<-NA


This dataset is now ready for MethPed classification.

### Missing values imputation

MethPed cannot be applied on data sets with missing value. Before applying the MethPed classifier, it’s important to check if there are missing values of probes in the dataset. To check missing values, one can use the ´checkNA´ function

checkNA(MethPed_sample_missing)


In this example there are 2 tumors with in total 7 missing beta values in the data. The classifier requires that all probes that were used for training the model are present for prediction. If this condition is not met, no prediction will be returned for the tumors with missing probes. One possible solution is to discard probes that have missing data. This might not be efficient if a large number of probe's values are missing in one or more samples.

A more feasible approach can be to impute the missing values. Here we illustrate imputation by the means of K-nearest neighbors. For technical details and relevant references we refer readers to the documentation of the impute package. It should be noted that, this package was developed for gene expression data, and not for methylation data. However, most likely the method performs well on methylation data as well [3].

The impute package depends on a random value for initiation. Therefore, to assure randomness, the eventual set of random seeds should be removed.

if(exists(".Random.seed")) rm(.Random.seed)


As mentioned before, the dataset contains 2 tumors. So the imputation will be run on two column of the methylation data.

library(impute)
imputedData <- impute.knn(MethPed_sample_missing)

# Loading library
library(impute)
# Apply imputation
imputedData <- impute.knn(MethPed_sample_missing)
# extract the imputed matrix form object "data"
imputedData <- imputedData$data # Check for missing value checkNA(imputedData)  imputedData <- imputedData$data

checkNA(imputedData)


Now we are set to run the MethPed algorithm on the imputated matrix.

set.seed(1000)

myClassification <- MethPed(imputedData)

myClassification <- MethPed(imputedData)
summary(myClassification)

summary(myClassification)


# Contact and Citation

## Citation

citation('MethPed')


# Authors information

Mohammad Tanvir Ahamed, Anna Danielsson, Szilárd Nemes and Helena Carén, University of Gothenburg, Sweden.

# Session info

sessionInfo()


# References

[1] Anna Danielsson, Szilárd Nemes, Magnus Tisell, Birgitta Lannering, Claes Nordborg, Magnus Sabel, and Helena Carén. "MethPed: A DNA Methylation Classifier Tool for the Identification of Pediatric Brain Tumor Subtypes". Clinical Epigenetics 2015, 7:62, 2015

[2] Breiman, Leo (2001). "Random Forests". Machine Learning 45 (1): 5–32. doi:10.1023/A:1010933404324

[3] Troyanskaya, O., M. Cantor, G. Sherlock, P. Brown, T. Hastie, R. Tibshirani, D. Botstein, and R. B. Altman. "Missing Value Estimation Methods for DNA Microarrays." Bioinformatics 17.6 (2001): 520-25.

## Try the MethPed package in your browser

Any scripts or data that you put into this service are public.

MethPed documentation built on Nov. 1, 2018, 4:21 a.m.