correctBatchEffect: Correct a batch effect in DNA methylation data


View source: R/correctBatchEffect.R

Description

This method combines most functions of the BEclear package into one. It performs the whole process of searching for batch effects and automatically corrects them for a matrix of beta values stemming from DNA methylation data.

Usage

correctBatchEffect(data, samples, adjusted=TRUE, method="fdr",
mediansTreshold = 0.05, pvaluesTreshold = 0.01, rowBlockSize=60, 
colBlockSize=60, epochs=50, lambda = 1, gamma = 0.01, r = 10,
outputFormat="", dir=getwd(), BPPARAM=SerialParam(), fixedSeed= TRUE)
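
The two required inputs have a specific shape, described under Arguments. As a minimal sketch, the following builds a toy beta-value matrix and a matching samples data frame and checks that they agree. All ids, gene names, and values are made up for illustration, and the checks are an assumption about what is useful to verify, not something the package prescribes.

```r
# Toy inputs in the shape described under Arguments.
# All ids, names and values below are invented for illustration.
set.seed(1)

sample_ids <- paste0("s", 1:6)
gene_names <- paste0("gene", 1:4)

# Beta values lie in [0, 1]; rows are genes, columns are samples.
data <- matrix(
  runif(length(gene_names) * length(sample_ids)),
  nrow = length(gene_names),
  dimnames = list(gene_names, sample_ids)
)

# One row per sample: its id and the batch it belongs to.
samples <- data.frame(
  sample_id = sample_ids,
  batch_id = rep(c(1L, 2L), each = 3),
  stringsAsFactors = FALSE
)

# Consistency checks before handing the inputs to correctBatchEffect.
stopifnot(
  identical(colnames(samples), c("sample_id", "batch_id")),
  setequal(colnames(data), samples$sample_id),
  !is.null(rownames(data)),
  all(data >= 0 & data <= 1, na.rm = TRUE)
)
```

If one of the checks fails, stopifnot reports which condition was violated, which tends to be easier to read than an error raised deep inside the correction pipeline.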

Arguments

data

a matrix of beta values. Column names have to be the sample ids listed in "samples"; row names have to be gene names.

samples

a data frame with two columns: the first contains the sample ids, the second the corresponding batch numbers. The columns have to be named "sample_id" and "batch_id".

adjusted

whether the p-values should be adjusted; see "method" for the available adjustment methods.

method

the method used for p-value adjustment. The default is "fdr" (false discovery rate adjustment); for other available methods see the documentation of the standard R function p.adjust. See calcBatchEffects for more information.

mediansTreshold

the threshold for the median comparison values: values greater than or equal to it are regarded as batch affected, provided the criterion for p-values is also met.

pvaluesTreshold

the threshold for the p-values: values less than or equal to it are regarded as batch affected, provided the criterion for medians is also met.

rowBlockSize

the number of rows in a block if the function is run in parallel mode and/or not on the whole matrix. Set this and the "colBlockSize" parameter to 0 to run the function on the whole input matrix. See imputeMissingData, especially its details section, for more information about this feature.

colBlockSize

the number of columns in a block if the function is run in parallel mode and/or not on the whole matrix. Set this and the "rowBlockSize" parameter to 0 to run the function on the whole input matrix. See imputeMissingData, especially its details section, for more information about this feature.

epochs

the number of iterations used in the gradient descent algorithm to predict the missing entries in the data matrix. See imputeMissingData for more information.

lambda

constant that controls the extent of regularization during the gradient descent

gamma

constant that controls the extent of the shift of parameters during the gradient descent

r

length of the second dimension of variable matrices R and L

outputFormat

you can choose if the finally returned data matrix should be saved as an .RData file or as a tab-delimited .txt file in the specified directory. Allowed values are "RData" and "txt". See imputeMissingData for more information.

dir

the path to the directory in which the predicted matrix should be stored. The default is the current working directory.

BPPARAM

An instance of the BiocParallelParam class that determines how the parallelisation of the functions is evaluated.

fixedSeed

determines whether the seed should be fixed, which is important for testing
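
To make the roles of epochs, lambda, gamma and r more concrete, here is a generic sketch of latent factor imputation by gradient descent in base R. It is not BEclear's implementation: factorizeSketch is a hypothetical helper, the initialisation and the full-matrix update are assumptions chosen for brevity, and imputeMissingData may differ in all of these details.

```r
# Generic latent factor imputation by gradient descent.
# NOT BEclear's code; factorizeSketch is a hypothetical helper.
factorizeSketch <- function(D, r = 10, epochs = 50, lambda = 1, gamma = 0.01) {
  observed <- !is.na(D)
  # L is nrow(D) x r and R is ncol(D) x r, so D is approximated by L %*% t(R).
  L <- matrix(runif(nrow(D) * r, 0, 0.5), nrow = nrow(D))
  R <- matrix(runif(ncol(D) * r, 0, 0.5), nrow = ncol(D))
  for (epoch in seq_len(epochs)) {
    # Residuals on observed entries only; missing entries contribute nothing.
    E <- D - L %*% t(R)
    E[!observed] <- 0
    # Simultaneous update: gamma scales the step, lambda shrinks the factors.
    Lnew <- L + gamma * (E %*% R - lambda * L)
    R <- R + gamma * (t(E) %*% L - lambda * R)
    L <- Lnew
  }
  P <- L %*% t(R)
  # Keep observed values and fill only the missing ones with predictions.
  D[!observed] <- P[!observed]
  D
}

# Toy usage: an 8 x 5 matrix with two missing entries.
set.seed(42)
D <- matrix(runif(40), nrow = 8)
D[cbind(c(2, 5), c(1, 3))] <- NA
imputed <- factorizeSketch(D, r = 2, epochs = 100, lambda = 0.1, gamma = 0.01)
```

In this sketch a larger r gives the factors more freedom, a larger lambda pulls them towards zero, and gamma together with epochs controls how far the factors move from their random start.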

Details

correctBatchEffect

The function performs the whole process of searching for batch effects and automatically corrects them for a matrix of beta values stemming from DNA methylation data. To do so, it runs most of the functions of the BEclear package in a logical order.
First, the calcBatchEffects function calculates the median comparison values and the p-values. From these results, a summary data frame is built by the calcSummary function and a scoring table by the calcScore function. Next, the entries listed in the summary are set to NA in the input matrix by the clearBEgenes function, and the imputeMissingData function predicts the missing values. Finally, predicted entries outside the boundaries (values lower than 0 or greater than 1) are corrected by the replaceOutsideValues function.
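
The flagging, clearing and boundary steps described above can be illustrated on toy numbers in base R. This is a sketch under stated assumptions, not the package's code: the median and p-value matrices are invented, the p-values are adjusted across all gene and batch pairs at once, and out-of-range values are simply clamped to 0 or 1, which may differ from what calcBatchEffects and replaceOutsideValues actually do.

```r
# Toy illustration of the flag -> clear -> clamp steps; not BEclear's code.
genes <- c("g1", "g2", "g3")
batches <- c("b1", "b2")

# Hypothetical per-gene, per-batch median comparison values and raw p-values.
medians <- matrix(c(0.10, 0.01, 0.07,
                    0.02, 0.09, 0.00),
                  nrow = 3, dimnames = list(genes, batches))
pvalues <- matrix(c(0.001, 0.200, 0.004,
                    0.500, 0.002, 0.900),
                  nrow = 3, dimnames = list(genes, batches))

# Adjust the p-values with the "fdr" method while keeping the matrix shape.
adjusted <- pvalues
adjusted[] <- p.adjust(pvalues, method = "fdr")

# A gene/batch pair is flagged only if both criteria are met.
mediansTreshold <- 0.05
pvaluesTreshold <- 0.01
flagged <- medians >= mediansTreshold & adjusted <= pvaluesTreshold

# Set flagged genes to NA for all samples of the affected batch.
samples <- data.frame(sample_id = paste0("s", 1:4),
                      batch_id = rep(batches, each = 2),
                      stringsAsFactors = FALSE)
data <- matrix(0.5, nrow = 3, ncol = 4,
               dimnames = list(genes, samples$sample_id))
cleared <- data
for (b in batches) {
  hit_genes <- genes[flagged[, b]]
  hit_samples <- samples$sample_id[samples$batch_id == b]
  cleared[hit_genes, hit_samples] <- NA
}

# Assumed boundary correction: clamp predictions to the interval [0, 1].
predicted <- c(-0.02, 0.4, 1.03)
corrected <- pmin(pmax(predicted, 0), 1)
```

The threshold variable names reuse the spelling of the function's own arguments so that the correspondence with the Usage section stays obvious.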

Value

A list containing the following fields (for detailed information see the documentation of the corresponding functions):

medians

A data.frame containing all median comparison values corresponding to the input matrix.

pvalues

A data.frame containing all p-values corresponding to the input matrix.

summary

The summarized results of the median comparison and p-value calculation.

score.table

A data.frame containing the number of found genes and a BEscore for every batch.

cleared.data

the input matrix with all values defined in the summary set to NA.

predicted.data

the input matrix after all previously NA values have been predicted.

corrected.predicted.data

the predicted matrix after the correction for predicted values outside the boundaries.

References

\insertRef{Akulenko2016BEclear}{BEclear}

\insertRef{Koren2009BEclear}{BEclear}

\insertRef{Candes2009BEclear}{BEclear}

See Also

calcBatchEffects

calcSummary

calcScore

clearBEgenes

imputeMissingData

replaceOutsideValues

Examples

## Short-running example. For a more realistic example that takes
## more time, run the same procedure with the full BEclearData
## dataset.

## Whole procedure that has to be done to use this function.
## Correct the example data for a batch effect
data(BEclearData)
ex.data <- ex.data[31:90, 7:26]
ex.samples <- ex.samples[7:26, ]

# Note that the row and column block sizes are set to 10 only to get a
# short runtime. Otherwise, either use the default values or see the
# details section of imputeMissingData before changing these parameters.
result <- correctBatchEffect(
  data = ex.data, samples = ex.samples,
  adjusted = TRUE, method = "fdr", rowBlockSize = 10, colBlockSize = 10,
  epochs = 50, outputFormat = "RData", dir = getwd()
)

# Extract the individual results (field names as listed under Value)
medians <- result$medians
pvals <- result$pvalues
summary <- result$summary
score <- result$score.table
cleared <- result$cleared.data
predicted <- result$predicted.data
corrected <- result$corrected.predicted.data

Example output

Loading required package: BiocParallel
INFO [2021-01-29 18:14:37] Transforming matrix to data.table
INFO [2021-01-29 18:14:37] Calculate the batch effects for 4 batches
INFO [2021-01-29 18:14:38] Adjusting p-values
INFO [2021-01-29 18:14:38] Generating a summary table
INFO [2021-01-29 18:14:38] Calculating the scores for 4 batches
INFO [2021-01-29 18:14:38] Removing values with batch effect:
INFO [2021-01-29 18:14:38] 70 values ( 5.83333333333333 % of the data) set to NA
INFO [2021-01-29 18:14:38] Starting the imputation of missing values.
INFO [2021-01-29 18:14:38] This might take a while.
INFO [2021-01-29 18:14:38] BEclear imputation is started:
INFO [2021-01-29 18:14:38] block size: 10  x  10
INFO [2021-01-29 18:14:38] Impute missing data for block 1 of 12
INFO [2021-01-29 18:14:38] Impute missing data for block 2 of 12
INFO [2021-01-29 18:14:38] Impute missing data for block 3 of 12
INFO [2021-01-29 18:14:38] Impute missing data for block 4 of 12
INFO [2021-01-29 18:14:38] Impute missing data for block 5 of 12
INFO [2021-01-29 18:14:38] Impute missing data for block 6 of 12
INFO [2021-01-29 18:14:38] Impute missing data for block 7 of 12
INFO [2021-01-29 18:14:38] Impute missing data for block 8 of 12
INFO [2021-01-29 18:14:38] Impute missing data for block 9 of 12
INFO [2021-01-29 18:14:38] Impute missing data for block 10 of 12
INFO [2021-01-29 18:14:38] Impute missing data for block 11 of 12
INFO [2021-01-29 18:14:38] Impute missing data for block 12 of 12
INFO [2021-01-29 18:14:38] Replacing values below 0 or above 1:
INFO [2021-01-29 18:14:38] 0 values replaced

BEclear documentation built on Nov. 8, 2020, 8:07 p.m.