calculate batch effect score

Description

Returns a table with, for each batch, the number of genes that have a p-value less than or equal to 0.01 and a median difference value greater than or equal to 0.05. A score is calculated from the number of these genes and the magnitude of their median difference values; this score is divided by the total number of genes in the data and returned as "BEscore". See Details for how the score is calculated. The returned data.frame is also stored in the specified directory as an .RData file.

Usage

calcScore(data, samples, summary, dir=getwd())

Arguments

data

A matrix of beta values. Column names must be the sample IDs listed in "samples"; row names must be gene names.

samples

A data frame with two columns: the first contains the sample IDs, the second the corresponding batch IDs. The columns must be named "sample_id" and "batch_id".

summary

A summary data.frame with the columns "gene", "batch", "median" and "p-value", containing all genes that were found in the median and p-value calculations. See the calcSummary function for more details.

dir

Path to the directory in which the returned data.frame should be stored. Defaults to the current working directory.
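
As a rough illustration of the input shapes described above, the following sketch builds a toy "samples" data frame and a matching "data" matrix. All IDs, gene names and values here are made up for illustration and do not come from BEclearData; the only constraints taken from the argument descriptions are the column names "sample_id" and "batch_id" and the requirement that the matrix column names match the sample IDs.

```r
# Toy inputs only: every identifier and value below is hypothetical.
set.seed(1)

# Two columns, named exactly "sample_id" and "batch_id"
samples <- data.frame(
  sample_id = paste0("s", 1:6),
  batch_id = rep(c("b1", "b2"), each = 3),
  stringsAsFactors = FALSE
)

# Beta values lie between 0 and 1; rows are genes, columns are samples
data <- matrix(
  runif(4 * 6),
  nrow = 4,
  dimnames = list(paste0("gene", 1:4), samples$sample_id)
)

# Sanity check: every matrix column has an entry in the samples table
stopifnot(all(colnames(data) %in% samples$sample_id))
```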

Details

The returned data frame contains one column with the batch numbers, 11 columns with the number of genes found in a given range of the median difference value, and one column with the calculated BEscore. The counted genes are assumed to be batch affected because of their difference in median values and the differing distribution of their beta values. The more genes are found and the more extreme their median differences are, the more severe the batch effect is assumed to be. We suggest that no batch effect correction is needed if the BEscore for a batch is less than 0.02. BEscores between 0.02 and 0.1 lie in a "grey zone", for which we assume a batch effect that is not severe. Values beyond 0.1 clearly indicate a batch effect that should be corrected.
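
The thresholds suggested above can be applied to a vector of BEscores with a small helper. This is only a sketch: classify_be_score is a hypothetical function, not part of the package, and treating a score of exactly 0.1 as "grey zone" is an assumption, since the text does not say which side that boundary belongs to.

```r
# Hypothetical helper (not part of BEclear) applying the suggested thresholds.
classify_be_score <- function(score) {
  ifelse(score < 0.02, "no correction needed",
    # Assumption: a score of exactly 0.1 is still counted as grey zone
    ifelse(score <= 0.1, "grey zone", "batch effect")
  )
}

# Illustrative scores, one per category
labels <- classify_be_score(c(0.005, 0.05, 0.3))
print(labels)
```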

The 11 columns with the gene counts cover median difference values in the ranges >= 0.05 to < 0.1; >= 0.1 to < 0.2; >= 0.2 to < 0.3; and so on, up to a limit of 1.

The BEscore is the sum of the weighted gene counts divided by the total number of genes. The weighting multiplies the number of genes found between 0.05 and 0.1 by 1, between 0.1 and 0.2 by 2, between 0.2 and 0.3 by 4, between 0.3 and 0.4 by 6, and so on.
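
The binning and weighting described above can be reproduced by hand for a single batch. The sketch below is not the package's implementation. It rests on several assumptions: the median difference values and the gene total are invented, all listed genes are taken to have already passed the p-value filter, the intervals are built from the breakpoints 0.05, 0.1, 0.2, ..., 1 (which gives ten intervals), and the weights after 6 are assumed to keep increasing in steps of 2.

```r
# Hypothetical median difference values for one batch (illustrative only);
# 0.02 lies below the 0.05 cut-off and is therefore not counted.
med_diff <- c(0.06, 0.07, 0.15, 0.25, 0.35, 0.02)
n_genes <- 100  # assumed total number of genes in the data matrix

# Breakpoints as described: [0.05, 0.1), [0.1, 0.2), ..., [0.9, 1]
breaks <- c(0.05, seq(0.1, 1, by = 0.1))

# right = FALSE makes intervals left-closed; include.lowest = TRUE closes
# the last interval at 1. Values outside the breaks become NA and drop out.
bins <- cut(med_diff, breaks = breaks, right = FALSE, include.lowest = TRUE)
counts <- as.vector(table(bins))

# Weights 1, 2, 4, 6 as stated; continuing in steps of 2 is an assumption
weights <- c(1, seq(2, by = 2, length.out = length(counts) - 1))

# Weighted sum of counts divided by the total number of genes
be_score <- sum(counts * weights) / n_genes
print(be_score)
```

With these toy values, two genes fall into the first interval and one each into the next three, so the weighted sum is 2*1 + 1*2 + 1*4 + 1*6 = 14 and the resulting score is 14 / 100 = 0.14.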

Value

A data.frame containing, for each batch, the number of genes assumed to be batch affected and a BEscore. The data.frame is also stored in the specified directory as an .RData file.

See Also

calcMedians, calcPvalues, calcSummary, correctBatchEffect

Examples

## Short-running example. For a more realistic example that takes
## more time, run the same procedure with the full BEclearData
## dataset.

## Full procedure required before this function can be used.
data(BEclearData)
ex.data <- ex.data[31:90,7:26]
ex.samples <- ex.samples[7:26,]

# Calculate median difference values and p-values from the example data
med <- calcMedians(data=ex.data, samples=ex.samples, parallel=FALSE)
pvals <- calcPvalues(data=ex.data, samples=ex.samples, parallel=FALSE,
    adjusted=TRUE, method="fdr")
    
# Summarize p-values and median differences for batch affected genes 
sum <- calcSummary(medians=med, pvalues=pvals)

# Calculate the score table
score.table <- calcScore(data=ex.data, samples=ex.samples, summary=sum,
    dir=getwd())