Dichotomize Continuous Data Set With Labels

Description

dichotomize converts a matrix containing continuous measurements into a binary matrix.

optimizeThreshold determines optimal thresholds for dichotomization.

Usage

dichotomize(X, thresh)
optimizeThreshold(X, L, lambda.freqs, verbose=FALSE)

Arguments

X

data matrix (columns correspond to variables, rows to samples).

thresh

vector of thresholds, one for each variable (column).

L

factor containing the class labels, one for each sample (row).

lambda.freqs

shrinkage parameter for class frequencies (if not specified it is estimated).

verbose

report shrinkage intensity and other information.

Details

dichotomize assigns 0 if a matrix entry is lower than the given column-specific threshold; otherwise it assigns 1.
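
The thresholding rule can be sketched in base R as follows. This is an illustration only, not the package's implementation: the name dichotomize_sketch is hypothetical, and the sketch does not attach the "thresh" attribute that dichotomize adds to its result.

```r
# Hypothetical helper illustrating the rule: entries below the column
# threshold become 0, all other entries become 1.
dichotomize_sketch <- function(X, thresh) {
  stopifnot(is.matrix(X), length(thresh) == ncol(X))
  # sweep() compares every entry in column j of X against thresh[j]
  sweep(X, 2, thresh, FUN = ">=") * 1L
}

# two samples (rows), two variables (columns)
X <- matrix(c(1.75, 0.4,
              0.50, 0.6), nrow = 2, byrow = TRUE)
Xb <- dichotomize_sketch(X, thresh = c(1.75, 0.5))
Xb
```

An entry equal to its threshold is mapped to 1, because only values strictly lower than the threshold are mapped to 0.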

optimizeThreshold uses (approximate) mutual information to determine the optimal thresholds. Specifically, the threshold for each variable is chosen to maximize the mutual information between the response (class labels) and the dichotomized variable. The same criterion is also used in binda.ranking. For a detailed description of the dichotomization procedure see Gibb and Strimmer (2015).
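
The idea of choosing a threshold by mutual information can be illustrated with a simplified base R sketch for a single variable. It rests on several assumptions that are not taken from the package: the candidate thresholds are the observed values of the variable, the mutual information is a plain plug-in estimate from empirical frequencies (no shrinkage of class frequencies), and ties are resolved in favour of the smallest candidate. The function names mi_plugin_sketch and optimize_threshold_sketch are hypothetical.

```r
# Plug-in mutual information (in nats) between a 0/1 vector b and labels L.
mi_plugin_sketch <- function(b, L) {
  # joint relative frequencies of the binary variable and the class labels
  p <- table(factor(b, levels = c(0, 1)), L) / length(b)
  # product of the marginals, i.e. the joint distribution under independence
  expected <- outer(rowSums(p), colSums(p))
  nz <- p > 0  # empty cells contribute nothing to the sum
  sum(p[nz] * log(p[nz] / expected[nz]))
}

# Try every observed value of x as a threshold and keep the one whose
# dichotomized version shares the most information with the labels.
optimize_threshold_sketch <- function(x, L) {
  candidates <- sort(unique(x))
  mi <- vapply(candidates,
               function(t) mi_plugin_sketch(as.integer(x >= t), L),
               numeric(1))
  candidates[which.max(mi)]
}

L <- factor(c("Treatment", "Treatment", "Control", "Control"))
x <- c(1.75, 2, 1, 0.5)
best <- optimize_threshold_sketch(x, L)
best
```

In this toy example the threshold 1.75 separates the two classes perfectly, so it attains the largest mutual information among the candidates.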

Class frequencies are estimated using freqs.shrink.
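
As a rough illustration of what shrinking class frequencies means, the following base R sketch pulls the empirical frequencies toward a uniform target with a fixed, user-supplied intensity. It is an assumption-laden simplification: the helper name shrink_freqs_sketch is hypothetical, the uniform target is assumed, and the data-driven estimation of the intensity (used when lambda.freqs is not specified) is not shown.

```r
# Hypothetical helper: convex combination of a uniform target and the
# empirical (maximum likelihood) frequencies.
shrink_freqs_sketch <- function(counts, lambda) {
  stopifnot(lambda >= 0, lambda <= 1)
  p_ml <- counts / sum(counts)                        # empirical frequencies
  target <- rep(1 / length(counts), length(counts))   # uniform target
  lambda * target + (1 - lambda) * p_ml
}

# three samples in one class, one in the other, shrinkage intensity 0.5
p_shrunk <- shrink_freqs_sketch(c(3, 1), lambda = 0.5)
p_shrunk
```

With lambda = 0 the empirical frequencies are returned unchanged, and with lambda = 1 the result is the uniform target.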

Value

dichotomize returns a binary matrix.

optimizeThreshold returns a vector containing the variable thresholds.

Author(s)

Sebastian Gibb and Korbinian Strimmer (http://strimmerlab.org).

References

Gibb, S., and K. Strimmer. 2015. Differential protein expression and peak selection in mass spectrometry data by binary discriminant analysis. Bioinformatics, to appear. http://arxiv.org/abs/1502.07959

See Also

binda.ranking, freqs.shrink, mi.plugin, is.binaryMatrix.

Examples

# load binda library
library("binda")

# example data with 6 variables (in columns) and 4 samples (in rows)
X = matrix(c(1, 1, 1, 1.75, 0.4,    0,
             1, 1, 2,    2, 0.4, 0.09,
             1, 0, 1,    1, 0.5,  0.1,
             1, 0, 1,  0.5, 0.6,  0.1), nrow=4, byrow=TRUE)
colnames(X) = paste0("V", 1:ncol(X))

# class labels
L = factor(c("Treatment", "Treatment", "Control", "Control") )
rownames(X) = paste0(L, rep(1:2, times=2))

X
#          V1 V2 V3   V4  V5   V6
#Treatment1  1  1  1 1.75 0.4 0.00
#Treatment2  1  1  2 2.00 0.4 0.09
#Control1    1  0  1 1.00 0.5 0.10
#Control2    1  0  1 0.50 0.6 0.10

# find optimal thresholds (one for each variable)
thr = optimizeThreshold(X, L)
thr
#  V1   V2   V3   V4   V5   V6 
#1.00 1.00 2.00 1.75 0.50 0.10

# convert into binary matrix
# if a value is lower than its column threshold -> 0, otherwise -> 1
Xb = dichotomize(X, thr)
is.binaryMatrix(Xb) # TRUE
Xb
#          V1 V2 V3 V4 V5 V6
#Treatment1  1  1  0  1  0  0
#Treatment2  1  1  1  1  0  0
#Control1    1  0  0  0  1  1
#Control2    1  0  0  0  1  1
#attr(,"thresh")
#  V1   V2   V3   V4   V5   V6 
#1.00 1.00 2.00 1.75 0.50 0.10