These are the helper functions called by do_disc (which is the function normally applied by an opm user). discrete converts continuous numeric characters to discrete ones. best_cutoff determines the best cutoff for dividing a numeric matrix into two categories by minimising within-group discrepancies. That is, for each combination of row group and column, it maximises the number of contained elements that are in the category in which most of the elements within this combination of row group and column are located.
## S4 method for signature 'matrix,character'
best_cutoff(x, y, ...)
## S4 method for signature 'matrix,factor'
best_cutoff(x, y, combined = TRUE, lower = min(x, na.rm = TRUE),
  upper = max(x, na.rm = TRUE), all = FALSE)
## S4 method for signature 'array'
discrete(x, ...)
## S4 method for signature 'data.frame'
discrete(x, ..., as.labels = NULL, sep = " ")
## S4 method for signature 'numeric'
discrete(x, range, gap = FALSE,
  output = c("character", "integer", "logical", "factor", "numeric"),
  middle.na = TRUE, states = 32L, ...)
x | Numeric vector or array object convertible to a numeric vector. The data-frame method first calls …
range | If a numeric vector, in non-… If … The number of clusters is set to …
gap | Logical scalar. If … If …
output | String determining the output mode: ‘character’, ‘integer’, ‘logical’, ‘factor’, or ‘numeric’. ‘numeric’ simply returns …
middle.na | Logical scalar. Only relevant in … If …
states | Integer or character vector. Ignored in … In the latter case, a single integer is interpreted as the upper bound of an integer vector starting at 1.
as.labels | Vector of data-frame indexes. See …
sep | Character scalar. See …
y | Factor or character vector indicating group affiliations. Its length must correspond to the number of rows of x.
combined | Logical scalar. If …
lower | Numeric scalar. Lower bound for the cutoff values to test.
upper | Numeric scalar. Upper bound for the cutoff values to test.
all | Logical scalar. If …
... | Optional arguments passed between the methods or, if requested, to …
One of the uses of discrete is to create character data suitable for phylogenetic studies with programs such as PAUP* and RAxML. These accept only discrete characters with at most 32 states, coded as 0 to 9 followed by A to V. For the full export one additionally needs phylo_data. The matrix method is just a wrapper that takes care of the matrix dimensions, and the data-frame method is a wrapper for that method.
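The 32-state coding can be illustrated with a minimal base-R sketch. It is not the code used by opm: the helper name to_states and the equal-width binning via base::cut are assumptions made purely for illustration.

```r
# The 32 state symbols: digits 0 to 9 followed by letters A to V
state_symbols <- c(as.character(0:9), LETTERS[1:22])

# Hypothetical helper (not part of opm): bin the values of 'x' into 'n'
# equal-width intervals spanning 'range' and label each bin with a symbol
to_states <- function(x, range, n = 32L) {
  stopifnot(n <= length(state_symbols), length(range) == 2L)
  breaks <- seq(range[1L], range[2L], length.out = n + 1L)
  bins <- cut(x, breaks = breaks, labels = FALSE, include.lowest = TRUE)
  state_symbols[bins]
}

# Real and possible range coincide: each value gets its own state
to_states(1:5, range = c(1, 5), n = 5L)
# Possible range twice as large: the values occupy only the lower states
to_states(1:5, range = c(1, 10), n = 5L)
```

Under these assumptions the two calls yield the states "0" to "4" and "0", "0", "1", "1", "2", respectively.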
The term ‘character’ as used here has no direct connection to the eponymous mode or class of R. Rather, the term is borrowed from taxonomic classification in biology, where, technically, a single ‘character’ is stored in one column of a data matrix if each organism is stored in one row. Characters are the quasi-independent units of evolution on the one hand and of phylogenetic reconstruction (and thus taxonomic classification) on the other hand.
The scoring function to be maximised by best_cutoff is calculated as follows. All values in x are divided into those larger than the cutoff and those at most as large as the cutoff. For each combination of group and matrix column the frequencies of the two categories are determined, and the respective larger ones are summed up over all combinations. This value is then divided by the frequency over the entire matrix of the more frequent of the two categories. This is done to avoid trivial solutions with minimal and maximal cutoffs, which would cause all values to be placed in the same category.
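The score described above can be sketched in base R as follows. This is an illustrative re-implementation under stated assumptions, not the code used by opm: the function name cutoff_score and the simple search over midpoints between observed values are hypothetical, and best_cutoff itself may search for the optimum differently.

```r
# Hypothetical scoring function (not opm's code)
cutoff_score <- function(x, y, cutoff) {
  stopifnot(is.matrix(x), length(y) == nrow(x))
  above <- x > cutoff # TRUE where a value is larger than the cutoff
  majority <- 0
  for (grp in unique(y)) {
    for (j in seq_len(ncol(x))) {
      vals <- above[y == grp, j]
      # frequency of the more frequent category in this group and column
      majority <- majority + max(sum(vals, na.rm = TRUE),
        sum(!vals, na.rm = TRUE))
    }
  }
  # frequency of the more frequent category over the entire matrix
  overall <- max(sum(above, na.rm = TRUE), sum(!above, na.rm = TRUE))
  majority / overall
}

x <- matrix(c(5:2, 1:2, 7:8), ncol = 2)
grps <- c("a", "a", "b", "b")

# Candidate cutoffs: midpoints between neighbouring observed values
v <- sort(unique(c(x)))
candidates <- head(v, -1L) + diff(v) / 2
scores <- vapply(candidates, function(cutoff) cutoff_score(x, grps, cutoff),
  numeric(1L))
(best <- c(maximum = candidates[which.max(scores)], objective = max(scores)))
```

For this small matrix the sketch selects a cutoff of 3.5 with a score of 2. A cutoff below all values gives a score of exactly 1, which shows how the division penalises trivial solutions.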
discrete generates a double, integer, character or logical vector or factor, depending on output. For the matrix method, a matrix composed of a vector as produced by the numeric method, the original dimensions and the original dimnames attributes of x.
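How a matrix method can wrap a vector-based method while retaining the dimensions and dimnames is sketched below. The names discretise_matrix and binarise are hypothetical and serve only as an illustration; this is not the implementation found in opm.

```r
# Hypothetical wrapper (not opm's code): apply a vector-based discretiser
# 'fun' to a matrix and restore the matrix attributes afterwards
discretise_matrix <- function(x, fun, ...) {
  stopifnot(is.matrix(x))
  result <- fun(as.vector(x), ...)
  dim(result) <- dim(x)
  dimnames(result) <- dimnames(x)
  result
}

# Toy vector-based discretiser: "1" if larger than the cutoff, else "0"
binarise <- function(v, cutoff) ifelse(v > cutoff, "1", "0")

m <- matrix(as.numeric(1:6), ncol = 2,
  dimnames = list(c("r1", "r2", "r3"), c("c1", "c2")))
(d <- discretise_matrix(m, binarise, cutoff = 3))
```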
If combined is TRUE, best_cutoff yields either a matrix or a vector: If all is TRUE, a two-column matrix with (i) the cutoffs examined and (ii) the resulting scores. If all is FALSE, a vector with the entries ‘maximum’ (the best cutoff) and ‘objective’ (the score it achieved). If combined is FALSE, either a list of matrices or a matrix. If all is TRUE, a list of matrices structured like the single matrix returned if combined is TRUE. If all is FALSE, a matrix with two columns called ‘maximum’ and ‘objective’, and one row per level of y.
Dougherty, J., Kohavi, R., Sahami, M. 1995 Supervised and unsupervised discretisation of continuous features. In: Prieditis, A., Russell, S. (eds.) Machine Learning: Proceedings of the fifth international conference.
Ventura, D., Martinez, T. R. 1995 An empirical comparison of discretisation methods. Proceedings of the Tenth International Symposium on Computer and Information Sciences, p. 443–450.
Wiley, E. O., Lieberman, B. S. 2011 Phylogenetics: Theory and Practice of Phylogenetic Systematics. Hoboken, New Jersey: Wiley-Blackwell.
Bunuel, L. 1972 Le charme discret de la bourgeoisie. France/Spain, 96 min.
base::cut stats::optimize
Other discretization-functions: do_disc
# Treat everything between 3.5 and 4.5 as ambiguous
(x <- discrete(1:5, range = c(3.5, 4.5), gap = TRUE))
stopifnot(x == c("0", "0", "0", "?", "1"))
# Treat everything between 3.5 and 4.5 as intermediate
(x <- discrete(1:5, range = c(3.5, 4.5), gap = TRUE, middle.na = FALSE))
stopifnot(x == c("0", "0", "0", "1", "2"))
# Boring example: real and possible range as well as the number of states
# to code the data have a 1:1 relationship
(x <- discrete(1:5, range = c(1, 5), states = 5))
stopifnot(identical(x, as.character(0:4)))
# Now fit the data into a potential range twice as large, and at the
# beginning of it
(x <- discrete(1:5, range = c(1, 10), states = 5))
stopifnot(identical(x, as.character(c(0, 0, 1, 1, 2))))
# Matrix and data-frame methods
x <- matrix(as.numeric(1:10), ncol = 2)
(y <- discrete(x, range = c(3.4, 4.5), gap = TRUE))
stopifnot(identical(dim(x), dim(y)))
(yy <- discrete(as.data.frame(x), range = c(3.4, 4.5), gap = TRUE))
stopifnot(y == yy)
# K-means based discretisation of PM data (prefer do_disc() for this)
x <- extract(vaas_4, as.labels = list("Species", "Strain"),
in.parens = FALSE)
(y <- discrete(x, range = TRUE, gap = TRUE))[, 1:3]
stopifnot(c("0", "?", "1") %in% y)
## best_cutoff()
x <- matrix(c(5:2, 1:2, 7:8), ncol = 2)
grps <- c("a", "a", "b", "b")
# combined optimisation
(y <- best_cutoff(x, grps))
stopifnot(is.numeric(y), length(y) == 2) # two-element numeric vector
stopifnot(y[["maximum"]] < 4, y[["maximum"]] > 3, y[["objective"]] == 2)
plot(best_cutoff(x, grps, all = TRUE), type = "l")
# separate optimisation
(y <- best_cutoff(x, grps, combined = FALSE))
stopifnot(is.matrix(y), dim(y) == c(2, 2)) # numeric matrix
stopifnot(y["a", "objective"] == 2, y["b", "objective"] == 2)
(y <- best_cutoff(x, grps, combined = FALSE, all = TRUE))
plot(y$a, type = "l")
plot(y$b, type = "l")