char.diff | R Documentation |
Calculates the character difference from a discrete matrix
char.diff(
matrix,
method = "hamming",
translate = TRUE,
special.tokens,
special.behaviours,
order = FALSE,
by.col = TRUE,
correction
)
matrix |
A discrete matrix or a list containing discrete characters. The differences are calculated between the columns (usually characters). Use |
method |
The method to measure difference: |
translate |
|
special.tokens |
optional, a named |
special.behaviours |
optional, a |
order |
|
by.col |
|
correction |
optional, an eventual |
Each method for calculating distance is expressed as a function of d(x, y) where x and y are a pair of columns (if by.col = TRUE) or rows in the matrix, n is the number of comparable rows (if by.col = TRUE) or columns between them, and i is any specific pair of rows (if by.col = TRUE) or columns.
The different methods are:
"hamming"
The relative distance between characters. This is equal to the Gower distance for non-numeric comparisons (e.g. character tokens; Gower 1966).
d(x,y) = \sum[i,n](abs(x[i] - y[i]))/n
"manhattan"
The "raw" distance between characters:
d(x,y) = \sum[i,n](abs(x[i] - y[i]))
"comparable"
The number of comparable characters (i.e. the number of tokens that can be compared):
d(x,y) = \sum[i,n]((x[i] - y[i])/(x[i] - y[i]))
"euclidean"
The euclidean distance between characters:
d(x,y) = \sqrt(\sum[i,n]((x[i] - y[i])^2))
"maximum"
The maximum distance between characters:
d(x,y) = max(abs(x[i] - y[i]))
"mord"
The maximum observable distance between characters (Lloyd 2016):
d(x,y) = \sum[i,n](abs(x[i] - y[i]))/\sum[i,n]((x[i] - y[i])/(x[i] - y[i]))
"none"
Returns the matrix with any converted and/or translated tokens.
"binary"
Returns the matrix with the binary characters.
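The formulas above can be sketched in base R for a single pair of characters. This is only an illustration under stated assumptions: sketch_distance is a hypothetical helper (not a function from the package), it expects two numeric vectors of equal length with no missing values, and it applies the formulas as written, without any token translation.

```r
# Hypothetical helper, NOT the package's implementation: applies the
# formulas listed above to one pair of complete numeric characters.
sketch_distance <- function(x, y, method = "hamming") {
  # Assumption: equal lengths and no NA, so n is simply length(x)
  stopifnot(length(x) == length(y), !anyNA(x), !anyNA(y))
  n <- length(x)
  abs_diff <- abs(x - y)
  switch(method,
    hamming    = sum(abs_diff) / n,       # relative distance
    manhattan  = sum(abs_diff),           # "raw" distance
    comparable = n,                       # number of comparable tokens
    euclidean  = sqrt(sum((x - y)^2)),    # euclidean distance
    maximum    = max(abs_diff),           # largest single difference
    stop("unknown method: ", method)
  )
}

x <- c(0, 1, 0, 1)
y <- c(0, 1, 1, 1)
sketch_distance(x, y, method = "hamming")    # 0.25
sketch_distance(x, y, method = "manhattan")  # 1
```

The two characters differ at one position out of four, which gives a raw distance of 1 and a relative distance of 1/4.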
When using translate = TRUE, the characters are translated following the xyz notation where the first token is translated to 1, the second to 2, etc. For example, the character 0, 2, 1, 0 is translated to 1, 2, 3, 1. In other words, when translate = TRUE, the character tokens are not interpreted as numeric values. When using translate = TRUE, scaled metrics (i.e. "hamming" and "gower") are divided by n-1 rather than n due to the first character always being equal to 1.
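The xyz translation can be reproduced in a few lines of base R. The sketch below is an assumption about how the relabelling could be done, not the package's own code; translate_xyz is a hypothetical name.

```r
# Hypothetical helper: relabel tokens by order of first appearance,
# so the first token seen becomes 1, the second becomes 2, and so on.
translate_xyz <- function(states) {
  seen <- unique(states[!is.na(states)])  # tokens in order of appearance
  match(states, seen)                     # NA stays NA
}

translate_xyz(c(0, 2, 1, 0))  # 1 2 3 1

# Two characters with swapped labels become identical once translated
identical(translate_xyz(c(0, 1, 0, 1)), translate_xyz(c(1, 0, 1, 0)))  # TRUE
```

This is why, with translation, characters that only differ in how their states are labelled show no difference.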
special.behaviours allows generating special rules for the special.tokens. The functions can take the arguments character, all_states, with character being the character that contains the special token and all_states being all the states of that character (which are automatically detected by the function). By default, missing and inapplicable data return NA, and polymorphisms and uncertainties return all present states.
missing = function(x,y) NA
inapplicable = function(x,y) NA
polymorphism = function(x,y) strsplit(x, split = "\\&")[[1]]
uncertainty = function(x,y) strsplit(x, split = "\\/")[[1]]
Functions in the list must be named following the special token of concern (e.g. missing), have only x, y as inputs and return a single value (that gets coerced to integer automatically). For example, the special behaviour for the special token "?" can be coded as: special.behaviours = list(missing = function(x, y) return(y)) to make all comparisons containing the special token "?" return any character state y.
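To see what these behaviour functions return, they can be called directly on a token. The sketch below only evaluates the functions in isolation; how the package dispatches them internally is not shown here, and the object names default_behaviours and custom_behaviours are hypothetical.

```r
# The default behaviours listed above, collected in a named list
default_behaviours <- list(
  missing      = function(x, y) NA,
  inapplicable = function(x, y) NA,
  polymorphism = function(x, y) strsplit(x, split = "\\&")[[1]],
  uncertainty  = function(x, y) strsplit(x, split = "\\/")[[1]]
)

# x is the token, y stands for the states of the character (assumed here)
all_states <- c("0", "1", "2")
default_behaviours$polymorphism("0&1", all_states)  # "0" "1"
default_behaviours$uncertainty("1/2", all_states)   # "1" "2"
default_behaviours$missing("?", all_states)         # NA

# A custom rule: treat "?" as any of the character's states
custom_behaviours <- list(missing = function(x, y) return(y))
custom_behaviours$missing("?", all_states)          # "0" "1" "2"
```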
IMPORTANT: Note that for any distance method, NA values are skipped in the distance calculations (e.g. distance(A = {1, NA, 2}, B = {1, 2, 3}) is treated as distance(A = {1, 2}, B = {1, 3})).
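The NA rule can be checked by hand in base R. This is a minimal sketch that uses the raw ("manhattan") sum of absolute differences as the distance; it is not the package's internal code.

```r
# Positions where either character is NA are dropped before comparing
A <- c(1, NA, 2)
B <- c(1, 2, 3)
comparable <- !is.na(A) & !is.na(B)

raw_with_na <- sum(abs(A[comparable] - B[comparable]))
raw_reduced <- sum(abs(c(1, 2) - c(1, 3)))

raw_with_na == raw_reduced  # TRUE
sum(comparable)             # 2 comparable positions remain
```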
IMPORTANT: Note that the number of symbols (tokens) per character is limited by your machine's word-size (32 or 64 bits). If you have more than 64 tokens per character, you might want to use continuous data.
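One way such a word-size limit can arise is when each state is stored as a single bit of an integer. The sketch below is purely an assumption made for illustration (the documentation above does not describe the internal encoding), and state_bit and shares_state are hypothetical names.

```r
# Assumed encoding: state k occupies bit k of an integer word
state_bit <- function(state) bitwShiftL(1L, state)  # 0 -> 1, 1 -> 2, 2 -> 4

# A polymorphism such as "0&1" is then the union (bitwise OR) of its states
poly_01 <- bitwOr(state_bit(0L), state_bit(1L))     # 3

# Two tokens are compatible if they share at least one bit
shares_state <- function(a, b) bitwAnd(a, b) != 0L

shares_state(poly_01, state_bit(1L))  # TRUE
shares_state(poly_01, state_bit(2L))  # FALSE
```

Under this kind of scheme, a word with 32 or 64 bits can only distinguish that many states, which is consistent with the limit stated above.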
A character difference value or a matrix of class char.diff
Thomas Guillerme
Felsenstein, J. 2004. Inferring phylogenies vol. 2. Sinauer Associates Sunderland.
Gower, J.C. 1966. Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika 53:325-338.
Hamming, R.W. 1950. Error detecting and error correcting codes. The Bell System Technical Journal. DOI: 10.1002/j.1538-7305.1950.tb00463.x.
Lloyd, G.T. 2016. Estimating morphological diversity and tempo with discrete character-taxon matrices: implementation, challenges, progress, and future directions. Biological Journal of the Linnean Society. DOI: 10.1111/bij.12746.
plot.char.diff, vegdist, dist, calculate_morphological_distances, daisy
## Comparing two binary characters
char.diff(list(c(0, 1, 0, 1), c(0, 1, 1, 1)))
## Pairwise comparisons in a morphological matrix
morpho_matrix <- matrix(sample(c(0,1), 100, replace = TRUE), 10)
char.diff(morpho_matrix)
## Adding special tokens to the matrix
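## (rep_len recycles the three tokens to fill the ten sampled cells without a warning)
morpho_matrix[sample(1:100, 10)] <- rep_len(c("?", "0&1", "-"), 10)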
char.diff(morpho_matrix)
## Modifying special behaviours for tokens with "&" to be treated as NA
char.diff(morpho_matrix,
          special.behaviours = list(polymorphism = function(x,y) return(NA)))
## Adding a special character with a special behaviour (count "%" as "100")
morpho_matrix[sample(1:100, 5)] <- "%"
char.diff(morpho_matrix,
          special.tokens = c("paragraph" = "\\%"),
          special.behaviours = list(paragraph = function(x,y) as.integer(100)))
## Comparing characters with/without translation
char.diff(list(c(0, 1, 0, 1), c(1, 0, 1, 0)), method = "manhattan")
# no character difference
char.diff(list(c(0, 1, 0, 1), c(1, 0, 1, 0)), method = "manhattan",
          translate = FALSE)
# all four character states are different