similarity: Compute semantic similarity metrics between terms
In xu-hong/rphenoscape: Semantically Rich Phenotypic Traits from the Phenoscape Knowledgebase

similarity

R Documentation

Compute semantic similarity metrics between terms

Description

The Tanimoto similarity ST is computed according to the definition for bit vectors (see Jaccard index at Wikipedia). For weights W_i \in \{0, 1\} it is the same as the Jaccard similarity. The Tanimoto similarity can be computed for any term vectors, but for 1 - ST to be a proper distance metric satisfying the triangle inequality, M_{i,j} \in \{0, W_i\} must hold.
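The bit-vector definition can be sketched in base R as follows. This is an illustration only, not necessarily how the package implements it. The helper `tanimoto_sketch()` and the toy matrix `M` (rows are subsumer classes, columns are terms) are hypothetical names introduced for this example.

```r
# Toy binary subsumer matrix (hypothetical data): rows = subsumers, columns = terms
M <- matrix(c(1, 1, 1,
              1, 1, 0,
              1, 0, 0,
              0, 0, 1),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("class", 1:4),
                            c("termA", "termB", "termC")))

# Hypothetical helper: S_T(a, b) = a.b / (|a|^2 + |b|^2 - a.b)
tanimoto_sketch <- function(M) {
  dots <- crossprod(M)   # dots[i, j] = dot product of term vectors i and j
  sq <- diag(dots)       # squared length of each term vector
  dots / (outer(sq, sq, "+") - dots)
}

S <- tanimoto_sketch(M)

# Weighted encoding M[i, j] in {0, W[i]}: scale each row i by its weight W[i]
W <- c(1, 2, 3, 1)
S_w <- tanimoto_sketch(M * W)
```

With the binary matrix, termA and termB share two of the three subsumers that either of them has, so their similarity is 2/3.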

The Jaccard similarity is computed using the Tanimoto similarity definition for bit vectors (see Jaccard index at Wikipedia). For the result to be a valid Jaccard similarity, every weight must be either zero or one. If any weight has a different value, a warning is issued.
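For 0/1 vectors the set-based Jaccard index (size of the intersection over size of the union) and the bit-vector Tanimoto formula give the same number. The following base R sketch checks this on two toy vectors. The helper `is_binary()` is a hypothetical stand-in for the kind of weight check described above, not the package's own code.

```r
# Two toy binary term vectors over four subsumer classes (hypothetical data)
a <- c(1, 1, 1, 0)
b <- c(1, 1, 0, 0)

# Set view: |intersection| / |union|
jaccard_sets <- sum(a & b) / sum(a | b)

# Bit-vector Tanimoto view: a.b / (|a|^2 + |b|^2 - a.b)
tanimoto_bits <- sum(a * b) / (sum(a^2) + sum(b^2) - sum(a * b))

# Hypothetical check that an encoding contains only zeros and ones
is_binary <- function(x) all(x %in% c(0, 1))

if (!is_binary(c(a, b))) {
  warning("weights other than 0 and 1: result is not a Jaccard similarity")
}
```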

The Cosine similarity SC is computed using the Euclidean dot product formula. See Cosine similarity on Wikipedia. The metric is valid for any term vectors (columns of the subsumer matrix), i.e., M_{i,j} \in \{0, W_i\} is not required. Note that 1 - SC is not a proper distance metric, because it violates the triangle inequality. To obtain a distance metric, first convert the similarity to an angle (for example, by taking its arccosine).
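A minimal base R sketch of the dot product formula and of the conversion to an angular distance is given here. The helper `cosine_sketch()` and the toy matrix `M` are hypothetical, and clamping to [-1, 1] before `acos()` is a precaution added for this example against floating-point round-off.

```r
# Toy subsumer matrix (hypothetical data): rows = subsumers, columns = terms
M <- matrix(c(1, 1, 1,
              1, 1, 0,
              1, 0, 0,
              0, 0, 1),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("class", 1:4),
                            c("termA", "termB", "termC")))

# Hypothetical helper: S_C(a, b) = a.b / (|a| * |b|)
cosine_sketch <- function(M) {
  dots <- crossprod(M)          # pairwise dot products of the columns
  sq <- diag(dots)              # squared vector lengths
  dots / sqrt(outer(sq, sq))
}

SC <- cosine_sketch(M)

# Angle in radians; clamp first so round-off cannot push acos() out of range
ang <- acos(pmin(pmax(SC, -1), 1))
ang_dist <- as.dist(ang)
```

The resulting `ang_dist` can be passed to `hclust()` in place of `as.dist(1 - SC)` when a proper metric is needed.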

The Resnik similarity between two terms is the information content (IC) of their most informative common ancestor (MICA), which is the common subsumer with the greatest information content.
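The MICA definition can be illustrated with a small base R sketch. The frequencies, the toy matrix `M`, and the helper `resnik_sketch()` are all hypothetical. Returning 0 when two terms share no subsumer is an assumption made for this example (in an ontology with a single root that case does not arise).

```r
# Toy subsumer matrix (hypothetical data): rows = subsumers, columns = terms
M <- matrix(c(1, 1, 1,
              1, 1, 0,
              1, 0, 0,
              0, 0, 1),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("class", 1:4),
                            c("termA", "termB", "termC")))

# Hypothetical relative frequencies of the subsumer classes in a corpus
freqs <- c(class1 = 1.0, class2 = 0.5, class3 = 0.1, class4 = 0.2)
ic <- -log(freqs[rownames(M)], base = 10)   # information content per class

# Hypothetical helper: IC of the most informative common subsumer per term pair
resnik_sketch <- function(M, ic) {
  terms <- colnames(M)
  out <- matrix(0, length(terms), length(terms),
                dimnames = list(terms, terms))
  for (i in seq_along(terms)) {
    for (j in seq_along(terms)) {
      common <- M[, i] > 0 & M[, j] > 0      # classes subsuming both terms
      if (any(common)) out[i, j] <- max(ic[common])
    }
  }
  out
}

R <- resnik_sketch(M, ic)
```

Here termA and termB share class1 (IC 0) and class2 (IC of about 0.301), so class2 is their MICA. termA and termC share only the root class1, so their similarity is 0.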

Usage

tanimoto_similarity(subsumer_mat = NA, terms = NULL, ...)

jaccard_similarity(subsumer_mat = NA, terms = NULL, ...)

cosine_similarity(subsumer_mat = NA, terms = NULL, ...)

resnik_similarity(
  subsumer_mat = NA,
  terms = NULL,
  ...,
  wt = term_freqs,
  wt_args = list(),
  base = 10
)

Arguments

`subsumer_mat`	matrix or data.frame, the vector-encoded matrix M of subsumers for which `M_{i,j} = W_i, W_i > 0` (W = weights), if class i subsumes term j, and 0 otherwise. A binary (`M_{i,j} \in \{0, 1\}`) encoding (i.e., W[i] = 1) can be obtained from `subsumer_matrix()`.
`terms`	character, optionally the list of terms (as IRIs and/or labels) for which to generate a properly encoded subsumer matrix on the fly.
`...`	parameters to be passed on to `subsumer_matrix()` if a subsumer matrix is to be generated on the fly.
`wt`	numeric or a function. If numeric, weights for the subsumer terms. For `resnik_similarity`, these are expected to be information content (IC) scores, though any score will work for which a higher value means higher information content, and where a term will always have a score equal to or greater than any of its superclasses. If a function, it must accept parameter `x` as the vector of term IRIs and return a vector of frequencies (not IC scores) for the terms. The default is to use function `term_freqs()`. Subsumer terms with zero or missing (NA) frequency will be omitted from the calculation.
`wt_args`	list, named parameters for the function calculating term frequencies. Ignored if `wt` is not a function. For the default `wt` function `term_freqs()`, the main parameters are `as` and `corpus`.
`base`	integer, the base of the logarithm for calculating information content from term frequency. The default is 10.
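
As a sketch of the contract described for `wt`, the following base R code defines a frequency function that accepts `x` (a vector of term IRIs) and returns frequencies. It then applies the zero/NA filtering and the logarithm with the chosen `base` by hand. The function `my_freqs()`, its `lookup` parameter, and the example.org IRIs are hypothetical. How the package itself forwards `wt_args` to the function is not shown here.

```r
# Hypothetical frequency function: takes `x` (term IRIs), returns frequencies
my_freqs <- function(x, lookup) unname(lookup[x])

# Hypothetical frequency table; the third IRI below has no entry
lookup <- c("http://example.org/A" = 0.5,
            "http://example.org/B" = 0)
iris <- c("http://example.org/A",
          "http://example.org/B",
          "http://example.org/C")

f <- my_freqs(iris, lookup = lookup)   # one frequency per IRI, NA if unknown

# Drop terms with zero or missing frequency, then convert to information content
keep <- !is.na(f) & f > 0
ic <- -log(f[keep], base = 10)
```

Only the first IRI survives the filter in this toy case, because the second has frequency zero and the third is missing from the table.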

Value

A matrix with M[i,j] = similarity of terms i and j.

References

Philip Resnik (1995). "Using information content to evaluate semantic similarity in a taxonomy". Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI'95). 1: 448–453.

Examples

sm <- jaccard_similarity(terms = c("pelvic fin", "pectoral fin",
                                   "forelimb", "hindlimb",
                                   "dorsal fin", "caudal fin"),
                         .colnames = "label")
sm

# e.g., turn into distance matrix, cluster, and plot
plot(hclust(as.dist(1-sm)))
## Not run: 
phens <- get_phenotypes("basihyal bone", taxon = "Cyprinidae")
sm.ic <- resnik_similarity(terms = phens$id,
                           .colnames = "label", .labels = phens$label,
                           wt_args = list(as = "phenotype",
                                          corpus = "taxa"))
maxIC <- -log10(1 / corpus_size("taxa"))
# normalize by max IC, turn into distance matrix, cluster, and plot
plot(hclust(as.dist(1-sm.ic/maxIC)))

## End(Not run)
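
The normalization step in the example above can be tried offline on made-up numbers. In this sketch the similarity matrix `sim_ic` and the corpus size `corpus_n` are hypothetical values, not results from the Phenoscape Knowledgebase.

```r
# Hypothetical Resnik similarities (base-10 IC) for three terms
sim_ic <- matrix(c(3.0, 1.2, 0.0,
                   1.2, 2.5, 0.0,
                   0.0, 0.0, 2.0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("t1", "t2", "t3"),
                                 c("t1", "t2", "t3")))

# Maximum IC: a term annotated to exactly one of corpus_n corpus members
corpus_n <- 1000                 # hypothetical corpus size
max_ic <- -log10(1 / corpus_n)

# Scale to [0, 1], turn into distances, and cluster
sim_norm <- sim_ic / max_ic
d <- as.dist(1 - sim_norm)       # as.dist() uses only the lower triangle
hc <- hclust(d)
```

Because `as.dist()` ignores the diagonal, it does not matter that a term's normalized similarity to itself can be less than 1.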

xu-hong/rphenoscape documentation built on Oct. 30, 2024, 8:43 a.m.
