View source: R/dissimilarity.R
append_dissimilarities | R Documentation |
Runs cluster::daisy() on a data frame, breaks up the columns of the
resulting dissimilarity matrix into a list, and adds this list to the data
frame as a list column. In addition or instead, it adds a transformed version
of the dissimilarity list, which can be used as sampling weights.
append_dissimilarities(
  data,
  cols = dplyr::everything(),
  dissimilarity_measure_name = "dissimilarities",
  sampling_weight_name = "sampling_weights",
  metric = "gower",
  ...
)
data |
A data frame that has at least one row and at least one column. |
cols |
The columns of data to feed to cluster::daisy(). All columns are used by default. |
dissimilarity_measure_name, sampling_weight_name |
The names of the list columns that will be added to data. |
metric, ... |
Passed to cluster::daisy(). |
All columns are fed to cluster::daisy() by default, but the user can select
which ones using the cols argument.
Once the full dissimilarity matrix is obtained, the columns are separated
into a list via asplit() and appended to data. Each element of the list
is therefore a double vector with nrow(data) values. For any given
row, its dissimilarity vector represents the row's dissimilarity to every
row.
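The splitting step can be sketched directly with cluster::daisy() and asplit(). This is an illustration only, not the function's actual source; the toy data frame toy and the names d_mat and d_list are made up for the example.

```r
# Illustrative sketch only; `toy`, `d_mat`, and `d_list` are hypothetical names.
toy <- data.frame(
  x = c(1, 2, 4),
  g = factor(c("a", "a", "b"))
)

# Full pairwise dissimilarity matrix (3 x 3) using Gower's metric
d_mat <- as.matrix(cluster::daisy(toy, metric = "gower"))

# Split the matrix by column (MARGIN = 2): one vector per row of `toy`
d_list <- asplit(d_mat, 2)

length(d_list)  # one element per row of `toy`
d_list[[2]]     # dissimilarities between row 2 and rows 1, 2, and 3
```

Because the matrix is symmetric with a zero diagonal, each vector contains a 0 in the position of the row it belongs to.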
The optional/alternative "sampling weight" column is a transformed version of the dissimilarity list:
1. All dissimilarity measures of 0 are replaced with the next smallest dissimilarity value in the vector. In effect, this means that a row's dissimilarity to itself (and any rows identical to it) is replaced with the dissimilarity value of its next most similar row. (Exception: if all elements are 0, all of them are replaced with 1.)
2. Then the reciprocal of each element is taken so that larger values represent greater similarity.
3. Each element is divided by the sum of the vector, which standardizes the elements to add to 1.
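The three steps above can be sketched for a single dissimilarity vector as follows. This is a rough base R illustration, not the package's internal code; the helper name dissimilarity_to_weights is hypothetical, and the sketch assumes a non-negative numeric vector with no missing values.

```r
# Hypothetical helper, written only to illustrate the documented steps.
# Assumes `d` is a non-negative numeric vector without NA values.
dissimilarity_to_weights <- function(d) {
  if (all(d == 0)) {
    # Exception: every element is 0, so replace them all with 1
    d[] <- 1
  } else {
    # Step 1: replace zeros with the smallest nonzero dissimilarity
    d[d == 0] <- min(d[d > 0])
  }
  # Step 2: take reciprocals so larger values mean greater similarity
  w <- 1 / d
  # Step 3: standardize so the weights add to 1
  w / sum(w)
}

d <- c(0, 0, 0.2, 0.5)
w <- dissimilarity_to_weights(d)
w       # the two zero entries get the same weight as the 0.2 entry
sum(w)  # 1
```

A vector of weights like this could then be supplied as the prob argument of sample() to favor rows that are more similar to the row in question.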
Requires the package cluster to be installed.
A data frame, specifically the data argument with one or two more
columns added to the end.
# Running this on all mtcars columns
mtdissim <- append_dissimilarities(mtcars)
# Therefore, these numbers represent the dissimilarity of each row to the
# fifth row:
mtdissim$dissimilarities[[5]]
# And these are the dissimilarities' corresponding sampling weights:
mtdissim$sampling_weights[[5]]
# Now we run it on mtcars without the wt and qsec columns so that we purposely
# end up with some duplicate rows (the first and second).
mtdissim_dup <- append_dissimilarities(mtcars, cols = !c(wt, qsec))
# These represent each row's dissimilarity to the first row.
# Since we specifically told it not to take wt and qsec into account, the
# first two rows are identical. Therefore, the first two values are both zero.
mtdissim_dup$dissimilarities[[1]]
# Here are the corresponding sampling weights. Notice that the first two
# rows' sampling weights are the same as the sampling weight of row 30, which
# is the next most similar row.
mtdissim_dup$sampling_weights[[1]]