View source: R/distances_omv.R
distances_omv | R Documentation |
Calculates distances (returning a symmetric matrix) from a raw data matrix in .omv-files for the statistical spreadsheet 'jamovi' (https://www.jamovi.org)
distances_omv(
  dtaInp = NULL,
  fleOut = "",
  varDst = c(),
  clmDst = TRUE,
  stdDst = "none",
  nmeDst = "euclid",
  mtxSps = FALSE,
  mtxTrL = FALSE,
  mtxDgn = TRUE,
  usePkg = c("foreign", "haven"),
  selSet = "",
  ...
)
dtaInp |
Either a data frame or the name of a data file to be read (including the path, if required; "FILENAME.ext"; default: NULL); files can be of any supported file type, see Details below. |
fleOut |
Name of the data file to be written (including the path, if required; "FILE_OUT.omv"; default: ""); if empty, the resulting data frame is returned instead. |
varDst |
Variable (default: c()) containing a character vector with the names of the variables for which distances are to be calculated. See Details for more information. |
clmDst |
Whether the distances shall be calculated between columns (TRUE) or rows (FALSE; default: TRUE). See Details for more information. |
stdDst |
Character string indicating whether the variables in varDst are to be standardized and how (default: "none"). See Details for more information. |
nmeDst |
Character string indicating which distance measure is to be calculated (default: "euclid"). See Details for more information. |
mtxSps |
Whether the symmetric matrix to be returned should be sparse (default: FALSE). |
mtxTrL |
Whether the symmetric matrix to be returned should only contain the lower triangle (default: FALSE). |
mtxDgn |
Whether the symmetric matrix to be returned should retain the values in the main diagonal (default: TRUE). |
usePkg |
Name of the package: "foreign" or "haven" that shall be used to read SPSS, Stata, and SAS files; "foreign" is the default (it comes with base R), but "haven" is newer and more comprehensive. |
selSet |
Name of the data set that is to be selected from the workspace (only applies when reading .RData-files). |
... |
Additional arguments passed on to methods; see Details below. |
varDst must be a character vector containing the names of the variables to calculate distances
over. If clmDst is set to TRUE, distances are calculated between all possible pairs of variables
and over subjects / rows in the original data frame. If clmDst is set to FALSE, distances are
calculated between participants and over all variables given in varDst. If clmDst is set to TRUE,
the symmetric matrix that is returned has the size V x V (V being the number of variables in
varDst; if mtxSps is set to TRUE, the size is (V - 1) x (V - 1), see below); if clmDst is set to
FALSE, the symmetric matrix that is returned has the size R x R (R being the number of rows in
the original dataset; if mtxSps is set to TRUE, the size is (R - 1) x (R - 1), see below).
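As a rough illustration of the two orientations, the following sketch uses stats::dist() from base R directly (not distances_omv itself); the data frame dtaExm is made up for the example. It only shows how the size of the result depends on whether distances are taken between columns or between rows.

```r
# made-up example data: 20 rows (participants) x 5 columns (variables)
set.seed(1)
dtaExm <- as.data.frame(matrix(rnorm(100), ncol = 5))

# between columns (comparable to clmDst = TRUE): transpose first, so that
# each variable becomes one "observation" for stats::dist()
dstClm <- as.matrix(stats::dist(t(dtaExm), method = "euclidean"))

# between rows (comparable to clmDst = FALSE): stats::dist() compares rows
dstRow <- as.matrix(stats::dist(dtaExm, method = "euclidean"))

dim(dstClm)  # V x V
dim(dstRow)  # R x R
```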
stdDst can be one of the following calculations to standardize the selected variables before
calculating the distances: none (do not standardize; default), z (z scores), sd (divide by the
std. dev.), range (divide by the range), max (divide by the absolute maximum), mean (divide by
the mean), rescale (subtract the mean and divide by the range).
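The standardization options can be sketched in base R as follows. The list stdFnc is a hypothetical helper (not part of jmvReadWrite); its formulas simply follow the verbal descriptions above and are applied per variable, which is an assumption. The implementation inside the package may differ in detail.

```r
# hypothetical helper: one function per stdDst option, following the
# verbal descriptions above (assumption: applied separately to each variable)
stdFnc <- list(
  none    = function(x) x,
  z       = function(x) (x - mean(x)) / stats::sd(x),
  sd      = function(x) x / stats::sd(x),
  range   = function(x) x / diff(range(x)),
  max     = function(x) x / max(abs(x)),
  mean    = function(x) x / mean(x),
  rescale = function(x) (x - mean(x)) / diff(range(x))
)

# made-up example data: 20 rows x 5 columns
set.seed(1)
dtaExm <- as.data.frame(matrix(rnorm(100, mean = 50, sd = 10), ncol = 5))

# standardize column by column (here: z scores), then calculate the
# Euclidean distances between the standardized columns
dtaStd <- as.data.frame(lapply(dtaExm, stdFnc[["z"]]))
dstStd <- as.matrix(stats::dist(t(dtaStd)))
```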
nmeDst can be one of the following distance measures.
(1) For interval data: euclid (Euclidean), seuclid (squared Euclidean), block (city block /
Manhattan), canberra (Canberra), chebychev (maximum distance / supremum norm / Chebychev),
minkowski_p (Minkowski with power p; NB: needs p), power_p_r (Minkowski with power p, and the
r-th root; NB: needs p and r), cosine (cosine between the two vectors), correlation
(correlation between the two vectors).
(2) For frequency count data: chisq (chi-square dissimilarity between two sets of
frequencies), ph2 (chi-square dissimilarity normalized by the square root of the number of
values used in the calculation).
(3) For binary data: all measures have two optional parts, p and np, which indicate the value
that codes presence (p; defaults to 1 if not given) or absence (np; defaults to zero if not
given).
(a) matching coefficients: rr_p_np (Russell and Rao), sm_p_np (simple matching), jaccard_p_np /
jaccards_p_np (Jaccard similarity; as in SPSS), jaccardd_p_np (Jaccard dissimilarity; as in
dist(..., "binary") in R), dice_p_np (Dice or Czekanowski or Sørensen similarity), ss1_p_np
(Sokal and Sneath measure 1), rt_p_np (Rogers and Tanimoto), ss2_p_np (Sokal and Sneath
measure 2), k1_p_np (Kulczynski measure 1), ss3_p_np (Sokal and Sneath measure 3).
(b) conditional probabilities: k2_p_np (Kulczynski measure 2), ss4_p_np (Sokal and Sneath
measure 4), hamann_p_np (Hamann).
(c) predictability measures: lambda_p_np (Goodman and Kruskal Lambda), d_p_np (Anderberg’s D),
y_p_np (Yule’s Y coefficient of colligation), q_p_np (Yule’s Q).
(d) other measures: ochiai_p_np (Ochiai), ss5_p_np (Sokal and Sneath measure 5), phi_p_np
(fourfold point correlation), beuclid_p_np (binary Euclidean distance), bseuclid_p_np (binary
squared Euclidean distance), size_p_np (size difference), pattern_p_np (pattern difference),
bshape_p_np (binary shape difference), disper_p_np (dispersion similarity), variance_p_np
(variance dissimilarity), blwmn_p_np (binary Lance and Williams non-metric dissimilarity).
(4) none (only carry out standardization, if stdDst is different from none).
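To make some of these measure names more concrete, here is a minimal base-R sketch with made-up vectors. The first part shows stats::dist() methods that correspond to some of the interval measures. The second part computes a few matching coefficients from the cells of a 2 x 2 table, using the textbook formulas. The variable names are chosen for illustration only, and the code is not taken from the package.

```r
# (1) interval data: two made-up numeric vectors, compared as rows
x <- c(1, 4, 2, 8)
y <- c(3, 1, 2, 5)
m <- rbind(x, y)
dstEuc <- as.numeric(stats::dist(m, method = "euclidean"))         # euclid
dstSqE <- dstEuc^2                                                 # seuclid
dstBlk <- as.numeric(stats::dist(m, method = "manhattan"))         # block
dstChb <- as.numeric(stats::dist(m, method = "maximum"))           # chebychev
dstMnk <- as.numeric(stats::dist(m, method = "minkowski", p = 3))  # minkowski_3

# (3) binary data: presence is coded with p, absence with np
p  <- 1
np <- 0
b1 <- c(1, 1, 0, 0, 1, 0)
b2 <- c(1, 0, 0, 1, 1, 0)
# cells of the 2 x 2 table
nA <- sum(b1 == p  & b2 == p)   # present in both
nB <- sum(b1 == p  & b2 == np)  # present only in b1
nC <- sum(b1 == np & b2 == p)   # present only in b2
nD <- sum(b1 == np & b2 == np)  # absent in both

rslRR  <- nA / (nA + nB + nC + nD)          # Russell and Rao
rslSM  <- (nA + nD) / (nA + nB + nC + nD)   # simple matching
jacSim <- nA / (nA + nB + nC)               # Jaccard similarity
jacDis <- 1 - jacSim                        # Jaccard dissimilarity
rslDce <- 2 * nA / (2 * nA + nB + nC)       # Dice

# the Jaccard dissimilarity matches the "binary" method of stats::dist()
dstBin <- as.numeric(stats::dist(rbind(b1, b2), method = "binary"))
```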
If mtxSps is set to TRUE, a sparse matrix is returned. Such matrices are similar to the format
one often finds for correlation matrices: values are only retained in the lower triangle, the
columns range from the first to the second-to-last variable in varDst (or, when clmDst is set to
FALSE, from the first to the second-to-last row of the original dataset), and the rows range
from the second to the last variable in varDst (or, when clmDst is set to FALSE, from the second
to the last row of the original dataset).
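The sparse layout described above can be sketched in base R as follows. This is only an illustration of the layout as it is described in the text (first row and last column dropped, lower triangle kept); the object names are made up and the code is not taken from the package.

```r
# made-up example data: 20 rows x 5 named columns
set.seed(1)
dtaExm <- as.data.frame(matrix(rnorm(100), ncol = 5,
  dimnames = list(NULL, sprintf("C_%02d", seq(5)))))

# full symmetric V x V matrix with the distances between columns
mtxFll <- as.matrix(stats::dist(t(dtaExm)))
nmbVar <- ncol(mtxFll)

# drop the first row and the last column: rows now hold the second to the
# last variable, columns the first to the second-to-last variable
mtxRed <- mtxFll[-1, -nmbVar]
# keep only the lower triangle of the reduced (V - 1) x (V - 1) matrix
mtxRed[upper.tri(mtxRed)] <- NA

print(round(mtxRed, 2))
```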
By default, a full symmetric matrix is returned (i.e., a matrix that has no NAs in any cell).
This behaviour can be changed by setting mtxTrL and mtxDgn: If mtxTrL is set to TRUE, the values
from the upper triangular matrix are removed / replaced with NAs; if mtxDgn is set to FALSE, the
values from the main diagonal are removed / replaced with NAs.
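The effect of these two switches can be imitated on a full distance matrix in base R. The sketch below uses made-up data and object names and only mirrors the behaviour described in the text.

```r
# made-up example data: 20 rows x 5 columns
set.seed(1)
dtaExm <- as.data.frame(matrix(rnorm(100), ncol = 5))

# full symmetric matrix (the default): no NAs in any cell
mtxFll <- as.matrix(stats::dist(t(dtaExm)))

# comparable to mtxTrL = TRUE: upper triangle replaced with NAs
mtxLwr <- mtxFll
mtxLwr[upper.tri(mtxLwr)] <- NA

# comparable to mtxDgn = FALSE: main diagonal replaced with NAs
mtxNoD <- mtxFll
diag(mtxNoD) <- NA
```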
The ellipsis-parameter (...) can be used to submit arguments / parameters to the functions that
are used for reading and writing the data. By clicking on the respective function under
“See also”, you can get a more detailed overview of which parameters each of those functions
takes. The functions are: read_omv and write_omv (for jamovi-files), read.table (for CSV / TSV
files; using similar defaults as read.csv for CSV and read.delim for TSV which both are based
upon read.table), load (for .RData-files), readRDS (for .rds-files), read_sav (needs the
R-package haven) or read.spss (needs the R-package foreign) for SPSS-files, read_dta (haven) /
read.dta (foreign) for Stata-files, read_sas (haven) for SAS-data-files, and read_xpt (haven) /
read.xport (foreign) for SAS-transport-files. If you would like to use haven, you may need to
install it using install.packages("haven", dep = TRUE).
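How arguments travel through an ellipsis can be illustrated with a small wrapper around utils::read.table(). The function rdTbl below is hypothetical and not part of jmvReadWrite; it only demonstrates the general R mechanism of handing on arguments that are not matched by name.

```r
# hypothetical wrapper (not part of jmvReadWrite): every argument that is not
# matched by name ends up in ... and is handed on to utils::read.table()
rdTbl <- function(fleInp = "", ...) {
  utils::read.table(fleInp, header = TRUE, ...)
}

# write a small semicolon-separated file with decimal commas ...
fleTmp <- tempfile(fileext = ".csv")
writeLines(c("A;B", "1,5;2,5", "3,0;4,0"), fleTmp)

# ... and pass sep and dec through the ellipsis
dtaFrm <- rdTbl(fleTmp, sep = ";", dec = ",")
unlink(fleTmp)
```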
A data frame containing a symmetric matrix with the distances between the variables / columns
(clmDst == TRUE) or rows (clmDst == FALSE); it is only returned if fleOut is empty.
distances_omv internally uses the following function for calculating the distances for interval
data: stats::dist(). It furthermore uses the following functions for reading and writing data
files in different formats: read_omv() and write_omv() for jamovi-files, utils::read.table()
for CSV / TSV files, load() for reading .RData-files, readRDS() for .rds-files,
haven::read_sav() or foreign::read.spss() for SPSS-files, haven::read_dta() or
foreign::read.dta() for Stata-files, haven::read_sas() for SAS-data-files, and
haven::read_xpt() or foreign::read.xport() for SAS-transport-files.
## Not run:
# create matrices for the different types of distance measures: continuous
# (cntFrm), frequency counts (frqFrm) or binary (binFrm); all 20 R x 5 C
set.seed(1)
cntFrm <- stats::setNames(as.data.frame(matrix(rnorm(100, sd = 10),
  ncol = 5)), sprintf("C_%02d", seq(5)))
frqFrm <- stats::setNames(as.data.frame(matrix(sample(seq(10), 100,
  replace = TRUE), ncol = 5)), sprintf("F_%02d", seq(5)))
binFrm <- stats::setNames(as.data.frame(matrix(sample(c(TRUE, FALSE), 100,
  replace = TRUE), ncol = 5)), sprintf("B_%02d", seq(5)))
nmeOut <- tempfile(fileext = ".omv")
# calculates the distances between columns, nmeDst is not required: "euclid"
# is the default
jmvReadWrite::distances_omv(dtaInp = cntFrm, fleOut = nmeOut, varDst =
  names(cntFrm), nmeDst = "euclid")
dtaFrm <- jmvReadWrite::read_omv(nmeOut)
unlink(nmeOut)
# the resulting matrix (5 x 5) with the Euclidean distances
print(dtaFrm)
# calculates the (Euclidean) distances between rows (clmDst = FALSE)
jmvReadWrite::distances_omv(dtaInp = cntFrm, fleOut = nmeOut, varDst =
  names(cntFrm), clmDst = FALSE, nmeDst = "euclid")
dtaFrm <- jmvReadWrite::read_omv(nmeOut)
unlink(nmeOut)
# the resulting matrix (20 x 20) with the Euclidean distances
print(dtaFrm)
# calculates the (Euclidean) distances between columns; the original data
# are z-standardized before calculating the distances (stdDst = "z")
jmvReadWrite::distances_omv(dtaInp = cntFrm, fleOut = nmeOut, varDst =
  names(cntFrm), stdDst = "z", nmeDst = "euclid")
dtaFrm <- jmvReadWrite::read_omv(nmeOut)
unlink(nmeOut)
# the resulting matrix (5 x 5) with the Euclidean distances using the
# z-standardized data
print(dtaFrm)
# calculates the correlations between columns
jmvReadWrite::distances_omv(dtaInp = cntFrm, fleOut = nmeOut, varDst =
  names(cntFrm), nmeDst = "correlation")
dtaFrm <- jmvReadWrite::read_omv(nmeOut)
unlink(nmeOut)
# the resulting matrix (5 x 5) with the correlations
print(dtaFrm)
# calculates the chi-square dissimilarity (nmeDst = "chisq") between columns
jmvReadWrite::distances_omv(dtaInp = frqFrm, fleOut = nmeOut, varDst =
  names(frqFrm), nmeDst = "chisq")
dtaFrm <- jmvReadWrite::read_omv(nmeOut)
unlink(nmeOut)
# the resulting matrix (5 x 5) with the chi-square dissimilarities
print(dtaFrm)
# calculates the Jaccard similarity (nmeDst = "jaccard") between columns
jmvReadWrite::distances_omv(dtaInp = binFrm, fleOut = nmeOut, varDst =
  names(binFrm), nmeDst = "jaccard")
dtaFrm <- jmvReadWrite::read_omv(nmeOut)
unlink(nmeOut)
# the resulting matrix (5 x 5) with the Jaccard similarities
print(dtaFrm)
## End(Not run)