knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
library(devtools)
install_github("aj-grant/navmix", build_vignettes = TRUE)
library(navmix)
The navmix package implements a clustering method which clusters observed vectors based on their direction from the origin. It fits a mixture model of von Mises--Fisher distributions including a noise cluster to provide robustness to outliers. It also includes an automatic method for choosing the number of clusters using the BIC.
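For reference, the standard density of a von Mises--Fisher distribution on the unit sphere in $\mathbb{R}^m$ is given below. The symbols $\mu$ (unit-norm mean direction), $\kappa$ (concentration) and $C_m$ (normalising constant) are notation introduced here for illustration; the exact parameterisation used inside the package, and the form of the noise cluster, are not stated in this document.

$$
f(x \mid \mu, \kappa) = C_m(\kappa) \exp\left(\kappa \mu^{\top} x\right),
\qquad
C_m(\kappa) = \frac{\kappa^{m/2 - 1}}{(2\pi)^{m/2} \, I_{m/2 - 1}(\kappa)},
$$

where $\lVert x \rVert = \lVert \mu \rVert = 1$, $\kappa \geq 0$, and $I_v$ denotes the modified Bessel function of the first kind of order $v$. Larger values of $\kappa$ concentrate the distribution more tightly around the direction $\mu$.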
The input is an $n \times m$ matrix, where $m > 1$ and $n > m$. Each of the $n$ rows represents an $m$-dimensional observation. Because the method clusters based on direction, the first step in the algorithm is to normalise these observed vectors to have norm $1$. Optional parameters are:
The other parameters control the convergence criteria and the plots.
One motivation for this clustering technique is the analysis of genetic association data. In this setting, the inputs are estimates of the association between $n$ genetic variants and $m$ different traits. These estimated associations may, for example, be obtained from GWAS summary statistics. The following section illustrates how to cluster genetic association data using navmix.
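The normalisation step described above can be illustrated in base R. This is a minimal sketch on a made-up matrix, not the package's internal code; the names x, row_norms and x_unit are chosen here for illustration only.

```r
# Toy 4 x 3 matrix: each row is one 3-dimensional observation (made-up values).
x <- matrix(c( 3.0,  4.0, 0,
               1.0,  1.0, 1,
              -2.0,  0.0, 0,
               0.5, -0.5, 2),
            nrow = 4, byrow = TRUE)

# Euclidean norm of each row.
row_norms <- sqrt(rowSums(x^2))

# Divide every row by its own norm. Recycling works row-wise here because
# length(row_norms) equals nrow(x) and matrices are stored column by column.
x_unit <- x / row_norms

# Every row of x_unit now has norm 1; only the direction of each row is kept.
sqrt(rowSums(x_unit^2))
```

After this step, two observations that differ only by a positive scale factor become identical, which is why the clustering depends on direction and not magnitude.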
Consider the case where we wish to cluster $n$ genetic variants according to their associations with $m$ traits. Let B be an $n \times m$ matrix containing the estimated associations ($\beta$ coefficients) and let S be an $n \times m$ matrix containing their standard errors. The recommended input is the standardised associations given by the element-wise ratio B / S.
B_std = B/S
cluster_out = navmix(B_std)
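To make the shapes concrete, the following sketch simulates B and S and forms the standardised matrix. All values, the variant labels ("rs1", ...) and the trait names are invented for illustration; real inputs would come from GWAS summary statistics.

```r
set.seed(1)
n <- 50  # number of genetic variants (rows)
m <- 3   # number of traits (columns)

# Simulated association estimates and standard errors (illustration only).
B <- matrix(rnorm(n * m, sd = 0.05), nrow = n, ncol = m,
            dimnames = list(paste0("rs", seq_len(n)),
                            c("trait1", "trait2", "trait3")))
S <- matrix(runif(n * m, min = 0.01, max = 0.02), nrow = n, ncol = m)

# B and S must have the same shape, and standard errors must be positive.
stopifnot(identical(dim(B), dim(S)), all(S > 0))

# Element-wise division: entry (i, j) is the z-score of variant i for trait j.
B_std <- B / S
head(B_std, 3)
```

The resulting B_std has the same dimensions and row and column names as B, and is the matrix that would then be passed to navmix.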
An alternative way to standardise the associations is to include estimates of the correlation between traits. This may
improve the clustering if the traits are highly correlated and the genetic associations are estimated in the same, or
overlapping, sample(s). In this case, the full covariance matrices of the genetic variant-trait associations can be
estimated from the standard errors and trait correlation estimates. Let R be an $m \times m$ matrix with (i, j)th entry
the estimated correlation between traits i and j. The standardised associations can be obtained using the function
row_standardise.
B_std = navmix::row_standardise(B, S, R)
cluster_out = navmix(B_std)
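The idea of standardising with a full covariance matrix can be sketched as follows. This is an assumption about one plausible approach, not a description of what row_standardise does internally: for each variant i, the covariance is taken to be diag(S[i, ]) R diag(S[i, ]), and the row of B is multiplied by its symmetric inverse square root. The helper standardise_rows_sketch is hypothetical and is not part of navmix.

```r
# Hypothetical helper (NOT the navmix implementation): whiten each row of B
# using the covariance Sigma_i = diag(S[i, ]) %*% R %*% diag(S[i, ]).
standardise_rows_sketch <- function(B, S, R) {
  m <- ncol(B)
  stopifnot(identical(dim(B), dim(S)), nrow(R) == m, ncol(R) == m)
  out <- B
  for (i in seq_len(nrow(B))) {
    D <- diag(S[i, ], nrow = m)
    Sigma_i <- D %*% R %*% D
    # Symmetric inverse square root via the eigendecomposition of Sigma_i.
    e <- eigen(Sigma_i, symmetric = TRUE)
    inv_sqrt <- e$vectors %*% diag(1 / sqrt(e$values), nrow = m) %*% t(e$vectors)
    out[i, ] <- inv_sqrt %*% B[i, ]
  }
  out
}

# Toy inputs (made-up values): 3 variants, 2 traits with correlation 0.5.
B <- matrix(c(0.10, -0.04, 0.02, 0.06, 0.03, -0.05), nrow = 3, ncol = 2)
S <- matrix(c(0.02, 0.01, 0.01, 0.03, 0.02, 0.01), nrow = 3, ncol = 2)
R <- matrix(c(1, 0.5, 0.5, 1), nrow = 2, ncol = 2)

B_std_sketch <- standardise_rows_sketch(B, S, R)

# With uncorrelated traits (R = identity) the sketch reduces to B / S.
all.equal(standardise_rows_sketch(B, S, diag(2)), B / S)
```

The check on the last line shows how this approach relates to the simpler standardisation: when the traits are uncorrelated, the two coincide.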
Finally, the unstandardised associations may be used, in which case the input matrix is simply B.
cluster_out = navmix(B)
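There are four plot options: a heat map of the normalised observations, a heat map of the cluster mean vectors, a parallel plot of the mean vectors, and a radial plot of the mean vectors.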
Note that the heat maps and parallel plot are output as ggplot objects, whereas the radial plot is not.
The heat map of the observations is produced if the optional parameters plot and plot_heat are set to TRUE
(both are TRUE by default).
The plotted values are the normalised observations (representing, for example, the proportional associations of the
genetic variants with each trait). The traits are re-ordered so that those that are more alike with respect to their
associations with the observations are closer together in the plot. This re-ordering can be turned off by setting
reorder_traits = FALSE.
The mean vectors of the fitted vMF distributions represent observations at the centre of each cluster. If the optional
parameter plot is set to TRUE, a heat map of the mean vectors is produced if plot_heat_mu = TRUE (default = FALSE),
a parallel plot of the mean vectors is produced if plot_parallel = TRUE (default = TRUE), and a radial plot is
produced if plot_radial = TRUE (default = FALSE). The parameter plot_radial_options is a list with optional
parameters: plot_radial_separate determines whether the mean vectors are plotted on separate plots instead of the same
plot (default = FALSE); radial_legend_pos determines the position of the legend in the plot; and radial_separate_col
determines how many columns the radial plots are arranged in when plot_radial_separate = TRUE (default = 2).
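As an illustration of the plotting parameters described above, the sketch below collects a set of plot options and passes them to navmix. The data are simulated here purely for illustration, the variable name plot_args is an assumption of this example, and the call is only attempted if the navmix package is installed.

```r
set.seed(42)

# Simulated input (illustration only): two groups of 30 observations in
# 3 dimensions, scattered around two different directions.
g1 <- matrix(rnorm(30 * 3, sd = 0.5), ncol = 3) +
  matrix(c(5, 0, 1), nrow = 30, ncol = 3, byrow = TRUE)
g2 <- matrix(rnorm(30 * 3, sd = 0.5), ncol = 3) +
  matrix(c(0, 5, -1), nrow = 30, ncol = 3, byrow = TRUE)
B_std <- rbind(g1, g2)

# Plot options named in the text; defaults are overridden where noted.
plot_args <- list(
  plot           = TRUE,   # master switch for all plots
  plot_heat      = TRUE,   # heat map of the normalised observations
  reorder_traits = FALSE,  # keep the original trait order in the heat map
  plot_heat_mu   = TRUE,   # heat map of the mean vectors (default FALSE)
  plot_parallel  = TRUE,   # parallel plot of the mean vectors
  plot_radial    = FALSE   # radial plot of the mean vectors
)

# Run the clustering only if the package is available.
if (requireNamespace("navmix", quietly = TRUE)) {
  cluster_out <- do.call(navmix::navmix, c(list(B_std), plot_args))
}
```

Keeping the options in a list makes it easy to reuse the same plot settings across several calls, for example when comparing the standardised and unstandardised inputs.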