knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
clustur was developed to be similar to mothur's cluster function, which was
written in C++. To cluster their data, users need to provide their own sparse or
Phylip-formatted distance matrix. They also need to provide a count table that
either comes from mothur or that they create in R. Once these objects are built,
users can call the cluster() function. We currently support five methods:
opticlust (default) and furthest, nearest, weighted, and average neighbor. The
opticlust method is the default in both cluster() and mothur. The speed of the
methods implemented in {clustur} and mothur is comparable; {clustur} may even be
faster! Below we will show you how to create your sparse matrix and count table.
If you do not have a count table, clustur can produce one for you, but it will
assume the abundance of each sequence is one and it will only cluster the
sequences in the distance matrix. The output of running cluster() includes what
is typically provided in a mothur-formatted shared file.
For the official release from CRAN you can use the standard install.packages()
function:
# install via cran
install.packages("clustur")
For the development version, you can use the install_github() function from
the {devtools} package:
# install via github
devtools::install_github("SchlossLab/clustur")
Because {clustur}'s functions make use of a random number generator, users are strongly encouraged to set the seed.
library(clustur)
set.seed(19760620)
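The effect of setting the seed can be seen with base R alone. The sketch below does not call {clustur}; it only illustrates that resetting the same seed reproduces the same random draws, which is why seeded clustering runs are repeatable.

```r
# Draw a random permutation after setting the seed
set.seed(19760620)
first_draw <- sample(10)

# Reset the same seed and draw again
set.seed(19760620)
second_draw <- sample(10)

# The two permutations match because the generator started from the same state
same_draws <- identical(first_draw, second_draw)
print(same_draws)
```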
clustur will produce the same output using either a sparse (default) or full count table:
full_count_table <- read_count(example_path("amazon.full.count_table"))
sparse_count_table <- read_count(example_path("amazon.sparse.count_table"))
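For readers who want to build a count table in R rather than read one from a file, the following base-R sketch assembles a data frame in the layout mothur uses for full count tables (a Representative_Sequence column, a total column, then one column per sample). The sequence names, sample names, and counts are hypothetical, and whether {clustur} accepts this exact data frame should be checked against the package documentation.

```r
# Hypothetical per-sample counts for three sequences in two samples
counts <- data.frame(
  Representative_Sequence = c("seq1", "seq2", "seq3"),
  sample_a = c(3, 0, 1),
  sample_b = c(1, 2, 0)
)

# Every column except the sequence name is treated as a sample
sample_cols <- setdiff(names(counts), "Representative_Sequence")

# Assemble the table: sequence name, total across samples, per-sample counts
count_table <- data.frame(
  Representative_Sequence = counts$Representative_Sequence,
  total = unname(rowSums(counts[sample_cols])),
  counts[sample_cols]
)

print(count_table)
```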
clustur will read both mothur's column/sparse distance matrix and Phylip-formatted distance matrix formats.
column_distance <- read_dist(example_path("amazon_column.dist"), full_count_table, cutoff = 0.03)
or
phylip_distance <- read_dist(example_path("amazon_phylip.dist"), full_count_table, cutoff = 0.03)
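To make the cutoff argument concrete, here is a base-R sketch of a column-format distance list (one row per pair of sequences) being thresholded. The sequence names and distances are made up, and the assumption that pairs above the cutoff are simply dropped is illustrative rather than a description of how read_dist() is implemented.

```r
# Hypothetical pairwise distances in column (sparse) form
pairs <- data.frame(
  seq_a = c("seq1", "seq1", "seq2"),
  seq_b = c("seq2", "seq3", "seq3"),
  distance = c(0.01, 0.05, 0.03)
)

# Keep only pairs whose distance is at or below the threshold (assumed behavior)
threshold <- 0.03
kept <- pairs[pairs$distance <= threshold, ]

print(kept)
```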
The value returned by read_dist() (for example, column_distance) is a memory
address rather than the distances themselves. If you want a data frame version
of the distances, you can use get_distance_df().
get_distance_df(column_distance)
get_distance_df(phylip_distance)
The default method for clustering in cluster() is "opticlust":
cutoff <- 0.03
cluster_data <- cluster(column_distance, cutoff)
The other methods are selected with the method argument:
cluster_data <- cluster(column_distance, cutoff, method = "furthest")
cluster_data <- cluster(column_distance, cutoff, method = "nearest")
cluster_data <- cluster(column_distance, cutoff, method = "average")
cluster_data <- cluster(column_distance, cutoff, method = "weighted")
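If you want to keep the results from every method instead of overwriting cluster_data each time, one option is to loop over the method names. This is only a sketch: it uses the calls shown in this vignette, guards on the package being installed so the snippet runs either way, and reports the number of rows returned by get_bins() without assuming anything about that data frame's columns.

```r
# Method names as listed in this vignette
methods <- c("opticlust", "furthest", "nearest", "average", "weighted")
results <- list()

if (requireNamespace("clustur", quietly = TRUE)) {
  library(clustur)
  set.seed(19760620)

  # Read the example count table and column-format distances
  count_table <- read_count(example_path("amazon.full.count_table"))
  distances <- read_dist(example_path("amazon_column.dist"),
                         count_table, cutoff = 0.03)

  # Run each method and store the result under the method's name
  results <- lapply(setNames(methods, methods), function(m) {
    cluster(distances, 0.03, method = m)
  })

  # Number of rows in each method's bins data frame
  print(vapply(results, function(x) nrow(get_bins(x)), integer(1)))
}
```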
All methods produce a list object with an indicator of the cutoff that was used
(label), as well as cluster composition (cluster) and shared (abundance) data
frames. The cluster data frame shows which OTU (Operational Taxonomic Unit) each
sequence was assigned to. The abundance data frame contains columns indicating
the OTU and sample identifiers and the abundance of each OTU in each sample. The
OptiClust method also includes the metrics data frame, which describes the
optimization value for each iteration in the fitting process; the data in
cluster and abundance are taken from the last iteration.
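Because the abundance data frame is in long form (one row per OTU and sample), it can be reshaped into the wide sample-by-OTU layout of a shared file with base R. The sketch below uses a toy data frame; the column names otu, sample, and abundance and all of the values are hypothetical stand-ins, so adjust them to the names in your own output.

```r
# Toy long-format abundance data: one row per OTU and sample (hypothetical)
abundance_long <- data.frame(
  otu = c("Otu1", "Otu1", "Otu2", "Otu2", "Otu3"),
  sample = c("A", "B", "A", "B", "A"),
  abundance = c(5, 2, 0, 7, 1)
)

# Cross-tabulate so rows are samples and columns are OTUs;
# combinations missing from the long data are filled with 0
shared_wide <- xtabs(abundance ~ sample + otu, data = abundance_long)

print(shared_wide)
```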
clustur provides getter functions, get_cutoff(), get_bins(), get_abundance(),
and get_metrics(), which are demonstrated below.
clusters <- cluster(column_distance, cutoff, method = "opticlust")
get_cutoff(clusters)
get_bins(clusters)
get_abundance(clusters)
get_metrics(clusters)