clustur

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Background

clustur was developed to be similar to mothur's cluster function, which is written in C++. To cluster their data, users need to provide their own sparse (column) or Phylip-formatted distance matrix. They also need to provide a count table that either comes from mothur or that they create in R. Once these objects are built, users can call the cluster() function. We currently support five methods: opticlust (default) and furthest, nearest, weighted, and average neighbor. The opticlust method is the default in both cluster() and mothur. The speed of the methods implemented in {clustur} is comparable to that of mothur; {clustur} may even be faster! Below we will show you how to read in your distance matrix and count table. If you do not have a count table, clustur can produce one for you, but it will assume the abundance of each sequence is one and it will only cluster the sequences in the distance matrix. The output of running cluster() includes what is typically provided in a mothur-formatted shared file.

Starting Up

For the official release from CRAN you can use the standard install.packages() function:

# install via CRAN
install.packages("clustur")

For the development version, you can use the install_github() function from the {devtools} package:

# install via GitHub
devtools::install_github("SchlossLab/clustur")

Because {clustur}'s functions make use of a random number generator, users are strongly encouraged to set the seed.

library(clustur)
set.seed(19760620)
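
To see why setting the seed matters, here is a minimal base R sketch showing that resetting the same seed reproduces the same random draws. It does not call clustur, and the use of sample() is only an illustration, not a claim about how clustur draws random numbers internally.

```r
# Illustration only: base R random draws, not clustur internals
set.seed(19760620)
first_draw <- sample(10)

set.seed(19760620)
second_draw <- sample(10)

# Same seed, same shuffle
identical(first_draw, second_draw)
#> [1] TRUE
```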

Read count files

clustur will produce the same output using either a sparse (default) or full count table.

full_count_table <- read_count(example_path("amazon.full.count_table"))
sparse_count_table <- read_count(example_path("amazon.sparse.count_table"))
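
If you need to create a count table in R rather than export one from mothur, a minimal sketch might look like the following. It assumes the full mothur count table layout (a Representative_Sequence column, a total column, then one column per sample). The sequence names, sample names, and counts are hypothetical, and you should confirm in the package documentation that read_count() accepts a file written this way.

```r
# Hypothetical per-sample counts for three sequences
count_table <- data.frame(
  Representative_Sequence = c("seq_a", "seq_b", "seq_c"),
  sample1 = c(3, 1, 0),
  sample2 = c(0, 2, 5)
)

# A full count table lists the total abundance right after the sequence name
count_table$total <- rowSums(count_table[, c("sample1", "sample2")])
count_table <- count_table[, c("Representative_Sequence", "total",
                               "sample1", "sample2")]

# Write a tab-delimited file (assumption: this is what read_count() expects)
count_path <- tempfile(fileext = ".count_table")
write.table(count_table, count_path, sep = "\t",
            quote = FALSE, row.names = FALSE)
```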

Read distance matrix file

clustur can read both mothur's column (sparse) and Phylip-formatted distance matrices.

column_distance <- read_dist(example_path("amazon_column.dist"), full_count_table, cutoff = 0.03)

phylip_distance <- read_dist(example_path("amazon_phylip.dist"), full_count_table, cutoff = 0.03)

The object returned by read_dist() will print as a memory address. If you want a data frame version of the distances, you can pass that object to get_distance_df().

get_distance_df(column_distance)
get_distance_df(phylip_distance)
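
To make the role of the cutoff argument more concrete, the following base R sketch filters a small three-column distance table by hand. The column names and distances are hypothetical and need not match what get_distance_df() returns. The sketch also assumes that pairs farther apart than the cutoff are the ones discarded, so check the package documentation for the exact rule.

```r
# Hypothetical pairwise distances in column (three-column) form
distances <- data.frame(
  first = c("seq_a", "seq_a", "seq_b"),
  second = c("seq_b", "seq_c", "seq_c"),
  distance = c(0.01, 0.05, 0.02)
)

cutoff <- 0.03

# Assumption: pairs farther apart than the cutoff are dropped
kept <- distances[distances$distance <= cutoff, ]
nrow(kept)
#> [1] 2
```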

Clustering the data

The default method for clustering in cluster() is "opticlust".

cutoff <- 0.03
cluster_data <- cluster(column_distance, cutoff)

Selecting different clustering methods

cluster_data <- cluster(column_distance, cutoff, method = "furthest")
cluster_data <- cluster(column_distance, cutoff, method = "nearest")
cluster_data <- cluster(column_distance, cutoff, method = "average")
cluster_data <- cluster(column_distance, cutoff, method = "weighted")
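
For intuition about how the neighbor methods differ, here is a small sketch that uses base R's hclust() rather than clustur. It assumes the usual correspondence between the method names and linkage rules (furthest as complete linkage, nearest as single linkage, average as UPGMA, weighted as WPGMA). The distances are hypothetical, and this is not how clustur or mothur implement these methods.

```r
# Hypothetical distances among three sequences
seq_names <- c("seq_a", "seq_b", "seq_c")
dist_matrix <- matrix(
  c(0.00, 0.01, 0.05,
    0.01, 0.00, 0.02,
    0.05, 0.02, 0.00),
  nrow = 3,
  dimnames = list(seq_names, seq_names)
)
distances <- as.dist(dist_matrix)
cutoff <- 0.03

# Base R linkage names assumed to correspond to the neighbor methods
linkages <- c(furthest = "complete", nearest = "single",
              average = "average", weighted = "mcquitty")

# Cut each tree at the cutoff and count the resulting clusters
n_clusters <- sapply(linkages, function(linkage) {
  tree <- hclust(distances, method = linkage)
  length(unique(cutree(tree, h = cutoff)))
})
n_clusters
```

In this toy example, nearest neighbor joins all three sequences into one cluster because seq_c is within the cutoff of seq_b, whereas furthest neighbor keeps seq_c apart because it is 0.05 from seq_a.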

Output data from clustering

All methods produce a list object with an indicator of the cutoff that was used (label), as well as cluster composition (cluster) and shared (abundance) data frames. The cluster data frame shows which OTU (Operational Taxonomic Unit) each sequence was assigned to. The abundance data frame contains columns indicating the OTU and sample identifiers and the abundance of each OTU in each sample. The OptiClust method also includes the metrics data frame, which describes the optimization value for each iteration in the fitting process; the data in cluster and abundance are taken from the last iteration. clustur provides getter functions, get_cutoff(), get_bins(), get_abundance(), and get_metrics(), which are demonstrated below.

clusters <- cluster(column_distance, cutoff, method = "opticlust")
get_cutoff(clusters)
get_bins(clusters)
get_abundance(clusters)
get_metrics(clusters)
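
To illustrate how the cluster assignments and the count table relate to the shared (abundance) data, here is a base R sketch that sums sequence counts within each OTU for each sample. The data frames and column names are hypothetical and may not match the layout clustur returns; the sketch only shows the idea.

```r
# Hypothetical cluster assignments: which OTU each sequence belongs to
bins <- data.frame(
  sequence = c("seq_a", "seq_b", "seq_c"),
  otu = c("otu1", "otu1", "otu2")
)

# Hypothetical per-sample counts for the same sequences
counts <- data.frame(
  sequence = c("seq_a", "seq_b", "seq_c"),
  sample1 = c(3, 1, 0),
  sample2 = c(0, 2, 5)
)

# Attach the OTU label to each sequence, then sum counts within each OTU
merged <- merge(bins, counts, by = "sequence")
otu_abundance <- aggregate(cbind(sample1, sample2) ~ otu,
                           data = merged, FUN = sum)
otu_abundance
```

Here otu1 pools the counts of seq_a and seq_b, so its abundance is 4 in sample1 and 2 in sample2.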