cluster: cluster

View source: R/clustering.R

cluster    R Documentation

cluster

Description

Given a sample-by-feature matrix and sample-associated metadata including their biological condition groupings, cluster samples hierarchically and use external cluster validity measures (Adjusted Rand Index, Normalized Mutual Information, and V measure) to assess the agreement between the inferred clusters and the biological conditions. Optionally, produce a heatmap reflecting the hierarchical clustering result.
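The idea described above can be illustrated with base R alone. The following is a minimal sketch, not the package's implementation: the toy matrix, the Euclidean distance, the complete linkage, and the helper adjusted_rand_index() are all assumptions made for illustration. It clusters six samples hierarchically, cuts the tree into two clusters, and scores the agreement with the known conditions using the Adjusted Rand Index.

```r
# Toy sample-by-feature matrix: 6 samples (rows), 2 features (columns).
# Values are made up for illustration only.
ft_mat <- matrix(c(1.0, 1.2, 0.9, 5.0, 5.3, 4.8,
                   2.0, 2.1, 1.9, 8.0, 7.7, 8.2),
                 ncol = 2,
                 dimnames = list(paste0("S", 1:6), c("f1", "f2")))
conditions <- c(rep("Brain", 3), rep("Other", 3))

# Hierarchical clustering, cut into two clusters.
# Distance and linkage are assumptions, not necessarily the package's choices.
tree <- hclust(dist(ft_mat), method = "complete")
clusters <- cutree(tree, k = 2)

# Hypothetical helper: Adjusted Rand Index from the contingency table
# of two labelings (undefined if both labelings are trivial).
adjusted_rand_index <- function(x, y) {
    tab <- table(x, y)
    sum_ij <- sum(choose(tab, 2))           # pairs together in both labelings
    sum_a <- sum(choose(rowSums(tab), 2))   # pairs together in x
    sum_b <- sum(choose(colSums(tab), 2))   # pairs together in y
    expected <- sum_a * sum_b / choose(length(x), 2)
    max_index <- (sum_a + sum_b) / 2
    (sum_ij - expected) / (max_index - expected)
}

ari <- adjusted_rand_index(clusters, conditions)
```

An index of 1 means the inferred clusters reproduce the condition labels exactly, while values near 0 indicate agreement no better than chance.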

Usage

cluster(ft_mat, metadata, query, heatmap = FALSE, title = NULL,
  outdir = NULL, optimal_clusters = TRUE, n_features = FALSE,
  estimate_state = FALSE, method = NULL, test_condition = NULL,
  signal_col = NULL, mark = NULL)

Arguments

ft_mat

Matrix where columns are features and rows are samples, as returned by summarizePeaks or binarizePeaks

metadata

A dataframe with a column "Sample" which stores the sample identifiers, and a column "Condition", which stores the biological condition labels of the samples

query

GRanges object specifying the query region

heatmap

(Optional) Logical value indicating whether to plot the heatmap for hierarchical clustering. Default: FALSE

title

(Optional) If heatmap is TRUE, specify the title of the plot, which will also be used for the output file name in PDF format

outdir

(Optional) String specifying the name of the directory where PDFs of heatmaps should be saved

optimal_clusters

(Optional) Logical value indicating whether to find the optimal clustering solution by choosing the set of clusters which maximizes the Average Silhouette width (TRUE), or to cluster samples into two groups (FALSE). Default: TRUE

n_features

(Optional) Logical value indicating whether to include a column "n_features" in the output storing the number of features in the feature matrix constructed for the region, which may be useful for understanding the behaviour of the binary strategy for constructing feature matrices. Default: FALSE

estimate_state

(Optional) Logical value indicating whether to include a column "state" in the output specifying the estimated chromatin state of a test condition. The state will be one of "ON", "OFF", or NA, where the latter results if a binary switch between the conditions is unclear. Default: FALSE

method

(Optional) If estimate_state is TRUE, one of "summary" or "binary", specifying which method was used to construct the feature matrix in ft_mat

test_condition

(Optional) If estimate_state is TRUE, string specifying one of the two biological conditions in metadata$Condition for which to estimate chromatin state

signal_col

(Optional) If estimate_state is TRUE, and method is "summary", string specifying the name of the column in the original peak files which corresponds to the level of enrichment in the region, e.g. fold change

mark

(Optional) If estimate_state is TRUE, and method is "summary", string specifying the name of the mark for which ft_mat was constructed
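To make the optimal_clusters option more concrete, here is a hedged sketch of selecting the number of clusters by maximizing the average silhouette width. It is not taken from the package source: the toy data, the distance and linkage, and the range of candidate cluster counts are assumptions, and silhouette() comes from the recommended "cluster" package, which is assumed to be installed.

```r
# Toy matrix: three well-separated groups of two samples each (made-up values).
ft_mat <- matrix(c(0.0, 0.2, 5.0, 5.2, 10.0, 10.2,
                   0.0, 0.1, 5.0, 5.1, 10.0, 10.1),
                 ncol = 2,
                 dimnames = list(paste0("S", 1:6), c("f1", "f2")))

d <- dist(ft_mat)
tree <- hclust(d, method = "complete")

# Candidate numbers of clusters: 2 to n - 1 (an assumed range).
candidate_k <- 2:(nrow(ft_mat) - 1)

# Average silhouette width of the partition obtained for each candidate k.
avg_width <- sapply(candidate_k, function(k) {
    sil <- cluster::silhouette(cutree(tree, k = k), d)
    mean(sil[, "sil_width"])
})

# Keep the partition with the largest average silhouette width.
best_k <- candidate_k[which.max(avg_width)]
clusters <- cutree(tree, k = best_k)
```

With these toy values the three-cluster partition has the largest average silhouette width, so best_k is 3. Fixing k = 2 instead corresponds to the behaviour described for optimal_clusters = FALSE.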

Value

A dataframe with the region, the number of clusters inferred, the cluster validity statistics, and the cluster assignments for each sample

Examples

library(chromswitch)
library(GenomicRanges)  # provides GRanges() and attaches IRanges

samples <- c("E068", "E071", "E074", "E101", "E102", "E110")
bedfiles <- system.file("extdata", paste0(samples, ".H3K4me3.bed"),
    package = "chromswitch")
Conditions <- c(rep("Brain", 3), rep("Other", 3))

metadata <- data.frame(Sample = samples,
    H3K4me3 = bedfiles,
    Condition = Conditions,
    stringsAsFactors = FALSE)

region <- GRanges(seqnames = "chr19",
    ranges = IRanges(start = 54924104, end = 54929104))

lpk <- retrievePeaks(H3K4me3,
    metadata = metadata,
    region = region)

ft_mat <- summarizePeaks(lpk, mark = "H3K4me3",
    cols = c("qValue", "signalValue"))

cluster(ft_mat, metadata, region)

# Estimate the state of the test condition, "Brain"
cluster(ft_mat, metadata, region,
    estimate_state = TRUE,
    method = "summary",
    signal_col = "signalValue",
    mark = "H3K4me3",
    test_condition = "Brain")


selinj/chromswitch documentation built on Jan. 27, 2024, 12:36 p.m.