knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.align = "center", out.width = "100%", fig.width = 9, fig.height = 7 )
The NAIR
package includes a set of functions that facilitate searching for public TCR/BCR clusters across multiple samples of Adaptive Immune Receptor Repertoire Sequencing (AIRR-Seq) data.
In this context, a public cluster consists of similar TCR/BCR clones (e.g., those whose CDR3 amino acid sequences differ by at most one amino acid) that are shared across samples (e.g., across individuals or across time points for a single individual).
We simulate some toy data for demonstration.
Our toy data includes 30 samples, each containing 30 observations.
Some sequences are simulated with a tendency to appear in relatively few samples, while others are simulated with a tendency to appear in many samples.
set.seed(42) library(NAIR) data_dir <- tempdir() dir_input_samples <- file.path(data_dir, "input_samples") dir.create(dir_input_samples, showWarnings = FALSE) samples <- 30 sample_size <- 30 # (seqs per sample) base_seqs <- c( "CASSIEGQLSTDTQYF", "CASSEEGQLSTDTQYF", "CASSSVETQYF", "CASSPEGQLSTDTQYF", "RASSLAGNTEAFF", "CASSHRGTDTQYF", "CASDAGVFQPQHF", "CASSLTSGYNEQFF", "CASSETGYNEQFF", "CASSLTGGNEQFF", "CASSYLTGYNEQFF", "CASSLTGNEQFF", "CASSLNGYNEQFF", "CASSFPWDGYGYTF", "CASTLARQGGELFF", "CASTLSRQGGELFF", "CSVELLPTGPLETSYNEQFF", "CSVELLPTGPSETSYNEQFF", "CVELLPTGPSETSYNEQFF", "CASLAGGRTQETQYF", "CASRLAGGRTQETQYF", "CASSLAGGRTETQYF", "CASSLAGGRTQETQYF", "CASSRLAGGRTQETQYF", "CASQYGGGNQPQHF", "CASSLGGGNQPQHF", "CASSNGGGNQPQHF", "CASSYGGGGNQPQHF", "CASSYGGGQPQHF", "CASSYKGGNQPQHF", "CASSYTGGGNQPQHF", "CAWSSQETQYF", "CASSSPETQYF", "CASSGAYEQYF", "CSVDLGKGNNEQFF") # relative generation probabilities pgen <- cbind( stats::toeplitz(0.6^(0:(sample_size - 1))), matrix(1, nrow = samples, ncol = length(base_seqs) - samples)) simulateToyData( samples = samples, sample_size = sample_size, prefix_length = 1, prefix_chars = c("", ""), prefix_probs = cbind(rep(1, samples), rep(0, samples)), affixes = base_seqs, affix_probs = pgen, num_edits = 0, output_dir = dir_input_samples, no_return = TRUE )
Each sample's data frame is saved to its own file using the RDS file format. The files are named "Sample1.rds
", "Sample2.rds
", etc. A character string containing the directory path is assigned to the R environment variable dir_input_samples
for later reference.
The first few rows of the data for the first sample appear as follows:
# View first few rows of data for sample 1 head(readRDS(file.path(dir_input_samples, "Sample1.rds")))
First, we use findPublicClusters()
to search across samples and select clones for inclusion in the global network.
Each sample's repertoire network is constructed individually, and cluster analysis is used to partition each network into clusters. The clusters are then filtered according to node count and clone count based on user-specified criteria. The AIRR-Seq data for the clusters that remain after filtering is saved to files to be used as inputs for step 2.
Below, we explain how to use findPublicClusters()
.
Each sample's AIRR-Seq data must be contained in a separate file, with observations indexed by row, and with the same columns across samples.
The file_list
parameter accepts a character vector containing file paths (or a list containing file paths and connections), where each element corresponds to a file containing a single sample.
# create vector of input file paths for step 1 (one per sample) input_files <- file.path(dir_input_samples, paste0("Sample", 1:samples, ".rds") ) head(input_files)
The file format of the input files is specified by the input_type
parameter. The supported values are "rds"
, "rda"
, "csv"
, "csv2"
, "tsv"
and "table"
. Depending on the input type, further options are specified with data_symbols
or read.args
.
Refer here and to loadDataFromFileList()
for details and examples.
The seq_col
parameter specifies the column containing the TCR/BCR sequences within each sample. It accepts the column name (as a character string) or the column position index.
The optional count_col
parameter specifies the column containing the clone count (clonal abundance) within each sample. It accepts the column name (as a character string) or the column position index. If provided, clone counts will be considered when filtering the clusters.
Each clone's sample ID is included in the output. By default, these are "Sample1"
, "Sample2"
, etc., according to the order in file_list
.
The optional sample_ids
parameter assigns custom sample IDs. It accepts a vector of the same length as file_list
, where each entry is the corresponding sample ID.
The clones from each sample are filtered to remove any irrelevant data. By default, clones with sequences that are less than three characters in length, as well as sequences containing any of the characters *
, _
or |
, will be excluded. The min_seq_length
and drop_matches
parameters control the filter settings. Refer here for details.
The parameters that control the construction of each sample's network are shown below along with their default values.
dist_type = "hamming"
dist_cutoff = 1
drop_isolated_nodes = TRUE
Refer here for their meaning and usage.
By default, clustering within each sample's network is performed using igraph::cluster_fast_greedy()
. A different clustering algorithm can be specified using the cluster_fun
parameter, as described here.
The following parameters control the criteria used to select clusters from each sample.
Within each sample, the $n = 20$ clusters with the greatest node count are automatically selected. The value of $n$ can be adjusted using the top_n_clusters
parameter.
By default, any cluster containing at least ten nodes will be selected This value can be adjusted using the min_node_count
parameter.
By default, any cluster with an aggregate clone count (summed over all nodes) of at least 100 will be selected. This value can be adjusted using the min_clone_count
parameter.
This criterion only applies if clone counts are provided using the count_col
parameter.
findPublicClusters()
does not return any direct output. Instead, data for the selected clusters is saved to files to be used as inputs in step 2. The following parameters control the output settings.
By default, the output includes all variables from the original sample data. These variables can be used later as metadata in visualizations of the global network.
To keep only a subset of the original variables, specify the variables to keep using the subset_cols
parameter, which accepts a character vector of column names or a vector of column indices. The sequence column is always included.
The output_dir
parameter specifies the output directory. It accepts a character string containing the directory path. The directory will be created if it does not exist.
# create output directory path for step 1 dir_filtered_samples <- file.path(data_dir, "filtered_samples")
By default, each file is saved as an RDS file. This can be changed using the output_type
parameter. Other accepted values are "rda"
and "csv"
.
By default, findPublicClusters()
saves data only for the selected clusters from each sample. If desired, data for each sample's entire network can also be saved by passing a directory path to the output_dir_unfiltered
parameter. The full network data for each sample is the output returned by buildNet()
. The output_type_unfiltered
parameter specifies the file format in the same manner described [here]((https://mlizhangx.github.io/Network-Analysis-for-Repertoire-Sequencing-/articles/buildRepSeqNetwork.html#output-file-format) for the output_type
parameter of buildNet()
.
By default, findPublicClusters()
does not produce visual plots. The visualization of interest is of the global network in step 2.
A plot of each sample's full network can be produced using plots = TRUE
. Specifying print_plots = TRUE
prints these to the R plotting window. The plots will be saved if output_dir_unfiltered
is non-null. By default, the nodes in each plot are colored according to cluster membership. A different variable can be specified using the color_nodes_by
parameter as detailed here (or here for multiple variables).
Refer here to learn about other parameters for customizing the visualization.
findPublicClusters(input_files, input_type = "rds", seq_col = "CloneSeq", count_col = "CloneCount", min_seq_length = NULL, drop_matches = NULL, top_n_clusters = 3, min_node_count = 5, min_clone_count = 15000, output_dir = dir_filtered_samples )
Two new directories are created within the specified output directory:
list.files(dir_filtered_samples)
These directories contain cluster-level and node-level metadata, respectively, for the selected clusters from each sample. We require only the node metadata for step 2.
head(list.files(file.path(dir_filtered_samples, "node_meta_data")))
buildPublicClusterNetwork()
combines the selected clusters from all samples into a single global network, where a new round of cluster analysis is performed to partition the global network into clusters.
The input files for buildPublicClusterNetwork()
are the node metadata files from the output of step 1. Each file contains data for one sample.
The file_list
parameter accepts a character vector of file paths for the input files, which are located in the node_meta_data
subdirectory of the output directory from step 1.
# Directory of node metadata from step 1 dir_filtered_samples_node <- file.path(dir_filtered_samples, "node_meta_data") # Vector of file paths to node metadata from step 1 files_filtered_samples_node <- list.files(dir_filtered_samples_node, full.names = TRUE)
If findPublicClusters()
was called with a non-default value of output_type
, this value must be passed to the input_type
parameter of buildPublicClusterNetwork()
.
The seq_col
and count_col
parameters specify the input data columns containing receptor sequences and clone counts, respectively. Users should pass the same argument values to these parameters as they did when calling findPublicClusters()
during step 1.
The parameters that control construction of the global network are shown below along with their default values.
dist_type = "hamming"
dist_cutoff = 1
drop_isolated_nodes = FALSE
Refer here for their meaning and usage.
A clustering algorithm is used to partition the global network graph into densely-connected subgraphs (clusters). Each cluster can contain clones from different samples.
By default, clustering within is performed using igraph::cluster_fast_greedy()
. A different clustering algorithm can be specified using the cluster_fun
parameter, as described here.
By default, buildPublicClusterNetwork()
produces a visual plot of the global network graph with the nodes colored according to sample ID.
The color_nodes_by
parameter specifies the variable used to color the nodes. It accepts a character string naming a variable kept from the original sample data or one of the node-level network properties listed here.
color_nodes_by
also accepts a vector naming multiple variables. One plot will be created for each entry, with the nodes colored according to the respective variable.
Refer here to learn about other parameters for customizing the visualization.
buildPublicClusterNetwork()
returns a list containing plots, metadata and other network objects, with the same structure as the output of buildRepSeqNetwork()
.
The output can be saved to a local directory using the parameters output_dir
, output_type
and output_name
, whose usage is described here.
public_clusters <- buildPublicClusterNetwork(files_filtered_samples_node, seq_col = "CloneSeq", count_col = "CloneCount", size_nodes_by = 1, print_plots = TRUE )
The returned list contains the following elements:
names(public_clusters)
The elements are described here. We inspect the node metadata and cluster metadata.
The list element node_data
is a data frame containing metadata for the network nodes, where each row represents a distinct clone corresponding to a node in the global network graph.
nrow(public_clusters$node_data)
# variables in the node-level metadata names(public_clusters$node_data)
All variables kept from the original sample data during step 1 are present. The variable ClusterIDPublic
contains the global cluster membership, while ClusterIDInSample
contains the in-sample cluster membership. Node-level network properties are also present. Those beginning with SampleLevel
correspond to the sample networks, while those beginning with Public
correspond to the global network.
# View some of the node metadata for the global network view_cols <- c("CloneSeq", "SampleID", "ClusterIDInSample", "ClusterIDPublic") public_clusters$node_data[49:54 , view_cols]
The row names indicate the original row ID of each clone within its sample's data.
The list element cluster_data
is a data frame containing metadata for the public clusters, where each row corresponds to a cluster in the global network.
# variables in the cluster-level metadata names(public_clusters$cluster_data)
Refer here for more information about the cluster-level network properties.
# View some of the cluster metadata for the global network head(public_clusters$cluster_data[, 1:6])
After calling buildPublicClusterNetwork()
, the following tasks can be performed using the returned output.
In order to more easily cross-reference the clusters in the visual plot with the clusters in the data, we can label the clusters with their ID numbers.
This is accomplished using labelClusters()
as described here.
Below, we label the six largest clusters in the plot with their cluster IDs. The node metadata variable ClusterIDPublic
contains the global cluster membership, so we pass its name to the cluster_id_col
parameter.
public_clusters <- labelClusters(public_clusters, top_n_clusters = 6, cluster_id_col = "ClusterIDPublic", size = 7 ) public_clusters$plots[[1]]
To focus on a particular cluster, we can subset the node metadata based on the value of ClusterIDPublic
and use buildNet()
to produce plots of the cluster's graph.
# focus on cluster 1 buildNet( public_clusters$node_data[public_clusters$node_data$ClusterIDPublic == 1, ], "CloneSeq", color_nodes_by = "CloneSeq", size_nodes_by = 3, output_name = "Cluster 1", print_plots = TRUE )
# focus on cluster 6 buildNet( public_clusters$node_data[public_clusters$node_data$ClusterIDPublic == 6, ], "CloneSeq", color_nodes_by = "CloneSeq", color_scheme = "plasma", size_nodes_by = 4, output_name = "Cluster 6", print_plots = TRUE )
# clean up temp directory file.remove( file.path(dir_input_samples, paste0("Sample", 1:samples, ".rds") ) ) unlink( c(dir_filtered_samples_node, file.path(dir_filtered_samples, "cluster_meta_data") ), recursive = TRUE )
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.