View source: R/public_clusters.R
findPublicClusters | R Documentation |
Part of the workflow Searching for Public TCR/BCR Clusters.
Given multiple samples of bulk Adaptive Immune Receptor Repertoire Sequencing (AIRR-Seq) data, construct the repertoire network for each sample. Within each sample's network, perform cluster analysis and filter the clusters based on node count and aggregate clone count.
findPublicClusters(
  ## Input ##
  file_list,
  input_type,
  data_symbols = NULL,
  header, sep, read.args,
  sample_ids = paste0("Sample", 1:length(file_list)),
  seq_col,
  count_col = NULL,

  ## Search Criteria ##
  min_seq_length = 3,
  drop_matches = "[*|_]",
  top_n_clusters = 20,
  min_node_count = 10,
  min_clone_count = 100,

  ## Optional Visualization ##
  plots = FALSE,
  print_plots = FALSE,
  plot_title = "auto",
  color_nodes_by = "cluster_id",

  ## Output ##
  output_dir,
  output_type = "rds",

  ## Optional Output ##
  output_dir_unfiltered = NULL,
  output_type_unfiltered = "rds",

  verbose = FALSE,
  ...
)
file_list |
A character vector of file paths, or a list containing
|
input_type |
A character string specifying the file format of the sample data files. Options
are |
data_symbols |
Used when |
header |
For values of |
sep |
For values of |
read.args |
For values of |
sample_ids |
A character or numeric vector of sample IDs, whose length matches that of
|
seq_col |
Specifies the column of each sample's data frame containing the TCR/BCR sequences. Accepts a character string containing the column name or a numeric scalar containing the column index. |
count_col |
Specifies the column of each sample's data frame containing the clone count
(measure of clonal abundance).
Accepts a character string containing the column name
or a numeric scalar containing the column index.
If |
min_seq_length |
Passed to |
drop_matches |
Passed to |
top_n_clusters |
The number of clusters from each sample to be automatically included among the filtered clusters, based on greatest node count. |
min_node_count |
Clusters with at least this many nodes will be included among the filtered clusters. |
min_clone_count |
Clusters with an aggregate clone count of at least this value will be included
among the filtered clusters. A value of |
plots |
Passed to |
print_plots |
Passed to |
plot_title |
Passed to |
color_nodes_by |
Passed to |
output_dir |
The file path of the directory for saving the output. The directory will be created if it does not already exist. |
output_type |
A character string specifying the file format to use for saving the output.
Valid options include |
output_dir_unfiltered |
An optional directory for saving the unfiltered network data for each sample. By default, only the filtered results are saved. |
output_type_unfiltered |
A character string specifying the file format to use for saving the unfiltered
network data for each sample. Only applicable if |
verbose |
Logical. If |
... |
Other arguments to |
Each sample's network is constructed using an individual call to buildNet() with node_stats = TRUE, stats_to_include = "all", cluster_stats = TRUE and cluster_id_name = "ClusterIDInSample".
The node-level properties are renamed to reflect their
correspondence to the sample-level network. Specifically, the properties are named:
SampleLevelNetworkDegree
SampleLevelTransitivity
SampleLevelCloseness
SampleLevelCentralityByCloseness
SampleLevelCentralityByEigen
SampleLevelEigenCentrality
SampleLevelBetweenness
SampleLevelCentralityByBetweenness
SampleLevelAuthorityScore
SampleLevelCoreness
SampleLevelPageRank
A variable SampleID is added to both the node-level and cluster-level metadata for each sample.
After the clusters in each sample are filtered, the node-level and cluster-level metadata are saved in the respective subdirectories node_meta_data and cluster_meta_data of the output directory specified by output_dir.
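The filtering step can be pictured as a union of the three search criteria: a cluster is kept if it is among the top_n_clusters largest clusters by node count, or has at least min_node_count nodes, or has an aggregate clone count of at least min_clone_count. The following is a minimal base-R sketch of that logic and not the package's internal code. The data frame, its column names (cluster_id, node_count, agg_count), its values and the tie-breaking rule are all assumptions made for illustration.

```r
# Hypothetical cluster-level summary for one sample (values are made up)
clusters <- data.frame(
  cluster_id = 1:6,
  node_count = c(25, 12, 8, 4, 3, 2),
  agg_count  = c(900, 150, 40, 500, 20, 10)
)

# Thresholds named after the corresponding findPublicClusters() arguments
top_n_clusters  <- 3
min_node_count  <- 10
min_clone_count <- 100

# Rank clusters by node count, largest first
# (breaking ties by row order is an assumption of this sketch)
rank_by_nodes <- rank(-clusters$node_count, ties.method = "first")

# A cluster is kept if it satisfies at least one of the three criteria
keep <- rank_by_nodes <= top_n_clusters |
  clusters$node_count >= min_node_count |
  clusters$agg_count >= min_clone_count

filtered <- clusters[keep, ]
filtered$cluster_id
```

In this toy input, cluster 3 is kept only because of its rank by node count, and cluster 4 only because of its aggregate clone count.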
The unfiltered network results for each sample can also be saved by supplying a directory to output_dir_unfiltered, if these results are desired for downstream analysis. Each sample's unfiltered network results will then be saved to its own subdirectory created within this directory.
The files containing the node-level metadata for the filtered clusters can be supplied to buildPublicClusterNetwork() in order to construct a global network of public clusters. If the full global network is too large to practically construct, the files containing the cluster-level metadata for the filtered clusters can be supplied to buildPublicClusterNetworkByRepresentative() to build a global network using only a single representative sequence from each cluster. This allows prominent public clusters to still be identified.
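As a rough sketch of that hand-off, the saved files can be collected from the two subdirectories with list.files() and passed on. The value of output_dir, the column name "CloneSeq" and the exact arguments shown for buildPublicClusterNetwork() are assumptions here; consult that function's help page for its actual signature. The call is guarded so that the snippet does nothing when the package or the files are absent.

```r
# Directory that was passed as output_dir to findPublicClusters()
# (tempdir() is only an assumption for this sketch)
output_dir <- tempdir()

# Collect the per-sample files from the two output subdirectories
node_files <- list.files(
  file.path(output_dir, "node_meta_data"),
  full.names = TRUE
)
cluster_files <- list.files(
  file.path(output_dir, "cluster_meta_data"),
  full.names = TRUE
)

# Build the global network from the node-level files, if possible
if (requireNamespace("NAIR", quietly = TRUE) && length(node_files) > 0) {
  public_net <- NAIR::buildPublicClusterNetwork(
    file_list = node_files,
    seq_col = "CloneSeq"  # assumed name of the sequence column
  )
}
```

The vector cluster_files would play the same role for buildPublicClusterNetworkByRepresentative() when the full network is too large to build.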
See the Searching for Public TCR/BCR Clusters article on the package website.
Returns TRUE, invisibly.
Brian Neal (Brian.Neal@ucsf.edu)
Hai Yang, Jason Cham, Brian Neal, Zenghua Fan, Tao He and Li Zhang. (2023). NAIR: Network Analysis of Immune Repertoire. Frontiers in Immunology, vol. 14. doi: 10.3389/fimmu.2023.1181825
Searching for Public TCR/BCR Clusters vignette
buildPublicClusterNetwork()
buildPublicClusterNetworkByRepresentative()
set.seed(42)

## Simulate 30 samples with a mix of public/private sequences ##
samples <- 30
sample_size <- 30 # (seqs per sample)
base_seqs <- c(
  "CASSIEGQLSTDTQYF", "CASSEEGQLSTDTQYF", "CASSSVETQYF",
  "CASSPEGQLSTDTQYF", "RASSLAGNTEAFF", "CASSHRGTDTQYF", "CASDAGVFQPQHF",
  "CASSLTSGYNEQFF", "CASSETGYNEQFF", "CASSLTGGNEQFF", "CASSYLTGYNEQFF",
  "CASSLTGNEQFF", "CASSLNGYNEQFF", "CASSFPWDGYGYTF", "CASTLARQGGELFF",
  "CASTLSRQGGELFF", "CSVELLPTGPLETSYNEQFF", "CSVELLPTGPSETSYNEQFF",
  "CVELLPTGPSETSYNEQFF", "CASLAGGRTQETQYF", "CASRLAGGRTQETQYF",
  "CASSLAGGRTETQYF", "CASSLAGGRTQETQYF", "CASSRLAGGRTQETQYF",
  "CASQYGGGNQPQHF", "CASSLGGGNQPQHF", "CASSNGGGNQPQHF", "CASSYGGGGNQPQHF",
  "CASSYGGGQPQHF", "CASSYKGGNQPQHF", "CASSYTGGGNQPQHF",
  "CAWSSQETQYF", "CASSSPETQYF", "CASSGAYEQYF", "CSVDLGKGNNEQFF"
)

# Relative generation probabilities
pgen <- cbind(
  stats::toeplitz(0.6^(0:(sample_size - 1))),
  matrix(1, nrow = samples, ncol = length(base_seqs) - samples)
)

simulateToyData(
  samples = samples,
  sample_size = sample_size,
  prefix_length = 1,
  prefix_chars = c("", ""),
  prefix_probs = cbind(rep(1, samples), rep(0, samples)),
  affixes = base_seqs,
  affix_probs = pgen,
  num_edits = 0,
  output_dir = tempdir(),
  no_return = TRUE
)

sample_files <-
  file.path(tempdir(),
            paste0("Sample", 1:samples, ".rds")
  )

findPublicClusters(
  file_list = sample_files,
  input_type = "rds",
  seq_col = "CloneSeq",
  count_col = "CloneCount",
  min_seq_length = NULL,
  drop_matches = NULL,
  top_n_clusters = 3,
  min_node_count = 5,
  min_clone_count = 15000,
  output_dir = tempdir()
)