findPublicClusters: Find Public Clusters Among RepSeq Samples

View source: R/public_clusters.R

findPublicClustersR Documentation

Find Public Clusters Among RepSeq Samples

Description

Part of the workflow Searching for Public TCR/BCR Clusters.

Given multiple samples of bulk Adaptive Immune Receptor Repertoire Sequencing (AIRR-Seq) data, construct the repertoire network for each sample. Within each sample's network, perform cluster analysis and filter the clusters based on node count and aggregate clone count.

Usage

findPublicClusters(

  ## Input ##
  file_list,
  input_type,
  data_symbols = NULL,
  header, sep, read.args,
  sample_ids =
    paste0("Sample", 1:length(file_list)),
  seq_col,
  count_col = NULL,

  ## Search Criteria ##
  min_seq_length = 3,
  drop_matches = "[*|_]",
  top_n_clusters = 20,
  min_node_count = 10,
  min_clone_count = 100,

  ## Optional Visualization ##
  plots = FALSE,
  print_plots = FALSE,
  plot_title = "auto",
  color_nodes_by = "cluster_id",

  ## Output ##
  output_dir,
  output_type = "rds",

  ## Optional Output ##
  output_dir_unfiltered = NULL,
  output_type_unfiltered = "rds",

  verbose = FALSE,

  ...

)

Arguments

file_list

A character vector of file paths, or a list containing connections and file paths. Each element corresponds to a single file containing the data for a single sample. Passed to loadDataFromFileList().

input_type

A character string specifying the file format of the sample data files. Options are "table", "txt", "tsv", "csv", "rds" and "rda". Passed to loadDataFromFileList().

data_symbols

Used when input_type = "rda". Specifies the name of each sample's data frame within its respective Rdata file. Passed to loadDataFromFileList().

header

For values of input_type other than "rds" and "rda", this argument can be used to specify a non-default value of the header argument to read.table(), read.csv(), etc.

sep

For values of input_type other than "rds" and "rda", this argument can be used to specify a non-default value of the sep argument to read.table(), read.csv(), etc.

read.args

For values of input_type other than "rds" and "rda", this argument can be used to specify non-default values of optional arguments to read.table(), read.csv(), etc. Accepts a named list of argument values. Values of header and sep in this list take precedence over values specified via the header and sep arguments.

sample_ids

A character or numeric vector of sample IDs, whose length matches that of file_list. The values should be valid for use as filenames and should avoid using the forward slash or backslash characters (/ or \).

seq_col

Specifies the column of each sample's data frame containing the TCR/BCR sequences. Accepts a character string containing the column name or a numeric scalar containing the column index.

count_col

Specifies the column of each sample's data frame containing the clone count (measure of clonal abundance). Accepts a character string containing the column name or a numeric scalar containing the column index. If NULL, the clusters in each sample's network will be selected solely based upon node count.

min_seq_length

Passed to buildRepSeqNetwork() when constructing the network for each sample.

drop_matches

Passed to buildRepSeqNetwork() when constructing the network for each sample. Accepts a character string containing a regular expression (see regex). Checks TCR/BCR sequences for a pattern match using grep(). Those returning a match are dropped. By default, sequences containing any of the characters *, | or _ are dropped.

top_n_clusters

The number of clusters from each sample to be automatically be included among the filtered clusters, based on greatest node count.

min_node_count

Clusters with at least this many nodes will be included among the filtered clusters.

min_clone_count

Clusters with an aggregate clone count of at least this value will be included among the filtered clusters. A value of NULL ignores this criterion and does not select additional clusters based on clone count.

plots

Passed to buildRepSeqNetwork() when constructing the network for each sample.

print_plots

Passed to buildRepSeqNetwork() when constructing the network for each sample.

plot_title

Passed to buildRepSeqNetwork() when constructing the network for each sample.

color_nodes_by

Passed to buildRepSeqNetwork() when constructing the network for each sample.

output_dir

The file path of the directory for saving the output. The directory will be created if it does not already exist.

output_type

A character string specifying the file format to use for saving the output. Valid options include "csv", "rds" and "rda".

output_dir_unfiltered

An optional directory for saving the unfiltered network data for each sample. By default, only the filtered results are saved.

output_type_unfiltered

A character string specifying the file format to use for saving the unfiltered network data for each sample. Only applicable if output_dir_unfiltered is non-null. Passed to buildRepSeqNetwork() when constructing the network for each sample.

verbose

Logical. If TRUE, generates messages about the tasks performed and their progress, as well as relevant properties of intermediate outputs. Messages are sent to stderr().

...

Other arguments to buildRepSeqNetwork when constructing the network for each sample, not including node_stats, stats_to_include, cluster_stats, cluster_id_name or output_name (see details).

Details

Each sample's network is constructed using an individual call to buildNet() with node_stats = TRUE, stats_to_include = "all", cluster_stats = TRUE and cluster_id_name = "ClusterIDInSample". The node-level properties are renamed to reflect their correspondence to the sample-level network. Specifically, the properties are named:

  • SampleLevelNetworkDegree

  • SampleLevelTransitivity

  • SampleLevelCloseness

  • SampleLevelCentralityByCloseness

  • SampleLevelCentralityByEigen

  • SampleLevelEigenCentrality

  • SampleLevelBetweenness

  • SampleLevelCentralityByBetweenness

  • SampleLevelAuthorityScore

  • SampleLevelCoreness

  • SampleLevelPageRank

A variable SampleID is added to both the node-level and cluster-level meta data for each sample.

After the clusters in each sample are filtered, the node-level and cluster-level metadata are saved in the respective subdirectories node_meta_data and cluster_meta_data of the output directory specified by output_dir.

The unfiltered network results for each sample can also be saved by supplying a directory to output_dir_unfiltered, if these results are desired for downstream analysis. Each sample's unfiltered network results will then be saved to its own subdirectory created within this directory.

The files containing the node-level metadata for the filtered clusters can be supplied to buildPublicClusterNetwork() in order to construct a global network of public clusters. If the full global network is too large to practically construct, the files containing the cluster-level meta data for the filtered clusters can be supplied to buildPublicClusterNetworkByRepresentative() to build a global network using only a single representative sequence from each cluster. This allows prominent public clusters to still be identified.

See the Searching for Public TCR/BCR Clusters article on the package website.

Value

Returns TRUE, invisibly.

Author(s)

Brian Neal (Brian.Neal@ucsf.edu)

References

Hai Yang, Jason Cham, Brian Neal, Zenghua Fan, Tao He and Li Zhang. (2023). NAIR: Network Analysis of Immune Repertoire. Frontiers in Immunology, vol. 14. doi: 10.3389/fimmu.2023.1181825

Webpage for the NAIR package

Searching for Public TCR/BCR Clusters vignette

See Also

buildPublicClusterNetwork()

buildPublicClusterNetworkByRepresentative()

Examples

set.seed(42)

## Simulate 30 samples with a mix of public/private sequences ##
samples <- 30
sample_size <- 30 # (seqs per sample)
base_seqs <- c(
  "CASSIEGQLSTDTQYF", "CASSEEGQLSTDTQYF", "CASSSVETQYF",
  "CASSPEGQLSTDTQYF", "RASSLAGNTEAFF", "CASSHRGTDTQYF", "CASDAGVFQPQHF",
  "CASSLTSGYNEQFF", "CASSETGYNEQFF", "CASSLTGGNEQFF", "CASSYLTGYNEQFF",
  "CASSLTGNEQFF", "CASSLNGYNEQFF", "CASSFPWDGYGYTF", "CASTLARQGGELFF",
  "CASTLSRQGGELFF", "CSVELLPTGPLETSYNEQFF", "CSVELLPTGPSETSYNEQFF",
  "CVELLPTGPSETSYNEQFF", "CASLAGGRTQETQYF", "CASRLAGGRTQETQYF",
  "CASSLAGGRTETQYF", "CASSLAGGRTQETQYF", "CASSRLAGGRTQETQYF",
  "CASQYGGGNQPQHF", "CASSLGGGNQPQHF", "CASSNGGGNQPQHF", "CASSYGGGGNQPQHF",
  "CASSYGGGQPQHF", "CASSYKGGNQPQHF", "CASSYTGGGNQPQHF",
  "CAWSSQETQYF", "CASSSPETQYF", "CASSGAYEQYF", "CSVDLGKGNNEQFF")
# Relative generation probabilities
pgen <- cbind(
  stats::toeplitz(0.6^(0:(sample_size - 1))),
  matrix(1, nrow = samples, ncol = length(base_seqs) - samples)
)
simulateToyData(
  samples = samples,
  sample_size = sample_size,
  prefix_length = 1,
  prefix_chars = c("", ""),
  prefix_probs = cbind(rep(1, samples), rep(0, samples)),
  affixes = base_seqs,
  affix_probs = pgen,
  num_edits = 0,
  output_dir = tempdir(),
  no_return = TRUE
)

sample_files <-
  file.path(tempdir(),
            paste0("Sample", 1:samples, ".rds")
  )
findPublicClusters(
  file_list = sample_files,
  input_type = "rds",
  seq_col = "CloneSeq",
  count_col = "CloneCount",
  min_seq_length = NULL,
  drop_matches = NULL,
  top_n_clusters = 3,
  min_node_count = 5,
  min_clone_count = 15000,
  output_dir = tempdir()
)




mlizhangx/Network-Analysis-for-Repertoire-Sequencing- documentation built on April 7, 2024, 12:02 p.m.