sumrep: Summary statistics for B cell receptor (BCR) repertoires

The following table details the expected columns in an annotations data.table. Note that

| Name | Type | Description |
|------|------|-------------|
| sequence_alignment | string | Aligned portion of query sequence. By default constrained to variable region, but not required. Synonymous with "mature" sequence in sumrep. |
| germline_alignment | string | Assembled, aligned, full-length inferred germline sequence spanning the same region as the sequence_alignment field. Synonymous with "naive" sequence in sumrep. |
| v_call | string | V gene with or without allele. For example, IGHV4-59*01. |
| d_call | string | D gene with or without allele. For example, IGHD3-10*01. |
| j_call | string | J gene with or without allele. For example, IGHJ4*02. |
| junction | string | Junction region nucleotide sequence, where the junction is defined as the CDR3 plus the two flanking conserved codons. |
| junction_aa | string | Junction region amino acid sequence. |
| vj_in_frame | boolean | True if the V and J segment alignments are in-frame. |
| v_3p_del | integer | Number of nucleotides in the V 3' deletion. |
| d_5p_del | integer | Number of nucleotides in the D 5' deletion. |
| d_3p_del | integer | Number of nucleotides in the D 3' deletion. |
| j_5p_del | integer | Number of nucleotides in the J 5' deletion. |
| vd_insertion | string | Sequence of the insertion between the V and D segments (for heavy/beta chains). |
| dj_insertion | string | Sequence of the insertion between the D and J segments (for heavy/beta chains). |
| vj_insertion | string | Sequence of the insertion between the V and J segments (for light/alpha chains). |
| np1_length | integer | Number of nucleotides between the V and D segments or V and J segments. |
| np2_length | integer | Number of nucleotides between the D and J segments (for heavy/beta chains). |
| clone_id | integer | Clonal family cluster assignment for the query sequence. |

Most of these names and definitions come directly from the AIRR standard, with some exceptions and modifications. Not every column is strictly required for sumrep to work (e.g., TCR datasets do not need a clone_id), but you will only be able to use functions for which the required columns are present. See the man pages of specific functions for more details.
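Because each function depends on specific columns, it can help to check up front which summaries a dataset supports. The following Python sketch is illustrative only: sumrep itself is an R package and this is not part of its API. The `DEFAULT_COLUMNS` mapping is a hand-copied subset of the default columns listed in the function table in this README, and `usable_functions` is a hypothetical helper.

```python
# Illustrative sketch, not sumrep code: which summaries can run given
# the columns present in a dataset?

# Hand-copied subset of the default columns listed in this README.
DEFAULT_COLUMNS = {
    "getPairwiseDistanceDistribution": {"sequence_alignment"},
    "getDistanceFromGermlineToSequenceDistribution": {
        "sequence_alignment",
        "germline_alignment",
    },
    "getKideraFactorDistributions": {"junction_aa"},
    "getInFramePercentage": {"vj_in_frame"},
    "getClusterSizeDistribution": {"clone_id"},
}


def usable_functions(available_columns):
    """Return the functions whose default columns are all present."""
    available = set(available_columns)
    return sorted(
        name
        for name, needed in DEFAULT_COLUMNS.items()
        if needed <= available
    )


# A toy TCR-like dataset without clone_id or germline_alignment.
tcr_columns = ["sequence_alignment", "junction_aa", "vj_in_frame"]
print(usable_functions(tcr_columns))
```

In this toy case the clone_id and germline_alignment summaries are filtered out, while the three functions that rely only on the available columns remain.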

The following table details the available distribution retrieval functions in sumrep:

| sumrep function | Summary statistic | Default column(s) | Packages used | Comments |
|-----------------|-------------------|-------------------|---------------|----------|
| getPairwiseDistanceDistribution | Vector of Levenshtein distances of each sequence to each other sequence | sequence_alignment | stringdist | |
| getNearestNeighborDistribution | Vector of nearest neighbor (NN) distances, where the NN distance of a sequence is the minimum Levenshtein distance to any other sequence | sequence_alignment | stringdist | The parameter k can be specified to yield the kth nearest neighbor distribution. k = 1 by default. The approximate NN distribution can only be computed when k = 1. |
| getGCContentDistribution | Vector of sequence-wise GC contents | sequence_alignment | ape | |
| getHotspotCountDistribution | Vector of sequence-wise hotspot counts | sequence_alignment | Biostrings | "WRC" and "WA" are the default hotspot motifs |
| getColdspotCountDistribution | Vector of sequence-wise coldspot counts | sequence_alignment | Biostrings | "SYC" is the default coldspot motif |
| getCDR3LengthDistribution | Vector of CDR3 lengths, including conserved CDR3 anchors | junction_aa, junction, or junction_length | | |
| getCDR3PairwiseDistanceDistribution | Vector of pairwise Levenshtein distances of CDR3 sequences | junction_aa | | |
| getAtchleyFactorDistributions | Vector of each of the five Atchley factors | junction_aa | HDMD | |
| getKideraFactorDistributions | Vector of each of the ten Kidera factors | junction_aa | Peptides | |
| getAliphaticIndexDistribution | Vector of sequence-wise aliphatic indices | junction_aa | Peptides | |
| getGRAVYDistribution | Vector of GRAVY indices | junction_aa | alakazam | |
| getPolarityDistribution | Vector of sequence-wise polarity values | junction_aa | alakazam | |
| getChargeDistribution | Vector of sequence-wise charge values | junction_aa | alakazam | |
| getBasicityDistribution | Vector of sequence-wise basicity values | junction_aa | alakazam | |
| getAcidityDistribution | Vector of sequence-wise acidity values | junction_aa | alakazam | |
| getAromaticityDistribution | Vector of sequence-wise aromaticity values | junction_aa | alakazam | |
| getBulkinessDistribution | Vector of sequence-wise bulkiness values | junction_aa | alakazam | |
| getPerGeneMutationRates | List of mutation rates of each observed germline gene | N/A | | |
| getPerGenePerPositionMutationRates | List of mutation rate vectors over each position, over each observed germline gene | N/A | | |
| getSubstitutionModel | Inferred substitution matrix for somatically hypermutated sequences | sequence_alignment, germline_alignment, v_call | shazam | |
| getMutabilityModel | Inferred mutability matrix for somatically hypermutated sequences | sequence_alignment, germline_alignment, v_call | shazam | |
| getPositionalDistanceBetweenMutationsDistribution | Vector of positional distances between mutations over all sequences | sequence_alignment, germline_alignment | | Defined only for sequence reads with two or more mutations from the inferred germline ancestor |
| getDistanceFromGermlineToSequenceDistribution | Vector of Levenshtein distances from germline_alignment to sequence_alignment | sequence_alignment, germline_alignment | stringdist | |
| getVGene3PrimeDeletionLengthDistribution | Vector of V 3' deletion lengths | v_3p_del | | |
| getVGene5PrimeDeletionLengthDistribution | Vector of V 5' deletion lengths | v_5p_del | | |
| getDGene3PrimeDeletionLengthDistribution | Vector of D 3' deletion lengths | d_3p_del | | |
| getDGene5PrimeDeletionLengthDistribution | Vector of D 5' deletion lengths | d_5p_del | | |
| getJGene3PrimeDeletionLengthDistribution | Vector of J 3' deletion lengths | j_3p_del | | |
| getJGene5PrimeDeletionLengthDistribution | Vector of J 5' deletion lengths | j_5p_del | | |
| getVDInsertionLengthDistribution | Vector of VD insertion lengths | np1_length | | |
| getDJInsertionLengthDistribution | Vector of DJ insertion lengths | np2_length | | |
| getVJInsertionLengthDistribution | Vector of VJ insertion lengths | np1_length | | |
| getVDInsertionMatrix | Empirical transition matrix for VD insertions | vd_insertion | | |
| getDJInsertionMatrix | Empirical transition matrix for DJ insertions | dj_insertion | | |
| getVJInsertionMatrix | Empirical transition matrix for VJ insertions | vj_insertion | | |
| getInFramePercentage | Percentage of sequences whose V and J regions are in-frame | vj_in_frame | | |
| getClusterSizeDistribution | Vector of clonal family cluster sizes | clone_id | | |
| getHillNumbers | Vector of Hill numbers of the supplied diversity orders of clonal family clusters | clone_id | alakazam | |
| getSelectionEstimate | Vector of estimated selection strengths of clonal family clusters | sequence_alignment, germline_alignment | shazam | This method uses shazam::calcBaseline to compute the BASELINe posterior density for estimating selection |
| getSackinIndex | Scalar Sackin balance index of a tree | N/A | CollessLike | |
| getCollessLikeIndex | Scalar Colless-like balance index of a tree | N/A | CollessLike | |
| getCopheneticIndex | Scalar Cophenetic index/correlation of a tree | N/A | CollessLike | |
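To make the first two rows of the table concrete, here is a minimal Python sketch of the pairwise Levenshtein distance and kth nearest neighbor statistics on a toy set of sequences. It only illustrates the definitions: sumrep is an R package that delegates these computations to stringdist, and the function names below (`levenshtein`, `pairwise_distances`, `nearest_neighbor_distances`) are hypothetical, not sumrep's API.

```python
# Illustrative sketch of the definitions only, not sumrep code.


def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def pairwise_distances(sequences):
    """Distances over all unordered pairs of sequences."""
    return [
        levenshtein(sequences[i], sequences[j])
        for i in range(len(sequences))
        for j in range(i + 1, len(sequences))
    ]


def nearest_neighbor_distances(sequences, k=1):
    """For each sequence, the k-th smallest distance to the other sequences."""
    result = []
    for i, seq in enumerate(sequences):
        others = sorted(
            levenshtein(seq, other)
            for j, other in enumerate(sequences)
            if j != i
        )
        result.append(others[k - 1])
    return result


# Toy sequences, invented for illustration.
toy_sequences = ["ACGT", "ACGA", "TCGA", "GGGG"]
print(pairwise_distances(toy_sequences))
print(nearest_neighbor_distances(toy_sequences))
print(nearest_neighbor_distances(toy_sequences, k=2))
```

This exact approach compares every pair of sequences, so its cost grows quadratically with repertoire size.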

In general, these summaries are grouped into hierarchical levels, whose assumptions are described in the following table.

| Level | Assumptions | Main expected column(s) |
| ----- | ----------- | ----------------------- |
| 0 | None | sequence |
| 1 | Pairwise alignment | sequence_alignment |
| 2 | Annotations | germline_alignment, junction_aa |
| 3 | Clonal family clustering | clone_id |
| 4 | Phylogenies (BCR only) | N/A |

The various indel statistics tabulated above depend on the locus (e.g., whether or not D gene statistics are relevant) and fall into the Level 2 category. These levels are hierarchical in that each level depends on the assumptions of the previous level. For example, to obtain annotations, the sequences must first be pairwise aligned. Level 0 includes raw query sequences and, while supported, is not the default level for any sumrep function. Level 4 includes tree statistics for clonal family trees of BCR sequences; while sumrep currently contains some tree functions, these functions have not yet been tested on experimental data or incorporated in comparison routines.
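The hierarchy can be read as a cumulative check on the main expected columns. The Python sketch below is one illustrative interpretation, not sumrep code: `highest_supported_level` is a hypothetical helper, Level 4 is omitted because it has no expected columns, and treating Level 0 as independent of the higher levels (a dataset may carry sequence_alignment without a raw sequence column) is an assumption of this sketch.

```python
# Illustrative sketch, not sumrep code.

# Main expected columns for Levels 1 to 3, copied from the level table.
ANNOTATION_LEVELS = [
    (1, {"sequence_alignment"}),
    (2, {"germline_alignment", "junction_aa"}),
    (3, {"clone_id"}),
]


def highest_supported_level(available_columns):
    """Return the highest level whose columns are present, walking upward.

    Returns 0 if only a raw `sequence` column is present, and None if
    no level is supported at all.
    """
    available = set(available_columns)
    level = 0 if "sequence" in available else None
    for candidate, needed in ANNOTATION_LEVELS:
        if not needed <= available:
            break  # hierarchical: stop at the first unmet level
        level = candidate
    return level


print(highest_supported_level(
    ["sequence_alignment", "germline_alignment", "junction_aa"]
))
```

Because the walk stops at the first unmet level, a dataset with sequence_alignment and clone_id but no annotation columns is reported as Level 1 rather than Level 3.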

Gene usage summaries and comparisons

Comparisons involving gene usage distributions, in particular compareVGeneDistributions, compareDGeneDistributions, compareJGeneDistributions, and compareVDJDistributions, do not have corresponding getter functions in the form of getXDistribution. This is due to the logic behind computing these divergences, which relies on contingency (frequency) tables rather than distribution vectors like the other summaries. The following table details the analogous getter functions driving these comparison functions.

| sumrep comparison function | sumrep getter function | Default column(s) |
| -------------------------- | ---------------------- | ----------------- |
| compareVGeneDistributions | compareGermlineGeneDistributions | v_call |
| compareDGeneDistributions | compareGermlineGeneDistributions | d_call |
| compareJGeneDistributions | compareGermlineGeneDistributions | j_call |
| compareVDJDistributions | compareJointGeneDistributions | v_call, d_call, j_call |
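To illustrate why gene usage comparisons work from frequency tables rather than distribution vectors, the following Python sketch tabulates gene calls for two toy repertoires and compares the resulting relative frequencies. It is a sketch under stated assumptions, not sumrep's method: the summed absolute difference of frequencies is used here only as a simple stand-in divergence, the helper names are hypothetical, and the gene calls are toy data.

```python
# Illustrative sketch, not sumrep code.
from collections import Counter


def gene_usage_table(calls):
    """Contingency (frequency) table mapping each gene call to its count."""
    return Counter(calls)


def usage_divergence(calls_a, calls_b):
    """Sum of absolute differences in relative gene usage.

    A stand-in divergence chosen for simplicity; assumes both inputs
    are non-empty.
    """
    table_a = gene_usage_table(calls_a)
    table_b = gene_usage_table(calls_b)
    total_a = sum(table_a.values())
    total_b = sum(table_b.values())
    genes = set(table_a) | set(table_b)
    # Counter returns 0 for genes absent from one repertoire.
    return sum(
        abs(table_a[gene] / total_a - table_b[gene] / total_b)
        for gene in genes
    )


# Toy v_call columns for two repertoires.
v_calls_a = ["IGHV4-59*01", "IGHV4-59*01", "IGHV1-2*02", "IGHV3-23*01"]
v_calls_b = ["IGHV4-59*01", "IGHV3-23*01", "IGHV3-23*01", "IGHV3-23*01"]

print(gene_usage_table(v_calls_a))
print(usage_divergence(v_calls_a, v_calls_b))
```

The same tabulation extends to joint usage by counting (v_call, d_call, j_call) tuples instead of single gene names.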