runDaAnalysis: This function performs differential abundance analysis...

View source: R/RepDaAnalysisFns.R

runDaAnalysis    R Documentation

This function performs differential abundance analysis between groups of TCRB CDR3 samples (repseq data) to identify differentially abundant (DA) CDR3s.

Description

Performs clustering-based differential abundance analysis of CDR3 sequences in two sample groups, using a repeat resampling strategy. It first clusters the sequences within each sample using unsupervised clustering on subsequence (k-mer) frequencies, then matches each cluster to its closest counterparts across samples, and finally tests for differential abundance at the level of the matched clusters to identify differentially abundant, condition-associated CDR3 sequences.

Usage

runDaAnalysis(repSeqObj, clusterby = "NT", kmerWidth = 4, paired = T,
  clusterDaPcutoff = 0.1, positionWt = F, distMethod = c("euclidean",
  "cosine"), useDynamicTreeCut = T, matchingMethod = "km",
  repeatResample = T, nRepeats = 10, resampleSize = 5000,
  useProb = T, returnAll = T, nRR = 1000)

Arguments

repSeqObj

is an object containing all repertoire sample data

clusterby

character string; subsequence type to consider, either "NT" (nucleotide) or "AA" (amino acid). Default is "NT".

kmerWidth

subsequence (k-mer) width to use; default is 4 for NT and 3 for AA clustering (see clusterby).
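As a rough illustration of what a subsequence of width kmerWidth means, the sketch below counts overlapping k-mers in a single sequence using base R. The helper name kmerCounts and the example sequence are made up for illustration and are not part of RepAn; the package's own feature extraction may differ.

```r
# Hypothetical helper, not part of RepAn: count overlapping k-mers in one sequence.
kmerCounts <- function(seqString, k = 4) {
  n <- nchar(seqString)
  if (n < k) return(table(character(0)))  # sequence too short for any k-mer
  starts <- seq_len(n - k + 1)            # start position of every k-mer
  kmers <- substring(seqString, starts, starts + k - 1)
  table(kmers)                            # frequency of each distinct k-mer
}

# Toy nucleotide sequence (illustrative only)
counts <- kmerCounts("TGTGCCAGCAGC", k = 4)
print(counts)
```

A sequence of length n yields n - k + 1 overlapping k-mers, so shorter widths give denser, less specific frequency vectors.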

paired

boolean; whether to perform paired analysis for matched datasets. Default is true.

clusterDaPcutoff

cutoff for sub-repertoire level differential abundance testing. Default is 0.1, which works well for our test cases.

positionWt

boolean; whether to use positional weights for kmer frequencies, default is false

distMethod

the method used to calculate distances between CDR3 feature vectors. Use "euclidean" for nt 4-mer feature vectors and "cosine" for aa 3-mer feature vectors.
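To make the two distMethod options concrete, the following base R sketch computes both distances for a pair of toy feature vectors. The function names, the vectors, and the definition of cosine distance as one minus cosine similarity are assumptions for illustration; RepAn's internal implementation may differ.

```r
# Toy k-mer frequency vectors (illustrative only)
a <- c(2, 0, 1, 3)
b <- c(1, 1, 0, 3)

# Euclidean distance: straight-line distance between the two vectors
euclideanDist <- function(x, y) sqrt(sum((x - y)^2))

# Cosine distance, assumed here to be 1 - cosine similarity;
# it depends only on the angle between the vectors, not their lengths
cosineDist <- function(x, y) {
  1 - sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

print(euclideanDist(a, b))
print(cosineDist(a, b))
```

Because cosine distance ignores vector magnitude, two sequences with proportional k-mer counts are treated as identical, whereas Euclidean distance also reflects differences in total counts.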

useDynamicTreeCut

boolean; if true (the default), the Dynamic Tree Cut algorithm is used to cut the clustering dendrograms. If false, findOptimalK is used to find the optimal k.

matchingMethod

method used to match cluster centroids from all samples to identify subrepertoires. Default is "km" (k-means). If "hc", hierarchical clustering with dynamic tree cut is used to define clusters. If "og", an in-house algorithm is used that matches each cluster centroid in the first sample to its closest centroids in all samples.

repeatResample

boolean; whether to perform repeat resampling. Default is true. If false, the entire repertoire dataset is used for analysis without downsampling.

nRepeats

number of repeat resample runs to perform if repeatResample is true, default is 10

resampleSize

the downsampling size in the repeat resample runs. Default is 5000.

useProb

boolean; if true, probabilistic sampling is performed for downsampling, with the most frequent CDR3s being more likely to be resampled. If false, all CDR3s have an equal chance of being resampled. Default is true.
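The difference between the two useProb settings can be sketched with base R's sample(). The sequences and counts below are made up, and sampling with replacement is an assumption made for illustration; the package's own resampling code may behave differently.

```r
set.seed(1)

# Toy repertoire (illustrative only): three CDR3s with very unequal counts
cdr3 <- c("CASSLGQ", "CASSPGT", "CASRDNE")
counts <- c(90, 9, 1)

# Frequency-weighted sampling: abundant CDR3s are drawn more often
weighted <- sample(cdr3, size = 1000, replace = TRUE,
                   prob = counts / sum(counts))

# Uniform sampling: every CDR3 has the same chance of being drawn
uniform <- sample(cdr3, size = 1000, replace = TRUE)

print(table(weighted))
print(table(uniform))
```

Weighted sampling keeps the downsampled repertoire close to the original clonal frequencies, while uniform sampling flattens them.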

returnAll

boolean; if true, the function returns a list whose first and second elements are the candidate CDR3s from differentially abundant subrepertoires, along with their ranking statistics, from the enrichment and de-enrichment analyses respectively. The third element contains the directory where all intermediate repeat resample results are written. If false, the location of the intermediate results is not returned.

nRR

the number of permutations to perform in the ranking step of candidate DA CDR3s to determine statistical significance. Default is 1000.
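For readers unfamiliar with permutation-based significance, the sketch below shows a generic two-group permutation test in base R, where nPerm plays a role analogous to nRR. This is not RepAn's ranking procedure: the data, the test statistic (difference in means), and all variable names are assumptions chosen only to illustrate how the number of permutations enters the p-value.

```r
set.seed(42)

# Toy abundance values for one candidate in two groups (illustrative only)
groupA <- c(12, 15, 14, 18, 16)
groupB <- c(5, 7, 6, 9, 8)

# Observed statistic: difference in group means
observed <- mean(groupA) - mean(groupB)

pooled <- c(groupA, groupB)
nPerm <- 1000  # analogous in spirit to nRR

# Shuffle group labels nPerm times and recompute the statistic each time
permStats <- replicate(nPerm, {
  shuffled <- sample(pooled)
  mean(shuffled[1:5]) - mean(shuffled[6:10])
})

# Two-sided p-value, with +1 so the estimate is never exactly zero
pValue <- (sum(abs(permStats) >= abs(observed)) + 1) / (nPerm + 1)
print(pValue)
```

With this estimator the smallest attainable p-value is 1 / (nPerm + 1), so more permutations give finer resolution at the cost of run time.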

analysisName

prefix to the directory name in which intermediate results from resample runs will be written.

Value

A data frame with all candidate DA CDR3s if returnAll is false; a list with the data frame of candidate DA CDR3s and the path to all intermediate results if returnAll is true.

Examples

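# repObj is assumed to be an existing object holding all repertoire sample data
results <- runDaAnalysis(repObj, clusterby = "NT", kmerWidth = 4,
  paired = TRUE, clusterDaPcutoff = 0.1, positionWt = FALSE,
  distMethod = "euclidean", matchingMethod = "km", nRepeats = 2,
  resampleSize = 1000, useProb = TRUE, returnAll = TRUE, nRR = 1000)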


dyohanne/RepAn documentation built on Feb. 3, 2023, 2:41 p.m.