bcRep: Advanced Analysis of B Cell Receptor Repertoire Data
In bcRep: Advanced Analysis of B Cell Receptor Repertoire Data

title: 'bcRep: Advanced Analysis of B Cell Receptor Repertoire Data' author: "Julia Bischof Julia.Bischof@uksh.de" date: 'r Sys.Date()' output: pdf_document: toc: yes toc_depth: 3 html_document: theme: united toc: yes word_document: default

Introduction

The bcRep package helps to analyze IMGT/HighV-QUEST* output in more detail. It functions well with B cells, but can also be used for T cell data, in some cases.
Using this package IMGT/HighV-QUEST output files can be analysed, f.e. functionality, junction frames, gene usage and mutations. The package provides functions to analyze clones out of the IMGT/HighV-QUEST output, but also methods to compare sequences and clones, are provided. Further a function to do principal coordinate analysis on dissimilarity/distance measurements is implemented. Almost all results can be visualized via implemented plotting functions.

*: Alamyar, E. et al., IMGT/HighV-QUEST: A High-Throughput System and Web Portal for the Analysis of Rearranged Nucleotide Sequences of Antigen Receptors, JOBIM 2010, Paper 63 (2010). Website: http://www.imgt.org/

Package features

Handling IMGT/HighV-QUEST output: combine several output folders; read IMGT tables
Data manipulation: filtering sequences or clones for functionality or junction frames
Descriptive statistics: functionality and junction frame usage; CDR3 length distribution
Clonotype analysis: cluster sequences to clones
Gene usage analysis
Basic mutation analysis; proportions of amino acid mutations; proportions of bases nearby mutations
Amino acid distribution and diversity: amino acid distribution, richness, Shannon index, inverse Shannon index, Gini index
Comparison of different samples: gene usage, amino acid distribution, diversity, clones
Distances and Principal coordinate analysis on gene usage and sequence data
several visualization methods

General information and package data

bcRep depends on following packages: vegan, gplots, ineq, parallel, doParallel, foreach, ape, stringdist.

Parallel processing

Parallel processing is possible for functions clones(), clones.shared(), sequences.geneComb(), compare.aaDistribution(), compare.geneUsage() and compare.trueDiversity(). By default, 1 core is used.

Datasets

There are several datasets provided in the package. Most of them are examples of IMGT/HighV-QUEST output; two additional files represent clonotype files and one gene usage data:

library(bcRep)
data(summarytab) # An extract from IMGT/HighV-QUEST output file 1_Summary(...).txt
data(ntseqtab) # An extract from IMGT/HighV-QUEST output file 3_Nt-sequences(...).txt
data(aaseqtab) # An extract from IMGT/HighV-QUEST output file 5_AA-sequences(...).txt
data(mutationtab) # An extract from IMGT/HighV-QUEST output file 
                  # 7_V-REGION-mutation-and-AA-change-table(...).txt

data(clones.ind) # Clonotypes of one individual
data(clones.allind) # Clonotypes of eight individuals

data(vgenes) # V gene usage of 10 individuals (rows) and 30 genes (columns)

Processing of IMGT/HighV-QUEST data

IMGT/HighV-QUEST can process datasets with up to 500.000 sequences. For larger data sets, FASTA files have to be splitted into smaller datasets and be uploaded individually to IMGT. Afterwards the function combineIMGT() can be used to combine several folders.

# Example: combine folders IMGT1a, IMGT1b and IMGT1c to a new datset NewProject
combineIMGT(folders = c("pathTo/IMGT1a", "pathTo/IMGT1b", "pathTo/IMGT1c"), 
            name = "NewProject")

To read IMGT/HighV-QUEST output files, use readIMGT("PathTo/file.txt",filterNoResults=TRUE). You can choose between including or excluding sequences without any information (lines with a sequence ID, but "No results"). Spaces and "-" in headers will be replaced by "_". If there is a particular IMGT table required as input for a function, this is mentioned in the corresponding help file.

Amino acid distribution

Amino acid distribution can be analyzed using aaDistribution() and visualized with plotAADistribution().

This function returns a list containing proportions of all amino acids (including stop codons "*") for each analyzed sequence length. Optionally, also the number used for analysis can be given.

Figures can be saved as PDF in the working directory.

library(bcRep)
library(pander)
```r
# Example: 
data(aaseqtab)
aadistr<-aaDistribution(sequences = aaseqtab$CDR3_IMGT, numberSeq = TRUE)

# First 4 columns of Amino acid distribution table: 
        # for a sequence length of 13 AA (* = stop codon)
        # (aadistr$Amino_acid_distribution$`sequence length = 13`)

```r
library(pander)
pandoc.table(aadistr$Amino_acid_distribution$`sequence length = 13`[, 1:4])

```r
# Plot example for sequence lengths of 14-17 amino acids:
aadistr.part<-list(aadistr$Amino_acid_distribution[13:16], 
                   data.frame(aadistr$Number_of_sequences_per_length[13:16,]))
names(aadistr.part)<-names(aadistr)
plotAADistribution(aaDistribution.tab=aadistr.part, plotAADistr=TRUE, 
                   plotSeqN=TRUE, PDF=NULL)

Diversity

Diversity of sequences and clones can be analyzed using trueDiversity() or clones.giniIndex().

True diversity

Using trueDiversity() richness or diversity of sequences with the same length can be analyzed. Basically diversity of amino acids at each position is calculated. True diversity can be measured for orders q = 0, 1 or 2. plotTrueDiversity() can be used for visualization. If mean.plot = F, a figure for each sequence length including diversity per position will be returned. If mean.plot = T only one figure with mean diversity values for all sequences with the same length will be returned.

Order 0: Richness (in this case it represents number of different amino acids per position).

Order 1: Exponential function of Shannon entropy using the natural logarithm (weights all amino acids by their frequency).

Order 2: Inverse Simpson entropy (weights all amino acids by their frequency, but weights are given more to abundant amino acids).

These indices are very similar (Hill, 1973). For example the exponential function of Shannon index is linearly related to inverse Simpson.

Amino acid sequences can be found in IMGT output table 5_AA-sequences(...).txt.

library(bcRep)
library(pander)
```r
# Example:
data(aaseqtab)
trueDiv<-trueDiversity(sequences = aaseqtab$CDR3_IMGT, order = 1) 
        # using exponent of Shannon entropy

# True diversity of order 1 for amino acid length of 5 AA 
        # (trueDiv$True_diversity$'sequence length = 5')

```r
pandoc.table(trueDiv$True_diversity$'sequence length = 5')

```r
# True diversity for sequences of amino acid length 14-17: 
trueDiv.part<-list(trueDiv$True_diversity_order, trueDiv$True_diversity[13:16])
names(trueDiv.part)<-names(trueDiv)
plotTrueDiversity(trueDiversity.tab=trueDiv.part, mean.plot = F,color="darkblue", PDF=NULL)
```r
plotTrueDiversity(trueDiversity.tab=trueDiv, mean.plot = T,color="darkblue", PDF=NULL)

Gini index

'clones.giniIndex()' calculates the Gini index of clones. Input is a vector containing clone sizes (copy number). The Gini index measures the inequality of clone size distribution. It ranges between 0 and 1. An index of 0 represents a polyclonal distribution, where all clones have same size. An index of 1 represents a perfect monoclonal distribution.

If a PDF project name is given, the Lorenz curve will be returned, as well. Lorenz curve can be interpreted as p*100 percent have L(p)*100 percent of clone size.

library(bcRep)
```r
# Example:
data(clones.ind)
clones.giniIndex(clone.size=clones.ind$total_number_of_sequences, PDF = NULL)

Gene usage

Gene usage can be analyzed using functions geneUsage() and sequences.geneComb().

It can be differentiated between subgroup (e.g. IGHV1), gene (e.g. IGHV1-1) or allele (e.g. IGHV1-1*2), as well as between relative or absolute values.

When using function geneUsage(), functionality and junction frame usage, dependent on gene usage can be studied. Please note, that input vectors have to have the same order. Single genes per line, as well as several genes per line, can be processed. IMGT nomenclature has to be used (e.g. "Homsap IGHV4-34*01")!

# Example:
data(summarytab)
Vgu<-geneUsage(genes = summarytab$V_GENE_and_allele, level = "subgroup", 
                functionality = summarytab$Functionality)
plotGeneUsage(geneUsage.tab = Vgu, plotFunctionality = TRUE, PDF = NULL, 
              title = "IGHV usage")

Using function sequences.geneComb(), gene/gene ratios of two gene families will be analyzed. IMGT nomenclature has to be used (e.g. "Homsap IGHV4-34*01"). Results can be plotted with and without gene/gene combinations, where one or both genes are unknown.

library(bcRep)
```r
# Example: 
data(summarytab)
VDcomb.tab<-sequences.geneComb(family1 = summarytab$V_GENE_and_allele, 
                family2 = summarytab$D_GENE_and_allele, 
                level = "subgroup", abundance = "relative")
plotGeneComb(geneComb.tab = VDcomb.tab, withNA = FALSE, PDF = NULL)

Functions for a set of sequences

Function to filter and to summarize IMGT/HighV-QUEST are provided:

Filter sequences

Sequence files can be filtered for functionality or junction frame usage:

sequences.getProductives(): filters for productive sequences
sequences.getUnproductives(): filters for unproductive sequences
sequences.getAnyFunctionality(): excludes sequences without any functionality information
sequences.getInFrames(): filters for in-frame sequences
sequences.getOutOfFrames(): filters for out-of-frame sequences
sequences.getAnyJunctionFrame(): excludes sequences without any junction frame information

The function looks for a column containing functionality or junction frame information and filters the whole table. Columns will remain, rows will be filtered. Junction frame information is only available in the IMGT summary table (1_Summary(...).txt). Functionality information is available in every table.

library(bcRep)
```r
# Example: filter for productive sequences
data(summarytab)
ProductiveSequences<-sequences.getProductives(summarytab)
# dimension of the summary table [rows, columns]:
dim(summarytab) 
# dimension of the summary table, filtered for productive sequences [rows, columns]:
dim(ProductiveSequences)

Functionality and junction frames

Some basic statistics about functionality and junction frames can be given by sequences.functionality() and sequences.junctionFrame(). Both functions return absolute or relative values for each kind of sequences.

library(bcRep)
library(pander)
```r
# Example:
data(summarytab)
sequences.functionality(data = summarytab$Functionality, relativeValues=TRUE)

# Proportion of productive and unproductive sequences:
```r
data(summarytab)
pandoc.table(sequences.functionality(data = summarytab$Functionality, relativeValues=TRUE))

Mutations

Some basic mutation analysis can be done with sequences.mutation(). Input is the IMGT output table 7_V-REGION-mutation-and-AA-change-table(...).txt and optionally 1_Summary(...).txt. Mutation analysis can be done for V-region, FR1-3 and CDR1-2 sequences. Functionality, as well as junction frame analysis can be included. Further the R/S ratio (ratio of replacement and silent mutations) can be returned.

# Example:
library(bcRep)
library(pander)

```r
data(mutationtab)
V.mutation<-sequences.mutation(mutationtab = mutationtab, sequence = "V", 
                               junctionFr = TRUE, rsRatio=TRUE)

# V.mutation$Number_of_mutations, first 6 lines:

```r
pandoc.table(head(V.mutation$Number_of_mutations))
# V.mutation$Number_of_mutations, first 6 lines:

```r
# V.mutation$Junction frame:

```r
pandoc.table(V.mutation$Junction_frame)

Proportions of nucleotides at and nearby mutation sites

Proportions of nucleotide distributions at and nearby mutation sites can be calculated using sequences.mutation.base(). This function calculates the porportion of a, c, g or t at the mutated positions and/or next to mutations from position -3 to +3. Parallel processing can be used for large datasets.

plotSequencesMutationAA() visualizes base proportions nearby mutations. Barplots are given for each base.

library(bcRep)
library(pander)

```r
# Example:

data(mutationtab)
data(summarytab)
V.BaseMut<-sequences.mutation.base(mutationtab = mutationtab, summarytab = summarytab, 
                                   analyseEnvironment = TRUE, analyseMutation = TRUE, 
                                   sequence = "V", nrCores=1)
plotSequencesMutationBase(mutationBaseTab = V.BaseMut, plotEnvironment = TRUE, 
     plotMutation = TRUE, colHeatmap = c("white","darkblue"))

Proportions of amino acid changes

Proportions of amino acid changes can be calculated using sequences.mutation.AA(). This function calculates the porportion of mutations from amino acid x to amino acid y. The original amino acids are in rows, the mutated one in columns.

plotSequencesMutationAA() visualizes the 20 x 20 data frame in some kind of heatmap, showing groups of proportions. Furthermore changes of amino acid characteristics, like hydropathy, volume or chemical can be shown (orange dots). Figures can be saved as PDF.

library(bcRep)
library(pander)

```r
# Example:

data(mutationtab)
V.AAmut<-sequences.mutation.AA(mutationtab = mutationtab, sequence = "V")
plotSequencesMutationAA(mutationAAtab = V.AAmut, showChange = "hydropathy")

Functions for a set of clones

bcRep provides some functions to study clone features. Some columns of the clonotype file can simply be plotted using boxplot() or barplot().

Defining clones and shared clones

Clones can be defined from a set of sequences, using clones(). Criteria for sequences, belonging to the same clones are:

same CDR3 (amino acid) length and a identity of a given treshold (it's possible, to look for same CDR3 sequences or for a CDR3 identity of e.g. 85%)
same V gene
same J gene (optional)

Output of the clone table can be individually specified (see help), but following columns are mandatory:

Shared CDR3 amino acid sequence(s)
CDR3 amino acid sequence length
Number of unique CDR3 sequences (each CDR3 sequence is mentioned once)
Number of all CDR3 sequences (CDR3 sequences are listed with duplicates)
Sequence count per CDR3
V gene (as used for analysis)
V gene and allele (IMGT nomenclature)
J gene (as used for analysis)
J gene and allele (IMGT nomenclature)

Optionally columns like D gene, CDR3 amino acid sequences, CDR3 nucleotide sequences, functionality, junction frames and so on, can be specified.

Further the function clones.IDlist() returns a table containing all sequence ID's provided in the IMGT tables and correpsonding clone ID's. This function only works, if sequence ID's were returned in clone table [clones(..., dispSeqID = TRUE)]. So you can get information about which sequence belongs to which clone or even which sequences were not matched to any of those clones.

# Example:
data(aaseqtab)
data(summarytab)
clones.tab<-clones(aaseqtab = aaseqtab, summarytab = summarytab, ntseqtab = NULL, 
                   identity = 0.85, useJ = TRUE, dispD = FALSE, dispSeqID = FALSE, 
                   dispCDR3aa = FALSE, dispCDR3nt = FALSE,  
                   dispJunctionFr.ratio = FALSE, dispJunctionFr.list = FALSE, 
                   dispFunctionality.ratio = FALSE, dispFunctionality.list = FALSE, 
                   dispTotalSeq = FALSE, nrCores=1)

```r
# Example of a clonotype file (not from the command above)
data(clones.ind)
str(clones.ind, strict.width="cut", width=85)

data(aaseqtab)
data(summarytab)
clones.tab<-clones(aaseqtab = aaseqtab, summarytab = summarytab, dispSeqID = T)
clones.ID<-clones.IDlist(clones.seqID = clones.tab$Sequence_IDs, summarytab.seqID = summarytab$Sequence_ID)
clones.ID[1:5,] # Example of output of clones.IDlist()

Filter clones for their size

Using clones.filterSize() you can filter a set of clonotypes by their size. Three methods for filtering are available.

You can set tresholds for the parameters...

number: filters the table for a given number of sequences; e.g. 20 smallest clones
propOfClones: filters the table for the proportion of the total number of clones, e.g. the 10% biggest clones
propOfSequences: filters the table for the proportion of all sequences, belonging to all clones; e.g. clones, that include 1% of all sequences

Additionally, filtering can be done for both ends (biggest and smallest clones; two.tailed), or only for biggest (upper.tail) or smallest (lower.tail) clones. Biggest clones refers to clones with the biggest sequence number (or largest size/copy number) (same for smallest clones).

library(bcRep)
library(pander)
```r
# Example 1: Getting the 3 smallest clones (clones with 85% CDR3 identity)
clones.filtered1<-clones.filterSize(clones.tab=clones.ind, 
                                    column="total_number_of_sequences", 
                                    number=3, method="lower.tail")

    # Output of clones.filtered1[,c("total_number_of_sequences",
            # "unique_CDR3_sequences_AA", "CDR3_length_AA", "V_gene")]:
```r
pandoc.table(data.frame(clones.filtered1[,c("total_number_of_sequences","unique_CDR3_sequences_AA","CDR3_length_AA","V_gene")], row.names=NULL))

```r
# Example 2: Getting 10% biggest and smallest clones 
# (clones with 85% CDR3 identity; column 4 = "total_number_of_sequences")
clones.filtered2<-clones.filterSize(clones.tab=clones.ind, column=4, propOfClones=0.1,
                                    method="two.tailed")
names(clones.filtered2) # a list with biggest and smallest 10% clones
dim(clones.filtered2$upper.tail) # dimension of table with biggest clones [rows, columns]
dim(clones.filtered2$lower.tail) # dimension of table with smallest clones [rows, columns]

# Example 3: Getting clones, that include 0.5% of all sequences 
# (clones with 85% CDR3 identity; column 4 = "total_number_of_sequences"")
clones.filtered3<-clones.filterSize(clones.tab=clones.ind, column=4,
                                    propOfSequences=0.005, method="two.tailed")
names(clones.filtered3) # a list with biggest and smallest 10% clones
dim(clones.ind) ## dimension of clones.ind table [rows, columns]
## >> number of rows is equal to number of rows of biggest and smallest clones
dim(clones.filtered3$upper.tail) # dimension of table with biggest clones [rows, columns]
dim(clones.filtered3$lower.tail) # dimension of table with smallest clones [rows, columns]

Filter clones for functionality or junction frame usage

clones.filterFunctionality() and clones.filterJunctionFrame() filter set of clones for functionality or junction frame usage.

Filtering for functionality: filter clones for only productive or unproductive sequences

library(bcRep)
```r
# Example
data(clones.ind)
productiveClones<-clones.filterFunctionality(clones.tab = clones.ind, 
                                             filter = "productive")
# dimension of clones.ind [rows, columns]:
dim(clones.ind) 
# dimension of clones.ind, filtered for productive sequences [rows, columns]:
dim(productiveClones)

Filtering for junction frame usage: filter for only in-frame or out-of-frame sequences

library(bcRep)
```r
# Example
data(clones.ind)
inFrameClones<-clones.filterJunctionFrame(clones.tab = clones.ind, filter = "in-frame")
# dimension of clones.ind [rows, columns]:
dim(clones.ind) 
# dimension of clones.ind, filtered for in-frame sequences [rows, columns]:
dim(inFrameClones)

Clone features

Functions to analyse clone features, like CDR3 length distribution or clone copy number, are provided.

CDR3 length distribution of clones

Using clones.CDR3Length() CDR3 length distribution of clones can be estimated. The corresponding visualization method is plotClonesCDR3Length(). Input for both functions are vectors containing nucleotide or amino acid sequences. Further functionality and junction frame usage, depending on CDR3 length can be studied. Absolute or relative values can be returned.

library(bcRep)
library(pander)
```r
# Example:
data(clones.ind)
CDR3length<-clones.CDR3Length(CDR3Length = clones.ind$CDR3_length_AA, 
                  functionality = clones.ind$Functionality_all_sequences)

# output of some CDR3 length proportions (CDR3length$CDR3_length[1:3]): 
```r
pandoc.table(CDR3length$CDR3_length[1:3])
```r
# output of functionality ratios for some CDR3 lengths 
    # (CDR3length$CDR3_length_vs_functionality[,1:3)])
```r
pandoc.table(CDR3length$CDR3_length_vs_functionality[,1:3])

```r
plotClonesCDR3Length(CDR3Length = clones.ind$CDR3_length_AA, 
                     functionality = clones.ind$Functionality_all_sequences, 
                     title = "CDR3 length distribution", PDF = NULL)

Clone copy number distribution

plotClonesCopyNumber() can be used to plot clone sizes as an boxplot.

# Example: Clone copy number distribution with and without outliers
plotClonesCopyNumber(copyNumber = clones.ind$total_number_of_sequences, 
                     withOutliers=TRUE, color = "darkblue", 
                     title = "Copy number distribution", PDF = NULL)
plotClonesCopyNumber(copyNumber = clones.ind$total_number_of_sequences, 
                     withOutliers=FALSE, color = "darkblue", 
                     title = "Copy number distribution", PDF = NULL)

Further functions like geneUsage()or trueDiversity() can be used for further analysis of clone sets.

Comparison of different samples

If more than one sample should be analyzed, bcRep provides some functions to compare different samples. Gene usage, amino acid distribution, diversity and shared clones can be compared and visualized.

Comparison of gene usage

Gene usage can be compared using compare.geneUsage(). Input is a list including sequence vectors of each individual. If no names for the samples are specified, samples will be called Sample1, Sample2, ... in input order. Analysis of subgroup (e.g. IGHV1), gene (e.g. IGHV1-1) or allele (e.g. IGHV1-1*2), as well as of relative or absolute values is possible.

Results can be visualized using plotCompareGeneUsage() (optional saved as PDF in working directory). A heatmap will be returned, with samples as rows and genes as columns. Proportions of genes are color coded.

library(bcRep)
library(pander)
```r
data(aaseqtab)
data(aaseqtab2)
V.comp<-compare.geneUsage(gene.list = list(aaseqtab$V_GENE_and_allele, 
                                           aaseqtab2$V_GENE_and_allele), 
                               level = "subgroup", abundance = "relative", 
                          names = c("Individual1", "Individual2"), 
                          nrCores = 1)
```r

pandoc.table(V.comp)
```r
plotCompareGeneUsage(comp.tab = V.comp, color = c("gray97", "darkblue"), PDF = NULL)

Comparison of amino acid distribution

The amino acid distribution between two or more individuals can be compared using compare.aaDistribution(). Input is a list including sequence vectors for each individual. Sequences of the same length will be analyzed together. Optional the number of sequences used for analysis can be returned. If no names for the samples are specified, samples will be called Sample1, Sample2, ... in input order.

The output is a list, containing amino acid distributions for each sequence length and each individual.

plotCompareAADistribution() returns a plot (optional saved as PDF in working directory), where columns represent the samples, and rows represent different sequence lengths. In each field, one bar represents one position of the sequence.

library(bcRep)
```r
data(aaseqtab)
data(aaseqtab2)
AAdistr.comp<-compare.aaDistribution(sequence.list = list(aaseqtab$CDR3_IMGT,
                                                          aaseqtab2$CDR3_IMGT), 
                                     names = c("Individual1", "Individual2"), 
                                     numberSeq = TRUE, nrCores = 1)

# Comparison of sequence length of 14-16 amino acids: 
## Individual 1: grep("14|15|16",names(AAdistr.comp$Amino_acid_distribution$Individual1)) 
        ## >> 13:15
## Individual 2: grep("14|15|16",names(AAdistr.comp$Amino_acid_distribution$Individual2)) 
        ## >> 12:14
AAdistr.comp.part<-list(list(AAdistr.comp$Amino_acid_distribution$Individual1[13:15], 
                             AAdistr.comp$Amino_acid_distribution$Individual2[12:14]),
                        list(AAdistr.comp$Number_of_sequences_per_length$Individual1[13:15],
                             AAdistr.comp$Number_of_sequences_per_length$Individual2[12:14]))
names(AAdistr.comp.part)<-names(AAdistr.comp)
names(AAdistr.comp.part$Amino_acid_distribution)<-
  names(AAdistr.comp$Amino_acid_distribution)
names(AAdistr.comp.part$Number_of_sequences_per_length)<-
  names(AAdistr.comp$Number_of_sequences_per_length)

plotCompareAADistribution(comp.tab = AAdistr.comp.part, plotSeqN = TRUE, 
                          colors=c("darkblue","darkred"), PDF = NULL)

Comparison of richness and diversity

Richness and diversity can be compared using compare.trueDiversity(). Input can be a list of sequences or the output of compare.aaDistribution(). Diversity can be analyzed for order 0 (richness), 1 (exponent of Shannon entropy) or 2 (inverse Simpson) (see diversity analysis above). If no names for the samples are specified, samples will be called Sample1, Sample2, ... in the input order.

The output is a list, containing richness or diversity indices for each sequence length and each individual.

Results can be visualized using plotCompareTrueDiversity() (optional saved as PDF in working directory). A plot will be returned, with one figure for each sequence length, if mean.plot = F. If mean.plot = T only one figure with mean diversity values for all sequences with the same length will be returned. Each figure contains the sequence position on the x-axis and the richness/diversity index on the y-axis. If no colors are specified, rainbow() will be used.

library(bcRep)
```r
data(aaseqtab)
data(aaseqtab2)
trueDiv.comp<-compare.trueDiversity(sequence.list = list(aaseqtab$CDR3_IMGT, 
                                                         aaseqtab2$CDR3_IMGT), 
                                    names = c("Individual1", "Individual2"), 
                                    order = 1, nrCores = 1)

# Comparison of sequence length of 14-16 amino acids: 
grepindex1<-grep("14|15|16",names(trueDiv.comp$Individual1))
grepindex2<-grep("14|15|16",names(trueDiv.comp$Individual2))
trueDiv.comp.part<-list(trueDiv.comp$True_diversity_order, 
                        trueDiv.comp$Individual1[grepindex1],
                        trueDiv.comp$Individual2[grepindex2])
names(trueDiv.comp.part)<-names(trueDiv.comp)
plotCompareTrueDiversity(comp.tab = trueDiv.comp.part, mean.plot = F, colors=c("darkblue","darkred"), 
                         PDF = NULL)
plotCompareTrueDiversity(comp.tab = trueDiv.comp, mean.plot = T, colors = c("darkblue","darkred"), 
                         PDF = NULL)

Looking for clones, that are shared between several samples

To compare clones of different samples, the function clones.shared() can be used. The criteria are the same than in function clones():

same CDR3 (amino acid) length and a identity of a given treshold (so it's possible, to look for same CDR3 sequences or for a CDR3 identity of 85%)
same V gene
same J gene (optional)

The input for the function has to be prepared as following: combine all individual clone files to one using rbind() and add a first column with sample ID's. An example you can find in data(clones.allind).

library(bcRep)
library(pander)
```r
# Example:
data(clones.allind) # includes 2948 clones for 7 individuals
colnames(clones.allind) # colnames of clones.allind

Output of 'clones.shared()' is a table, which can have the same columns than the ones from clones(). Individual data is seperated by ";". Columns like Number of samples sharing a clone ID's of samples sharing a clone * copy number of sequences per sample are provided.

# Example:
data(clones.allind)
sharedclones<-clones.shared(clones.tab = clones.allind) 
              # with default parameters: identity = 0.85, useJ = TRUE, ....

The output of 'clones.shared()' can be summarized using clones.shared.summary(). This function returns a data frame, which contains the number of individual clones (optional, only if clone table is given) and the number of shared clones. The number of individual clones is equivalent to the total number of clones per individual minus the number of shared clones.

# Example:
data(clones.allind)
sharedclones<-clones.shared(clones.tab = clones.allind, identity = 0.85, useJ = TRUE)
sharedclones.summary<-clones.shared.summary(shared.tab = sharedclones, 
                                            clones.tab = clones.allind)
```r
# Example:
data(clones.allind)
sharedclones<-clones.shared(clones.tab = clones.allind, identity = 0.85, useJ = TRUE)
sharedclones.summary<-clones.shared.summary(shared.tab = sharedclones, 
                                            clones.tab = clones.allind)

# sharedclones.summary:
```r
pandoc.table(sharedclones.summary)

Dissimilarity and distance measurements on gene usage data

The function 'geneUsage.distance()' provides several dissimilarity and distance measurements for gene usage data. Bray-Curtis, Jaccard and cosine indices can be calculated on data (taken from vegan::vegdist and proxy::dist).frames with genes in columns and samples in rows, see Example below:

# Example:
data(vgenes) 
vgenes[1:5,1:5]
```r
data(vgenes) 
pandoc.table(vgenes[1:5,1:5])

'geneUsage.distance()' then calculates a distance matrix:

# Example for 5 samples and 5 genes:
data(vgenes) 
geneUsage.distance(geneUsage.tab = vgenes[1:5,1:5], method = "bc")
```r
data(vgenes) 
pandoc.table(geneUsage.distance(geneUsage.tab = vgenes[1:5,1:5], method = "bc"))

Dissimilarity and distance measurements on sequences data

Using 'sequences.distance()' distance matrices, based on sequence data, e.g. CDR3 sequences, can be calculated. Following dissmilarity/distance measurements can be used (taken from stringdist::stringdist-metric):

Levenshtein
cosine
q-gram
Jaccard
Jaro-Winker ('ja-wi')
Damerau-Levenshtein ('dam-le')
Hamming
Optimal string alignment ('osa')
Longest common substring ('lcs')

Further it can be distinguished wheather all sequences or subsets of sequences of the same length shall be used for distance estimation.

# Example 1:
## Divide sequences into subsets of same length and then apply Levenshtein distance:
data(clones.ind)
clones.ind<-subset(clones.ind, CDR3_length_AA==7) #take only a subset of clones.ind (CDR3 sequences of 7 AA)
dist1<-sequences.distance(sequences = clones.ind$unique_CDR3_sequences_AA, 
     method = "levenshtein", divLength=TRUE)
# Subset of output for 7AA sequences:
dist1[[1]][1:5,1:5]
```r
data(clones.ind)
clones.ind<-subset(clones.ind, CDR3_length_AA==7)
dist1<-sequences.distance(sequences = clones.ind$unique_CDR3_sequences_AA, 
     method = "levenshtein", divLength=TRUE)
pandoc.table(dist1[[1]][1:5,1:5])

```r
# Example 2: 
## Use all sequences with a length of 7 or 8 AA for cosine distance matrix:
data(clones.allind)
clones.allind<-subset(clones.allind, CDR3_length_AA %in% c(9,31)) #take only a subset of clones.ind (CDR3 sequences of 9 or 31 AA)
dist2<-sequences.distance(sequences = clones.allind$unique_CDR3_sequences_AA, 
     groups = clones.allind$individuals, method = "cosine", divLength=FALSE)
# Subset of output:
dist2[[1]][1:4,1:4]
```r
data(clones.allind)
clones.allind<-subset(clones.allind, CDR3_length_AA %in% c(9,31))

dist2<-sequences.distance(sequences = clones.allind$unique_CDR3_sequences_AA, 
     groups = clones.allind$individuals, method = "cosine", divLength=FALSE)
pandoc.table(dist2[[1]][1:4,1:4])

Principal coordinate analysis on distance matrices

Principal coordinate analysis can be performed on distance matrices. The function dist.PCoA() provides principal coordinate objects, like in 'ape::pcoa()', with an output as follows (taken from ape::pcoa):

correction: value of parameter and correction variable, if any
note: type of correction
values : The eigenvalues and related information:
- Eigenvalues: all eigenvalues (positive, null, negative)
- Relative_eig: relative eigenvalues
- Corr_eig: corrected eigenvalues
- Rel_corr_eig: relative eigenvalues after Lingoes or Cailliez correction.
- Broken_stick: expected fractions of variance under the broken stick model.
- Cumul_eig: cumulative relative eigenvalues.
- Cum_corr_eig: cumulative corrected relative eigenvalues.
- Cumul_br_stick: cumulative broken stick fractions.
vectors: the principal coordinates with positive eigenvalues.
trace: the trace of the distance matrix. This is also the sum of all eigenvalues, positive and negative.
vectors.cor: the principal coordinates with positive eigenvalues from the distance matrix corrected using the method specified by parameter correction.
trace.cor: the trace of the corrected distance matrix. This is also the sum of its eigenvalues.

The results can also be plotted, using 'plotDistPCoA()'. Input has to be the PCoA table from 'dist.PCoA()', a 'groups' data.frame and a pair of axes to be plotted. Optional arguments are names, plotCorrection (default: FALSE), title, plotLegend (default: FALSE) and PDF.

# Example for gene usage data:
data(vgenes) 
vgenes<-vgenes[,1:10] # include only genes 1-10
gu.dist<-geneUsage.distance(geneUsage.tab = vgenes, method = "bc")
gu.pcoa<-dist.PCoA(dist.tab = gu.dist, correction = "cailliez")
```r
str(gu.pcoa) # a PCoA object
```r
# Example of a 'groups' data.frame: in the case, there are two groups:
groups.vec<-cbind(rownames(vgenes),c(rep("group1",5),rep("group2",5)))
colnames(groups.vec)<-c("Sample","Group")
pandoc.table(groups.vec)

plotDistPCoA(pcoa.tab = gu.pcoa, groups= groups.vec, axes = c(1,2), 
             names = rownames(vgenes), plotLegend = TRUE)

# Example for sequence data, divided into subsets of the same length:
data(clones.ind)
clones.ind<-subset(clones.ind, CDR3_length_AA %in% c(7:10)) # include only CDR3 sequences of 7-10 AA
seq.dist<-sequences.distance(sequences = clones.ind$unique_CDR3_sequences_AA, 
     method = "levenshtein", divLength=T)
seq.pcoa<-dist.PCoA(dist.tab = seq.dist, correction = "none")

# Example of a 'groups' data.frame: 
## in the case, there are no groups (all samples have the same group):
groups.vec<-unlist(apply(data.frame(clones.ind$unique_CDR3_sequences_AA),1,
                         function(x){strsplit(x,split=", ")[[1]]}))
groups.vec<-cbind(groups.vec, 1)

plotDistPCoA(pcoa.tab = seq.pcoa, groups=groups.vec, axes = c(1,2), 
             plotCorrection = FALSE, 
             title = 'PCoA for sequences of length 7-10 AA', plotLegend=FALSE)

Any scripts or data that you put into this service are public.

bcRep documentation built on May 2, 2019, 5:14 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.