In tcR: Advanced Data Analysis of Immune Receptor Repertoires

IMPORTANT INFORMATION

The tcR package WILL SOON BE ORPHANED AND REMOVED FROM CRAN.

A new package is available that is designed to replace tcR: immunarch -- http://immunarch.com/

We will be happy to help you to move to the new package. Feel free to contact us: http://github.com/immunomind/immunarch

Sincerely, immunarch dev team and Vadim I. Nazarov, lead developer of tcR

Introduction

The tcR package designed to help researchers in the immunology field to analyse T cell receptor (TCR) and immunoglobulin (Ig) repertoires. In this vignette, I will cover procedures for immune receptor repertoire analysis provided with the package.

Terms:

Clonotype: a group of T / B cell clones with equal CDR3 nucleotide sequences and equal Variable genes.
Cloneset / repertoire: a set of clonotypes. Represented as a data frame in which each row corresponds to a unique clonotype.
UMI: Unique Molecular Identifier (see this paper for details)

library(tcR)
data(twa)
data(twb)

Package features

Parsers for outputs of various tools for CDR3 extraction and genes alignment (currently implemented parsers for MiTCR, MiGEC, VDJtools, ImmunoSEQ, IMSEQ and MiXCR)
Data manipulation (in-frame / out-of-frame sequences subsetting, clonotype motif search)
Descriptive statistics (number of reads, number of clonotypes, gene segment usage)
Shared clonotypes statistics (number of shared clonotypes, using V genes or not; sequential intersection among the most abundant clonotype ("top-cross"))
Repertoire comparison (Jaccard index, Morisita's overlap index, Horn's index, Tversky index, overlap coefficient)
V- and J genes usage and it's analysis (PCA, Shannon Entropy, Jensen-Shannon Divergence)
Diversity evaluation (ecological diversity index, Gini index, inverse Simpson index, rarefaction analysis)
Artificial repertoire generation (beta chain only, for now)
Spectratyping
Various visualisation procedures
Mutation networks (graphs, in which vertices represent CDR3 nucleotide / amino acid sequences and edges are connecting similar sequences with low hamming or edit distance between them)

Data in the package

There are two datasets provided with the package - twins data and V(D)J recombination genes data.

Downsampled twins data

twa.rda, twb.rda - two lists with 4 data frames in each list. Every data frame is a sample downsampled to the 10000 most abundant clonotypes of twins data (alpha and beta chains). Full data is available here:

Twins TCR data at Laboratory of Comparative and Functional Genomics

Explore the data:

# Load the package and load the data.
library(tcR)
data(twa)  # "twa" - list of length 4
data(twb)  # "twb" - list of length 4

# Explore the data.
head(twa[[1]])
head(twb[[1]])

Gene alphabets

Gene alphabets - character vectors with names of genes for TCR and Ig.

# Run help to see available alphabets.
?genealphabets
?genesegments
data(genesegments)

Quick start / automatic report generation

For the exploratory analysis of a single repertoire, use the RMarkdown report file at

"<path to the tcR package>/inst/library.report.Rmd"

Analysis in the file include statistics and visualisation of number of clones, clonotypes, in- and out-of-frame sequences, unique amino acid CDR3 sequences, V- and J-usage, most frequent k-mers, rarefaction analysis.

For the analysis of a group of repertoires ("cross-analysis"), use the RMarkdown report file at:

"<path to the tcR package>/inst/crossanalysis.report.Rmd}"

Analysis in this file include statistics and visualisation of number of shared clones and clonotypes, V-usage for individuals and groups, J-usage for individuals, Jensen-Shannon divergence among V-usages of repertoires and top-cross.

You need the knitr package installed in order to generate reports from default pipelines. In RStudio you can run a pipeline file as follows:

Run RStudio -> load the pipeline .Rmd files -> press the knitr button

Input parsing

Currently in tcR there are implemented parser for the next software:

MiTCR - parse.mitcr;
MiTCR w/ UMIs - parse.mitcrbc;
MiGEC - parse.migec;
VDJtools - parse.vdjtools;
ImmunoSEQ - parse.immunoseq;
MiXCR - parse.mixcr;
IMSEQ - parse.imseq.

Also a general parser parse.cloneset for a text table files is implemented. General wrapper for parsers is parse.file. User can also parse a list of files or the entire folder. Run ?parse.folder to see a help on parsing input files and a list of functions for parsing a specific input format.

# Parse file in "~/mitcr/immdata1.txt" as a MiTCR file.
immdata1 <- parse.file("~/mitcr_data/immdata1.txt", 'mitcr')
# equivalent to
immdata1.eq <- parse.mitcr("~/mitcr_data/immdata1.txt")

# Parse folder with MiGEC files.
immdata <- parse.folder("~/migec_data/", 'migec')

Cloneset representation

Clonesets represented in tcR as data frames with each row corresponding to the one nucleotide clonotype and with specific column names:

Umi.count - number of UMIs;
Umi.proportion - proportion of UMIs;
Read.count - number of reads;
Read.proportion - proportion of reads;
CDR3.nucleotide.sequence - CDR3 nucleotide sequence;
CDR3.amino.acid.sequence - CDR3 amino acid sequence;
V.gene - names of aligned Variable genes;
J.gene - names of aligned Joining genes;
D.gene - names of aligned Diversity genes;
V.end - last positions of aligned V genes (1-based);
J.start - first positions of aligned J genes (1-based);
D5.end - positions of D'5 end of aligned D genes (1-based);
D3.end - positions of D'3 end of aligned D genes (1-based);
VD.insertions - number of inserted nucleotides (N-nucleotides) at V-D junction (-1 for receptors with VJ recombination);
DJ.insertions - number of inserted nucleotides (N-nucleotides) at D-J junction (-1 for receptors with VJ recombination);
Total.insertions - total number of inserted nucleotides (number of N-nucleotides at V-J junction for receptors with VJ recombination).

Any data frame with this columns and of this class is suitable for processing with the package, hence user can generate their own table files and load them for the further analysis using read.csv, read.table and other base R functions. Please note that tcR internally expects all strings to be of class "character", not "factor". Therefore you should use R parsing functions with parameter stringsAsFactors=FALSE.

# No D genes is available here hence "" at "D.genes" and "-1" at positions.
str(twa[[1]])

str(twb[[1]])

Repertoire descriptive statistics

For the exploratory analysis tcR provides various functions for computing descriptive statistics.

Cloneset summary

To get a general view of a subject's repertoire (overall count of sequences, in- and out-of-frames numbers and proportions) use the cloneset.stats function. It returns a summary of counts of nucleotide and amino acid clonotypes, as well as summary of read counts:

cloneset.stats(twb)

For characterisation of a library use the repseq.stats function:

repseq.stats(twb)

Most abundant clonotypes statistics

Function clonal.proportion is used to get the number of most abundant by the count of reads clonotypes. E.g., compute number of clonotypes which fill up (approx.) the 25% from total repertoire's "Read.count":

                            # How many clonotypes fill up approximately
clonal.proportion(twb, 25)  # the 25% of the sum of values in 'Read.count'?

To get a proportion of the most abundant clonotypes' sum of reads to the overall number of reads in a repertoire, use top.proportion, i.e. get

($\sum$ reads of top clonotypes)$/$($\sum$ reads for all clonotypes).

E.g., get a proportion of the top-10 clonotypes' reads to the overall number of reads:

                          # What accounts a proportion of the top-10 clonotypes' reads
top.proportion(twb, 10)   # to the overall number of reads?
vis.top.proportions(twb)  # Plot this proportions.

Function tailbound.proportion with two arguments .col and .bound gets subset of the given data frame with clonotypes which have column .col with value $\leq$ .bound and computes the ratio of sums of count reads of such subset to the overall data frame. E.g., get proportion of sum of reads of sequences which has "Read.count" <= 100 to the overall number of reads:

                                # What is a proportion of sequences which
                                # have 'Read.count' <= 100 to the
tailbound.proportion(twb, 100)  # overall number of reads?

Clonal space homeostasis

Clonal space homeostasis is a useful statistics of how many space occupied by clonotypes with specific proportions.

# data(twb)
# Compute summary space of clones, that occupy
# [0, .05) and [.05, 1] proportion.
clonal.space.homeostasis(twb, c(Low = .05, High = 1))
# Use default arguments:
clonal.space.homeostasis(twb[[1]])

twb.space <- clonal.space.homeostasis(twb)
vis.clonal.space(twb.space)

In-frame and out-of-frame sequences

Functions for performing subsetting and counting number of in-frame and out-of-frame clonotypes are: count.inframes, count.outframes, get.inframes, get.outframes. Parameter .head for this functions is a parameter to the .head function, that applied to the input data frame or an input list of data frames before subsetting. Functions accept both data frames and list of data frames as parameters. E.g., get data frame with only in-frame sequences and count out-of-frame sequences in the first 5000 rows for this data frame:

imm.in <- get.inframes(twb) # Return all in-frame sequences from the 'twb'.

                            # Count the number of out-of-frame sequences
count.outframes(twb, 5000)  # from the first 5000 sequences.

General functions with parameter stands for 'all' (all sequences), 'in' (only in-frame sequences) or 'out' (only out-of-frame sequences) are get.frames and count.frames:

imm.in <- get.frames(twb, 'in') # Similar to 'get.inframes(twb)'.

count.frames(twb[[1]], 'all')   # Just return number of rows.

flag <- 'out'
count.frames(twb, flag, 5000)   # Similar to 'count.outframes(twb, 5000)'.

Search for a target CDR3 sequences

For exact or fuzzy search of sequences the package employed a function find.clonotypes. Input arguments for this function are a data frame or a list of data frames, targets (a character vector or data frame having one column with sequences and additional columns with, e.g., V genes), a value of which column or columns to return, a method to be used to compare sequences among each other (either "exact" for exact matching, "hamm" for matching sequences by Hamming distance (two sequences are matched if H $\leq$ 1) or "lev" for matching sequences by Levenshtein distance (two sequences are matched if L $\leq$ 1)), and column name from which sequences for matching are obtained. Sounds very complex, but in practice it's very easy, therefore let's go to examples.

Suppose we want to search for some CDR3 sequences in a number of repertoires:

cmv <- data.frame(CDR3.amino.acid.sequence = c('CASSSANYGYTF', 'CSVGRAQNEQFF', 'CASSLTGNTEAFF', 'CASSALGGAGTGELFF', 'CASSLIGVSSYNEQFF'),
                  V.genes = c('TRBV4-1', 'TRBV4-1', 'TRBV4-1', 'TRBV4-1', 'TRBV4-1'), stringsAsFactors = F)

cmv

We will search for them using all methods of matching (exact, hamming or levenshtein) and with and without matching by V-segment. Also, for the first case (exact matching and without V gene) we return "Total.insertions" column along with the "Read.count" column, and for the second case output will be a "Rank" - rank (generated by set.rank) of a clone or a clonotype in a data frame.

twb <- set.rank(twb)
# Case 1.
cmv.imm.ex <- 
  find.clonotypes(.data = twb[1:2], .targets = cmv[,1], .method = 'exact',
                  .col.name = c('Read.count', 'Total.insertions'),
                  .verbose = F)
head(cmv.imm.ex)

# Case 2.
# Search for CDR3 sequences with hamming distance <= 1
# to the one of the cmv$CDR3.amino.acid.sequence with
# matching V genes. Return ranks of found sequences.
cmv.imm.hamm.v <- 
  find.clonotypes(twb[1:3], cmv, 'hamm', 'Rank', 
                  .target.col = c('CDR3.amino.acid.sequence',
                                  'V.gene'),
                  .verbose = F)
head(cmv.imm.hamm.v)

# Case 3.
# Similar to the previous example, except
# using levenshtein distance and the "Read.count" column.
cmv.imm.lev.v <- 
  find.clonotypes(twb[1:3], cmv, 'lev', 
                  .target.col = c('CDR3.amino.acid.sequence', 'V.gene'),
                  .verbose = F)
head(cmv.imm.lev.v)

Gene usage

Variable and Joining gene usage (V-usage and J-usage) are important characteristics of repertoires. To access and compare them among repertoires tcR provides a few useful functions.

Gene usage computing

To access V- and J-usage of a repertoire tcR provides functions geneUsage. Function geneUsage, depending on parameters, computes frequencies or counts of the given elements (e.g., V genes) of the input data frame or the input list of data frames. V and J gene names for humans for TCR and Ig are stored in the .rda file genesegments.rda (they are identical to those form IMGT: \href{http://www.imgt.org/IMGTrepertoire/index.php?section=LocusGenes&repertoire=nomenclatures&species=human&group=TRBV}{link to beta genes (red ones)} and \href{http://www.imgt.org/IMGTrepertoire/index.php?section=LocusGenes&repertoire=nomenclatures&species=human&group=TRAV}{link to alpha genes (red ones)}). All of the mentioned functions are accept data frames as well as list of data frames. Output for those functions are data frames with the first column stands for a gene and the other for frequencies.

imm1.vs <- geneUsage(twb[[1]], HUMAN_TRBV)
head(imm1.vs)

imm.vs.all <- geneUsage(twb, HUMAN_TRBV)
imm.vs.all[1:10, 1:4]

You can also directly visualise gene usage with the function vis.gene.usage (if you pass the gene alphabet as a second argument):

# Put ".dodge = F" to get distinct plot for every data frame in the given list.
vis.gene.usage(twb, HUMAN_TRBJ, .main = 'twb J-usage dodge', .dodge = T)

vis.gene.usage(twb, HUMAN_TRBJ, .main = 'twb J-usage column', .dodge = F, .ncol = 2)

vis.gene.usage(imm1.vs, NA, .main = 'twb[[1]] V-usage', .coord.flip = F)

Gene usage comparing

To evaluate V- and J genes usage of repertoires, the package implements subroutines for two approaches to the analysis: measures from the information theory and PCA (Principal Component Analysis).

Shannon entropy and Jensen-Shannon divergence

To assess the diversity of genes usage user can use the entropy function. Kullback-Leibler assymetric measure (function kl.div) and Jensen-Shannon symmetric measure (functions js.div for computing JS-divergence between the given distributions and js.div.seg for computing JS-divergence between genes distributions of two clonesets or a list with data frames) are provided to estimate distance among gene usage of different repertoires. To visualise distances tcR employed the vis.radarlike function, see Section "Plots" for more detailed information.

                              # Transform "0:100" to distribution with Laplace correction 
entropy(0:100, .laplace = 1)  # (i.e., add "1" to every value before transformation).

entropy.seg(twb, HUMAN_TRBV)  # Compute entropy of V-segment usage for each data frame.

js.div.seg(twb[1:2], HUMAN_TRBV, .verbose = F)
imm.js <- js.div.seg(twb, HUMAN_TRBV, .verbose = F) 
vis.radarlike(imm.js, .ncol = 2)

Principal Component Analysis (PCA)

Principal component analysis (PCA) is a statistical procedure for transforming a set of observations to a set of special values for analysis. In tcR implemented functions pca.segments for performing PCA on V- or J-usage, and pca.segments.2D for performing PCA on VJ-usage. For plotting the PCA results see the vis.pca function.

pca.segments(twb, .genes = HUMAN_TRBV)  # Plot PCA results of V-segment usage.

# Return object of class "prcomp"
class(pca.segments(twb, .do.plot = F, .genes = HUMAN_TRBV))

Repertoire overlap analysis

tcR provides a number of functions for evaluating similarity of clonesets based on shared among clonesets clonotypes and working with data frames with shared clonotypes.

Overlap quantification

The general interface to all functions for computing cloneset overlap coefficients is the repOverlap function.

Number of shared clonotypes

The most straightforward yet a quite effective way to evaluate similarity of two clonesets is compute the number of shared clonotypes. tcR adds the new function intersectClonesets (repOverlap(your_data, 'exact')) which is by default computes the number of shared clonotypes using the "CDR3.nucleotide.sequence" columns of the given data frames, but user can change target columns by using arguments .type or .col. As in the find.clonotypes, user can choose which method apply to the elements: exact match of elements, match by Hamming distance or match by Levenshtein distance. Logical argument .norm is used to perform normalisation of the number of shared clonotypes by dividing this number by multiplication of clonesets' sizes (strongly recommended otherwise your results will be correlating with clonesets' sizes).

# Equivalent to intersect(twb[[1]]$CDR3.nucleotide.sequence,
#                         twb[[2]]$CDR3.nucleotide.sequence)
repOverlap(twb[1:2], 'exact', 'nuc', .verbose = F)

# Equivalent to intersectClonesets(twb, "n0e", .norm = T)
repOverlap(twb, 'exact', 'nuc', .norm = T, .verbose = F)
# Intersect by amino acid clonotypes + V genes
repOverlap(twb, 'exact', 'aa', .vgene = T, .verbose = F)

# Plot a heatmap of the number of shared clonotypes.
vis.heatmap(repOverlap(twb, 'exact', 'aa', .vgene = T, .verbose = F), .title = 'twb - (ave)-intersection', .labs = '')

See the vis.heatmap function in the Section "Visualisation" for the visualisation of the intersection results.

Functions intersectCount, intersectLogic and intersectIndices are more flexible in terms of choosing which columns to match. They all have parameter .col that specifies names of columns which will used in computing intersection. Function intersectCount returns number of similar elements; intersectIndices(x, y) returns 2-column matrix with the first column stands for an index of an element in the given x, and the second column stands for an index of that element of y which is similar to a relative element in x; intersectLogic(x, y) returns a logical vector of length(x) or nrow(x), where TRUE at position i means that element with index {i} has been found in the y.

# Get logic vector of shared elements, where
# elements are tuples of CDR3 nucleotide sequence and corresponding V-segment
imm.1.2 <- intersectLogic(twb[[1]], twb[[2]],
                           .col = c('CDR3.amino.acid.sequence', 'V.gene'))  
# Get elements which are in both twb[[1]] and twb[[2]].
head(twb[[1]][imm.1.2, c('CDR3.amino.acid.sequence', 'V.gene')])

"Top cross"

Number of shared clonotypes among the most abundant clonotypes may differ signigicantly from those with lesses count. To support research tcR offers the top.cross function, that apply tcR::intersectClonesets, e.g., to the first 1000 clonotypes, 2000, 3000 and so on up to the first 100000 clones, if supplied .n == seq(1000, 100000, 1000).

twb.top <- top.cross(.data = twb, .n = seq(500, 10000, 500), .verbose = F, .norm = T)
top.cross.plot(twb.top)

More complex similarity measures

tcR also provides more complex similarity measures for evaluating the similarity of sets.

Tversky index (repOverlap(your_data, 'tversky') for clonesets or tversky.index for vectors) is an asymmetric similarity measure on sets that compares a variant to a prototype. If using default arguments, it's similar to Dice's coefficient.
Overlap coefficient (repOverlap(your_data, 'overlap') for clonesets or overlap.coef for vectors) is a similarity measure that measures the overlap between two sets, and is defined as the size of the intersection divided by the smaller of the size of the two sets.
Jaccard index (repOverlap(your_data, 'jaccard') for clonesets or jaccard.index for vectors) is a statistic used for comparing the similarity and diversity of sample sets.
Morisita's overlap index (repOverlap(your_data, 'morisita') for clonesets or morisitas.index for other data) is a statistical measure of dispersion of individuals in a population and is used to compare overlap among samples. The formula is based on the assumption that increasing the size of the samples will increase the diversity because it will include different habitats (i.e. different faunas) (Morisita, 1959).

To visualise similarity among repertoires the vis.heatmap function is appropriate.

# Apply the Morisitas overlap index to the each pair of repertoires.
# Use information about V genes (i.e. one CDR3 clonotype is equal to another
# if and only if their CDR3 aa sequences are equal and their V genes are equal)
repOverlap(twb, 'morisita', 'aa', 'read.count', .vgene = T, .verbose = F)

Overlap statistics and tests

Overlap Z-score (OZ-score) - a measure for ???abnormality??? in overlaps

ozScore

Monte Carlo permutation test for pairwise and one-vs-all-wise within- and inter-group differences in a set of repertoires

permutDistTest

pca2euclid

Shared repertoire

To investigate a shared among a several repertoires clonotypes ("shared repertoire") the package provided the shared.repertoire function along with functions for computing the shared repertoire statistics. The shared.representation function computes the number of shared clonotypes for each repertoire for each degree of sharing (i.e., number of people, in which indicated amount of clones have been found). The function shared.summary is equivalent to repOverlap(, 'exact') but applies to the shared repertoire data frame. Measuring distances among repertoires using the cosine similarity on vector of counts of shared sequences is also possible with the cosine.sharing function.

# Compute shared repertoire of amino acid CDR3 sequences and V genes
# which has been found in two or more people and return the Read.count column
# of such clonotypes from each data frame in the input list.
imm.shared <- shared.repertoire(.data = twb, .type = 'avrc', .min.ppl = 2, .verbose = F)
head(imm.shared)
shared.representation(imm.shared)  # Number of shared sequences.

Diversity evaluation

For assessing the distribution of clonotypes in the given repertoire, tcR provides functions for evaluating the diversity (functions diversity and inverse.simpson) and the skewness of the clonal distribution (functions gini and gini.simpson), and a general interface to all of this functions repDiversity, which user should use to estimate the diversity of clonesets. Function diversity (repDiversity(your_clonesets, "div")) computes the ecological diversity index (with parameter .q for penalties for clones with large count). Function inverse.simpson (repDiversity(your_clonesets, "inv.simp")) computes the Inverse Simpson Index (i.e., inverse probability of choosing two similar clonotypes). Function gini (repDiversity(your_clonesets, "gini")) computes the economical Gini index of clonal distribution. Function gini.simpson (repDiversity(your_clonesets, "gini.simp")) computes the Gini-Simpson index. Function chao1 (repDiversity(your_clonesets, "chao1")) computes the Chao1 index, its SD and two 95 perc CI. Function repDiversity accepts single clonesets as well as a list of clonesets. Parameter .quant specifies which column to use for computing the diversity (print ?repDiversity to see more information about input arguments).

# Evaluate the diversity of clones by the ecological diversity index.
repDiversity(twb, 'div', 'read.count')
sapply(twb, function (x) diversity(x$Read.count))

# Compute the diversity as the inverse probability of choosing two similar clonotypes.
repDiversity(twb, 'inv.simp', 'read.prop')
sapply(twb, function (x) inverse.simpson(x$Read.proportion))

# Evaluate the skewness of clonal distribution.
repDiversity(twb, 'gini.simp', 'read.prop')
sapply(twb, function (x) gini.simpson(x$Read.proportion))

# Compute diversity of repertoire using Chao index.
repDiversity(twb, 'chao1', 'read.count')
sapply(twb, function (x) chao1(x$Read.count))

Visualisation

CDR3 length and read count distributions plot

Plots of the distribution of CDR3 nucleotide sequences length (function vis.count.len) and the histogram of counts (function vis.number.count). Input data is either a data frame or a list with data frames. Argument .col specifies column's name with clonotype counts. Argument .ncol specifies a number of columns in a plot with multiple distribution, i.e., if the input data is a list with data frames.

vis.count.len(twb[[1]], .name = "twb[[1]] CDR3 lengths", 
              .col = "Read.count")

# I comment this to avoid a strange bug in ggplot2. Will uncomment later.
# vis.number.count(twb[[1]], .name = "twb[[1]] count distribution")

Top proportions bar plot

For the visualisation of proportions of the most abundant clonotypes in a repertoire tcR offers the vis.top.proportions function. As input the function receives either data frame or a list with data frames (argument .data), an integer vector with number of clonotypes for computing proportions of count for this clonotypes (argument .head), and a column's name with clonotype counts (argument .col).

vis.top.proportions(twb, c(10, 500, 3000, 10000), .col = "Read.count")

Clonal space homeostasis bar plot

For the visualisation of how much space occupied each group of clonotypes, divided into groups by their proportions in the data, use the vis.clonal.space function. As an input it receives the output of the clonal.space.homeostasis function.

twb.space <- clonal.space.homeostasis(twb)
vis.clonal.space(twb.space)

Heat map

Pairwise distances or similarity of repertoires can be represented as qudratic matrices, in which each row and column represented a cloneset, and each value in every cell (i, j) is a distance between repertoires with indices i and j. One way to visalise such matrices is using "heatmaps". For plotting heatmaps in tcR implemented the vis.heatmap function. With changing input arguments user can change names of labs, title and legend.

twb.shared <- repOverlap(twb, "exact", .norm = F, .verbose = F)
vis.heatmap(twb.shared, .title = "Twins shared nuc clonotypes", 
            .labs = c("Sample in x", "Sample in y"), .legend = "# clonotypes")

Radar-like plot

Another way to repsent distances among objects is "radar-like" plots (because this plots is not exactly radar plots) realised in tcR throught the vis.radarlike function. Argument .ncol specifies a number of columns of radar-like plots in a viewport.

twb.js <- js.div.seg(twb, HUMAN_TRBV, .verbose = F) 
vis.radarlike(twb.js, .ncol = 2)

Gene usage histogram

For the visualisation of gene usage tcR employes subroutines for making classical histograms using the vis.gene.usage function. The function accept clonesets, lists of clonesets or output from the geneUsage function. If input is a cloneset(s), then user should specify a gene alphabet (e.g., HUMAN_TRBV) in order to compute the gene usage. Using a parameter \code{.dodge}, user can change type of the output between an output as histograms for each cloneset in the input list (.dodge = F) or an output as an one histogram for all data, which is very useful for comparing distribution of genes (.dodge = T). If .dodge=F and input are lists of clonesets or a gene usage of a few clonesets, than user with argument .ncol can specify how many columns of histograms will be outputted. With .coord.flip user can flip coordinates so genes will be at the left side of the plot.

vis.gene.usage(twb[[1]], HUMAN_TRBV, .main = 'Sample I V-usage')

vis.gene.usage(twb[[2]], HUMAN_TRBV, .main = 'Sample II V-usage', .coord.flip = T)

twb.jusage <- geneUsage(twb, HUMAN_TRBJ)
vis.gene.usage(twb.jusage, .main = 'Twins J-usage', .dodge = T)

vis.gene.usage(twb, HUMAN_TRBJ, .main = 'Twins J-usage', .dodge = F, .ncol = 2)

PCA visualisation

For the visualisation of results from the prcomp function (i.e., objects of class prcomp), tcR provides the vis.pca function. Input arguments for the function are an object of class prcomp and a (if needed) list with groups (vectors of indices of samples) for colouring points in the plot.

twb.pca <- pca.segments(twb, .do.plot = F) 
vis.pca(pca.segments(twb, .do.plot = F, .genes = HUMAN_TRBV), .groups = list(GroupA = c(1,2), GroupB = c(3,4)))

Logo-like plot

Logo-like graphs for visualisation of nucleotide or amino acid motif sequences / profiles.

km <- get.kmers(twb[[1]]$CDR3.amino.acid.sequence, .head = 100, .k = 7, .verbose = F)
d <- kmer.profile(km)
vis.logo(d)

Mutation networks

Mutation network (or a mutation graph) is a graph with vertices representing nucleotide or in-frame amino acid sequences (out-of-frame amino acid sequences will be automatically filtered out by tcR functions for mutation network creating) and edges which connecting pairs of sequences with hamming distance (parameter .method = 'hamm') or edit distance (parameter .method = 'lev') between them no more than specified in the .max.errors function parameter of the mutation.network function. To create a mutation network first what you need is to make a shared repertoires and then apply the mutation.network function to this shared repertoire:

# data(twb)
twb.shared <- shared.repertoire(twb, .head = 1000, .verbose = F)
G <- mutation.network(twb.shared)
G

To manipulate vertex attributes functions \code{set.group.vector} and \code{get.group.names} are provided.

# data(twb)
# twb.shared <- shared.repertoire(twb, .head = 1000)
# G <- mutation.network(twb.shared)
G <- set.group.vector(G, "twins", list(A = c(1,2), B = c(3,4)))  # <= refactor this
get.group.names(G, "twins", 1)
get.group.names(G, "twins", 300)
get.group.names(G, "twins", c(1,2,3), F)
get.group.names(G, "twins", 300, F)

# Because we have only two groups, we can assign more readable attribute.
V(G)$twin.names <- get.group.names(G, "twins")
V(G)$twin.names[1]
V(G)$twin.names[300]

To access neighbour vertices of vertices ("ego-network") use the \code{mutation.neighbours} function:

# data(twb)
# twb.shared <- shared.repertoire(twb, .head = 1000)
# G <- mutation.network(twb.shared)
head(mutated.neighbours(G, 1)[[1]])

Conclusion

Feel free to contact me for the package-related or immunoinformatics research-related questions.

Appendix A: Kmers manipulation and processing

In the package implemented functions for working with k-mers. Function get.kmers generates k-mers from the given chatacter vector or a data frame with columns for sequences and a count for each sequence.

head(get.kmers(twb[[1]]$CDR3.amino.acid.sequence, 100, .meat = F, .verbose = F))
head(get.kmers(twb[[1]], .meat = T, .verbose = F))

Appendix B: Nucleotide and amino acid sequences manipulation

The package also provides a several number of functions for performing classic bioinformatics tasks on strings. For more powerful subroutines see the Bioconductor's Biostrings package.

Nucleotide sequence manipulation

Functions for basic nucleotide sequences manipulations: reverse-complement, translation and GC-content computation. All functions are vectorised.

revcomp(c('AAATTT', 'ACGTTTGGA'))
cbind(bunch.translate(twb[[1]]$CDR3.nucleotide.sequence[1:10]),
      twb[[1]]$CDR3.amino.acid.sequence[1:10])
gc.content(twb[[1]]$CDR3.nucleotide.sequence[1:10])

Reverse translation subroutines

Function codon.variants returns a list of vectors of nucleotide codons for each letter for each input amino acid sequence. Function translated.nucl.sequences returns the number of nucleotide sequences, which, when translated, will result in the given amino acid sequence(s). Function reverse.translation return all nucleotide sequences, which is translated to the given amino acid sequences. Optional argument .nucseq for each of this function provides restriction for nucleotides, which cannot be changed. All functions are vectorised.

codon.variants('LQ')
translated.nucl.sequences(c('LQ', 'CASSLQ'))
reverse.translation('LQ')
translated.nucl.sequences('LQ', 'XXXXXG')
codon.variants('LQ', 'XXXXXG')
reverse.translation('LQ', 'XXXXXG')

Any scripts or data that you put into this service are public.

tcR documentation built on July 2, 2020, 3:18 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

tcR Advanced Data Analysis of Immune Receptor Repertoires

<center> <h1>tcR: a package for T cell receptor and Immunoglobulin repertoires advanced data analysis</h1> In tcR: Advanced Data Analysis of Immune Receptor Repertoires