View source: R/MutationProfiling.R
collapseClones | R Documentation |
collapseClones
creates effective input and germline sequences for each clonal
group and appends columns containing the consensus sequences to the input
data.frame
.
collapseClones(
db,
cloneColumn = "clone_id",
sequenceColumn = "sequence_alignment",
germlineColumn = "germline_alignment_d_mask",
muFreqColumn = NULL,
regionDefinition = NULL,
method = c("mostCommon", "thresholdedFreq", "catchAll", "mostMutated", "leastMutated"),
minimumFrequency = NULL,
includeAmbiguous = FALSE,
breakTiesStochastic = FALSE,
breakTiesByColumns = NULL,
expandedDb = FALSE,
nproc = 1,
juncLengthColumn = "junction_length",
fields = NULL
)
db |
|
cloneColumn |
|
sequenceColumn |
|
germlineColumn |
|
muFreqColumn |
|
regionDefinition |
RegionDefinition object defining the regions
and boundaries of the Ig sequences. Optional. Default is
|
method |
method for calculating input consensus sequence. Required.
One of |
minimumFrequency |
frequency threshold for calculating input consensus sequence.
Applicable to and required for the |
includeAmbiguous |
whether to use ambiguous characters to represent positions
at which there are multiple characters with frequencies that
are at least |
breakTiesStochastic |
In case of ties, whether to randomly pick a sequence from
sequences that fulfill the criteria as consensus. Applicable
to and required for all methods except for |
breakTiesByColumns |
A list of the form
|
expandedDb |
|
nproc |
Number of cores to distribute the operation over. If the
|
juncLengthColumn |
|
fields |
additional fields used for grouping. Use sample_id, to avoid combining sequences with the same clone_id that belong to different sample_id. |
A modified db
with the following additional columns:
clonal_sequence
: effective sequence for the clone.
clonal_germline
: germline sequence for the clone.
clonal_sequence_mufreq
: mutation frequency of
clonal_sequence
; only added for the "mostMutated"
and "leastMutated"
methods.
clonal_sequence
is generated with the method of choice indicated
by method
, and clonal_germline
is generated with the
"mostCommon"
method, along with, where applicable, user-defined
parameters such as minimumFrequency
, includeAmbiguous
,
breakTiesStochastic
, and breakTiesByColumns
.
For each clone, clonal_sequence
and
clonal_germline
have the same length.
For the "thresholdedFreq"
, "mostCommon"
, and
"catchAll"
methods:
The length of the consensus sequences is determined by the longest possible
consensus sequence (baesd on inputSeq
and germlineSeq
) and
regionDefinition@seqLength
(if supplied), whichever is shorter.
Given a set of sequences of potentially varying lengths, the longest possible
length of their consensus sequence is taken to be the longest length along
which there is information contained at every nucleotide position across
majority of the sequences. Majority is defined to be greater than
floor(n/2)
, where n
is the number of sequences. If the longest
possible consensus length is 0, there will be a warning and an empty string
(""
) will be returned.
If a length limit is defined by supplying a regionDefinition
via
regionDefinition@seqLength
, the consensus length will be further
restricted to the shorter of the longest possible length and
regionDefinition@seqLength
.
For the "mostMutated"
and "leastMutated"
methods:
The length of the consensus sequences depends on that of the most/least
mutated input sequence, and, if supplied, the length limit defined by
regionDefinition@seqLength
, whichever is shorter. If the germline
consensus computed using the "mostCommon"
method is longer than
the most/least mutated input sequence, the germline consensus is trimmed
to be of the same length as the input consensus.
The descriptions below use "sequences" as a generalization of input sequences and germline sequences.
method="thresholdedFreq"
A threshold must be supplied to the argument minimumFrequency
. At
each position along the length of the consensus sequence, the frequency
of each nucleotide/character across sequences is tabulated. The
nucleotide/character whose frequency is at least (i.e. >=
)
minimumFrequency
becomes the consensus; if there is none, the
consensus nucleotide will be "N"
.
When there are ties (frequencies of multiple nucleotides/characters
are at least minimumFrequency
), this method can be deterministic
or stochastic, depending on additional parameters.
With includeAmbiguous=TRUE
, ties are resolved
deterministically by representing ties using ambiguous
characters. See "Choosing ambiguous characters" for how
ambiguous characters are chosen.
With breakTiesStochastic=TRUE
, ties are resolved
stochastically by randomly picking a character amongst the
ties.
When both TRUE
, includeAmbiguous
takes
precedence over breakTiesStochastic
.
When both FALSE
, the first character from the ties is
taken to be the consensus following the order of "A"
,
"T"
, "G"
, "C"
, "N"
, "."
,
and "-"
.
Below are some examples looking at a single position based on 5
sequences with minimumFrequency=0.6
,
includeAmbiguous=FALSE
, and breakTiesStochastic=FALSE
:
If the sequences have "A"
, "A"
, "A"
,
"T"
, "C"
, the consensus will be "A"
,
because "A"
has frequency 0.6, which is at least
minimumFrequency
.
If the sequences have "A"
, "A"
, "T"
,
"T"
, "C"
, the consensus will be "N"
,
because none of "A"
, "T"
, or "C"
has
frequency that is at least minimumFrequency
.
method="mostCommon"
The most frequent nucleotide/character across sequences at each position along the length of the consensus sequence makes up the consensus.
When there are ties (multiple nucleotides/characters with equally
maximal frequencies), this method can be deterministic or stochastic,
depending on additional parameters. The same rules for breaking ties
for method="thresholdedFreq"
apply.
Below are some examples looking at a single position based on 5
sequences with includeAmbiguous=FALSE
, and
breakTiesStochastic=FALSE
:
If the sequences have "A"
, "A"
, "T"
,
"A"
, "C"
, the consensus will be "A"
.
If the sequences have "T"
, "T"
, "C"
,
"C"
, "G"
, the consensus will be "T"
,
because "T"
is before "C"
in the order of
"A"
, "T"
, "G"
, "C"
, "N"
,
"."
, and "-"
.
method="catchAll"
This method returns a consensus sequence capturing most of the information contained in the sequences. Ambiguous characters are used where applicable. See "Choosing ambiguous characters" for how ambiguous characters are chosen. This method is deterministic and does not involve breaking ties.
Below are some examples for method="catchAll"
looking at a
single position based on 5 sequences:
If the sequences have "N"
, "N"
, "N"
,
"N"
, "N"
, the consensus will be "N"
.
If the sequences have "N"
, "A"
, "A"
,
"A"
, "A"
, the consensus will be "A"
.
If the sequences have "N"
, "A"
, "G"
,
"A"
, "A"
, the consensus will be "R"
.
If the sequences have "-"
, "-"
, "."
,
"."
, "."
, the consensus will be "-"
.
If the sequences have "-"
, "-"
, "-"
,
"-"
, "-"
, the consensus will be "-"
.
If the sequences have "."
, "."
, "."
,
"."
, "."
, the consensus will be "."
.
method="mostMutated"
and method="leastMutated"
These methods return the most/least mutated sequence as the consensus sequence.
When there are ties (multple sequences have the maximal/minimal mutation frequency), this method can be deterministic or stochastic, depending on additional parameters.
With breakTiesStochastic=TRUE
, ties are resolved
stochastically by randomly picking a sequence out of
sequences with the maximal/minimal mutation frequency.
When breakTiesByColumns
is supplied, ties are resolved
deterministically. Column by column, a function is applied on
the column and sequences with column value matching the
functional value are retained, until ties are resolved or
columns run out. In the latter case, the first remaining
sequence is taken as the consensus.
When breakTiesStochastic=TRUE
and
breakTiesByColumns
is also supplied,
breakTiesStochastic
takes precedence over
breakTiesByColumns
.
When breakTiesStochastic=FALSE
and
breakTiesByColumns
is not supplied (i.e. NULL
),
the sequence that appears first amongst the ties is taken
as the consensus.
Ambiguous characters may be present in the returned consensuses when using the
"catchAll"
method and when using the "thresholdedFreq"
or
"mostCommon"
methods with includeAmbiguous=TRUE
.
The rules on choosing ambiguous characters are as follows:
If a position contains only "N"
across sequences, the consensus
at that position is "N"
.
If a position contains one or more of "A"
, "T"
,
"G"
, or "C"
, the consensus will be an IUPAC character
representing all of the characters present, regardless of whether
"N"
, "-"
, or "."
is present.
If a position contains only "-"
and "."
across sequences,
the consensus at that position is taken to be "-"
.
If a position contains only one of "-"
or "."
across
sequences, the consensus at that position is taken to be the character
present.
Note that this function does not perform multiple sequence alignment.
As a prerequisite, it is assumed that the sequences in
sequenceColumn
and germlineColumn
have been aligned
somehow. In the case of immunoglobulin repertoire analysis, this
usually means that the sequences are IMGT-gapped.
When using the "mostMutated"
and "leastMutated"
methods,
if you supply both muFreqColumn
and regionDefinition
,
it is your responsibility to ensure that the mutation frequency in
muFreqColumn
was calculated with sequence lengths restricted
to the same regionDefinition
you are supplying.
Otherwise, the "most/least mutated" sequence you obtain might not
be the most/least mutated given the regionDefinition
supplied,
because your mutation frequency was based on a
regionDefinition
different from the one supplied.
If you intend to run collapseClones
before
building a 5-mer targeting model, you must choose
parameters such that your collapsed clonal consensuses do
not include ambiguous characters. This is because the
targeting model functions do NOT support ambiguous characters
in their inputs.
See IMGT_SCHEMES for a set of predefined RegionDefinition objects.
# Subset example data
data(ExampleDb, package="alakazam")
db <- subset(ExampleDb, c_call %in% c("IGHA", "IGHG") & sample_id == "+7d" &
clone_id %in% c("3100", "3141", "3184"))
# thresholdedFreq method, resolving ties deterministically without using ambiguous characters
clones <- collapseClones(db, cloneColumn="clone_id", sequenceColumn="sequence_alignment",
germlineColumn="germline_alignment_d_mask",
method="thresholdedFreq", minimumFrequency=0.6,
includeAmbiguous=FALSE, breakTiesStochastic=FALSE)
# mostCommon method, resolving ties deterministically using ambiguous characters
clones <- collapseClones(db, cloneColumn="clone_id", sequenceColumn="sequence_alignment",
germlineColumn="germline_alignment_d_mask",
method="mostCommon",
includeAmbiguous=TRUE, breakTiesStochastic=FALSE)
# Make a copy of db that has a mutation frequency column
db2 <- observedMutations(db, frequency=TRUE, combine=TRUE)
# mostMutated method, resolving ties stochastically
clones <- collapseClones(db2, cloneColumn="clone_id", sequenceColumn="sequence_alignment",
germlineColumn="germline_alignment_d_mask",
method="mostMutated", muFreqColumn="mu_freq",
breakTiesStochastic=TRUE, breakTiesByColumns=NULL)
# mostMutated method, resolving ties deterministically using additional columns
clones <- collapseClones(db2, cloneColumn="clone_id", sequenceColumn="sequence_alignment",
germlineColumn="germline_alignment_d_mask",
method="mostMutated", muFreqColumn="mu_freq",
breakTiesStochastic=FALSE,
breakTiesByColumns=list(c("duplicate_count"), c(max)))
# Build consensus for V segment only
# Capture all nucleotide variations using ambiguous characters
clones <- collapseClones(db, cloneColumn="clone_id", sequenceColumn="sequence_alignment",
germlineColumn="germline_alignment_d_mask",
method="catchAll", regionDefinition=IMGT_V)
# Return the same number of rows as the input
clones <- collapseClones(db, cloneColumn="clone_id", sequenceColumn="sequence_alignment",
germlineColumn="germline_alignment_d_mask",
method="mostCommon", expandedDb=TRUE)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.