View source: R/linkMultipleVariants.R
linkMultipleVariants | R Documentation |
This function enables the processing of data sets with multiple
variable sequences that may need to be handled in different ways.
An example is a barcode association experiment
with two variable sequences (the barcode and the biological variant)
that need to be processed differently, e.g. in terms of matching to
wildtype sequences or collapsing of similar sequences.
In contrast, while digestFastqs allows the specification
of multiple variable sequences (within each of the forward and reverse
reads), it concatenates them and processes them as a single unit.
linkMultipleVariants(combinedDigestParams = list(), ...)
combinedDigestParams |
A named list of arguments to
|
... |
Additional arguments providing arguments to |
linkMultipleVariants will process the input in the following way:
First, run digestFastqs with the parameters provided
in combinedDigestParams. Typically, this will be a
"naive" counting run, where the frequencies of all observed
variants are tabulated. The variable sequences
within the forward and reverse reads, respectively, will be
processed as a single sequence.
Next, run digestFastqs with each of the additional
parameter sets provided (...). Each of these should
correspond to a single variable sequence from the combined
run (i.e., if there are two Vs in the element specifications
in the combined run, there should be two additional
parameter sets provided, each corresponding to the
processing of one variable sequence part). It is assumed
that the order of the additional arguments corresponds to the
order of the variable sequences in the combined run, in such a way
that if the variable sequences extracted in each of the separate
runs are concatenated in the order that the parameter sets are
provided to linkMultipleVariants, they will form the variable
sequence extracted in the combined run.
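As a rough illustration of the two kinds of runs described above, the following sketch builds one parameter list for the combined run and one for each variable sequence. The FASTQ path, the read layout (a 10-nt barcode, an 18-nt constant region and a 96-nt variant) and the argument names `barcode` and `variant` are hypothetical. The digestFastqs argument names `fastqForward` and `elementsLengthsForward` are assumed here and should be checked against the digestFastqs help page, which may also require further arguments.

```r
# Hypothetical input file and read layout (assumptions for illustration only).
fq <- "path/to/reads_R1.fastq.gz"

# Combined ("naive") run: both variable sequences (V) are kept,
# separated by a constant region (C).
combinedParams <- list(
  fastqForward = fq,                      # assumed digestFastqs argument name
  elementsForward = "VCV",
  elementsLengthsForward = c(10, 18, 96)  # assumed digestFastqs argument name
)

# Separate runs: keep one V at a time and skip (S) the other.
# The order (barcode first, variant second) matches the order of the
# variable sequences in the combined run.
barcodeParams <- modifyList(combinedParams, list(elementsForward = "VCS"))
variantParams <- modifyList(combinedParams, list(elementsForward = "SCV"))

# Only attempt the call if the package and the (hypothetical) file exist.
if (requireNamespace("mutscan", quietly = TRUE) && file.exists(fq)) {
  res <- mutscan::linkMultipleVariants(
    combinedDigestParams = combinedParams,
    barcode = barcodeParams,
    variant = variantParams
  )
  names(res)
}
```

In practice, each separate parameter list would also carry the settings specific to that variable sequence (for example wildtype matching for the variant, or collapsing of similar sequences for the barcode).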
The result of each of the separate runs is a 'conversion table', containing the final set of identified sequence variants as well as all individual sequences corresponding to each of them. This is then combined with the count table from the combined, "naive" run in order to create an aggregated count table. More precisely, each sequence in the combined run is split into its constituent variable sequences, and each variable sequence is then matched to the output from the corresponding separate run, from which the final feature ID (mutant name, or collapsed sequence) is extracted and used to replace the original sequence in the combined count table. Once all the matches are done, rows with NAs (where no match could be found in the separate run) are removed and the counts are aggregated across all identical combinations of variable sequences.
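The split, match, filter and aggregate logic can be sketched in base R on toy data. This is not the package's implementation. The table layout, the column names, the sequence lengths and the feature IDs below are all made up for illustration.

```r
# Toy count table from a hypothetical combined run. Each sequence is a
# 4-nt barcode followed by a 3-nt variant.
combined <- data.frame(
  sequence = c("AAAAGCT", "AAATGCT", "CCCCGAT", "AAAAGGG"),
  nbrReads = c(10, 2, 5, 1),
  stringsAsFactors = FALSE
)
partLengths <- c(barcode = 4, variant = 3)

# Toy conversion tables: observed sequence -> final feature ID.
# AAAT is collapsed into AAAA; GGG was filtered out in the variant run.
conv <- list(
  barcode = c(AAAA = "AAAA", AAAT = "AAAA", CCCC = "CCCC"),
  variant = c(GCT = "WT", GAT = "mut1")
)

# Split each combined sequence into its parts and look up the feature ID.
ends <- cumsum(partLengths)
starts <- ends - partLengths + 1
parts <- lapply(names(partLengths), function(nm) {
  piece <- substr(combined$sequence, starts[[nm]], ends[[nm]])
  unname(conv[[nm]][piece])  # NA where the separate run has no match
})
names(parts) <- names(partLengths)

linked <- data.frame(parts, nbrReads = combined$nbrReads,
                     stringsAsFactors = FALSE)

# Drop rows with an unmatched part, then sum counts per combination.
linked <- linked[complete.cases(linked), ]
countAggregated <- aggregate(nbrReads ~ barcode + variant,
                             data = linked, FUN = sum)
```

Here the reads for AAAAGCT and AAATGCT end up in the same row, because both barcodes map to the same feature ID, while AAAAGGG is dropped because its variant part has no match.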
In order to define the elementsForward and elementsReverse
arguments for the separate runs, a strategy that often works is to simply
copy the arguments from the combined run, and successively replace all
but one of the 'V's by 'S'. This will effectively process one variable
sequence at a time, while keeping all other elements of the reads
consistent (since this can affect e.g. filtering criteria). Note that
to process individual variable sequences in the reverse read, you also
need to swap the 'forward' and 'reverse' specifications (since
digestFastqs requires a forward read).
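The "replace all but one V by S" strategy can be expressed as a small helper. The function `isolateVariable` below is hypothetical and not part of the package. It only rewrites the element string and does not handle the forward/reverse swap mentioned above.

```r
# Hypothetical helper: keep the `keep`-th 'V' in an element specification
# and replace every other 'V' by 'S' (skip).
isolateVariable <- function(elements, keep) {
  chars <- strsplit(elements, "")[[1]]
  vPos <- which(chars == "V")
  stopifnot(length(vPos) >= 1, keep >= 1, keep <= length(vPos))
  chars[vPos[-keep]] <- "S"
  paste(chars, collapse = "")
}

# Element specification of a hypothetical combined run.
combinedElements <- "CVCVS"

# One specification per variable sequence, in the order in which the
# parameter sets would be passed to linkMultipleVariants.
nV <- lengths(regmatches(combinedElements, gregexpr("V", combinedElements)))
separateElements <- vapply(seq_len(nV), function(i) {
  isolateVariable(combinedElements, i)
}, character(1))
separateElements
```

Because the element lengths are left untouched, the corresponding elementsLengths argument can be copied unchanged from the combined run.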
A list with the following elements:
countAggregated - a tibble with columns corresponding to
each of the variable sequences, and a column with the total observed
read count for the combination.
convSeparate - a list of conversion tables from the respective separate runs.
outCombined - the digestFastqs output for the combined run.
Charlotte Soneson, Michael Stadler