processVCF
Takes a VCF of SNPs (preferably from dbSNP) and turns them into a set of labeled MPRA sequences barcoded with inert n-mers.
processVCF(
  vcf,
  nper,
  upstreamContextRange,
  downstreamContextRange,
  fwprimer,
  revprimer,
  enzyme1 = "GGTACC",
  enzyme2 = "TCTAGA",
  enzyme3 = "GGCCNNNNNGGCC",
  filterPatterns = "AATAAA",
  alter_aberrant = FALSE,
  extra_elements = FALSE,
  max_construct_size = NULL,
  barcode_set = "twelvemers",
  ensure_all_4_nuc = TRUE,
  flip_RV = TRUE,
  outPath = NULL
)
vcf
    the path to the input VCF
nper
    the number of barcoded sequences to be generated per allele per SNP
upstreamContextRange
    the amount of sequence context to acquire upstream of the SNP
downstreamContextRange
    the amount of sequence context to acquire downstream of the SNP
fwprimer
    a string containing the forward PCR primer to be used
revprimer
    a string containing the reverse PCR primer to be used
enzyme1
    a string containing the pattern for the first restriction enzyme. Defaults to KpnI.
enzyme2
    a string containing the pattern for the second restriction enzyme. Defaults to XbaI.
enzyme3
    a string containing the pattern for the third restriction enzyme. Defaults to SfiI.
filterPatterns
    a character vector of patterns to filter out of the barcode pool (along with their reverse complements)
alter_aberrant
    under development - logical indicating whether to randomly alter aberrant digestion sites across barcodes
extra_elements
    under development - logical indicating whether to include the extra TG / GGC as shown on the sequence diagram on the shiny app
max_construct_size
    under development - integer indicating the maximum construct size to generate. If provided, constructs that end up longer than this have sequence context evenly removed from both sides until sufficiently short.
barcode_set
    string indicating the barcode set to use. Alternatively, a vector containing custom barcodes. See below for details.
ensure_all_4_nuc
    logical - if TRUE, barcodes are filtered to only those containing all four nucleotides.
flip_RV
    logical - if TRUE, take the reverse complement of any alleles with "RV" in the INFO field. This accounts for SNPs that are encoded in terms of the reverse strand alleles in dbSNP.
outPath
    an optional path stating where to write a .tsv of the results
The filterPatterns argument is used to remove barcodes containing patterns that may perform badly in an MPRA setting. For example, the default, 'AATAAA', corresponds to a sequence required for cleavage and polyadenylation of pre-mRNAs in eukaryotic cells.
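The filtering described above can be sketched in base R. This is only an illustration of the idea (drop any barcode containing a pattern or its reverse complement), not the package's internal code; the helper names rev_comp and filter_barcodes and the example barcodes are hypothetical.

```r
# Reverse complement of a DNA string (A/C/G/T only).
rev_comp <- function(seq) {
  complemented <- chartr("ACGT", "TGCA", seq)
  paste(rev(strsplit(complemented, "")[[1]]), collapse = "")
}

# Hypothetical helper: keep only barcodes that contain neither a pattern
# nor the reverse complement of a pattern.
filter_barcodes <- function(barcodes, patterns = "AATAAA") {
  all_patterns <- unique(
    c(patterns, vapply(patterns, rev_comp, character(1), USE.NAMES = FALSE))
  )
  keep <- rep(TRUE, length(barcodes))
  for (p in all_patterns) {
    keep <- keep & !grepl(p, barcodes, fixed = TRUE)
  }
  barcodes[keep]
}

# Made-up barcodes: the first contains AATAAA, the second contains its
# reverse complement (TTTATT), the third contains neither.
barcodes <- c("ACGTAATAAAGC", "ACTTTATTGCAG", "ACGTCAGTCAGT")
kept <- filter_barcodes(barcodes)
print(kept)
```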
The upstreamContextRange and downstreamContextRange arguments are handled intuitively for minus strand SNPs (i.e. those that have the MPRAREV tag). So for a minus strand SNP you get the complement of downstreamContextRange - SNP - upstreamContextRange as the genomic context.
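One way to read the minus strand behaviour is that the plus strand window (upstream, SNP, downstream) is reverse complemented, which yields the complement of downstream - SNP - upstream. The sketch below illustrates that reading only; build_context and rev_comp are hypothetical helpers, the sequences are made up, and it assumes the allele is supplied on the plus strand.

```r
# Reverse complement of a DNA string (A/C/G/T only).
rev_comp <- function(seq) {
  complemented <- chartr("ACGT", "TGCA", seq)
  paste(rev(strsplit(complemented, "")[[1]]), collapse = "")
}

# Hypothetical helper: assemble the genomic context around an allele.
# For a minus strand SNP the plus strand window is reverse complemented
# (an assumption about the behaviour, not taken from the package source).
build_context <- function(upstream, allele, downstream, minus_strand = FALSE) {
  plus <- paste0(upstream, allele, downstream)
  if (minus_strand) rev_comp(plus) else plus
}

plus_context <- build_context("AAC", "G", "TTT")
minus_context <- build_context("AAC", "G", "TTT", minus_strand = TRUE)
print(c(plus = plus_context, minus = minus_context))
```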
The three enzyme arguments may contain ambiguous nucleotides by including an N character at the appropriate base (for example the 5 N's in the SfiI default).
The sequence for enzyme3 does not show up in the output sequences; however, it is necessary to check for its presence in the output sequences because it is used when preparing the plasmid library. Aberrant enzyme3 sites could cause the library preparation to fail.
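A check for recognition sites that allows N as a wildcard can be sketched as follows. This is a minimal base R illustration, not the package's implementation; site_regex and count_sites are hypothetical helpers and the construct sequence is made up.

```r
# Turn a recognition pattern into a regular expression: N matches any base.
site_regex <- function(pattern) {
  gsub("N", "[ACGT]", pattern, fixed = TRUE)
}

# Hypothetical helper: count non-overlapping occurrences of a site on the
# given strand of a sequence.
count_sites <- function(seq, pattern) {
  hits <- gregexpr(site_regex(pattern), seq)[[1]]
  if (hits[1] == -1) 0L else length(hits)
}

# Made-up construct containing one SfiI-like site and one KpnI site.
construct <- "ACGTGGCCATGCAGGCCTTAGGTACCAA"
n_sfii <- count_sites(construct, "GGCCNNNNNGGCC")
n_kpni <- count_sites(construct, "GGTACC")
n_xbai <- count_sites(construct, "TCTAGA")
print(c(SfiI = n_sfii, KpnI = n_kpni, XbaI = n_xbai))
```

The three default patterns are palindromic, so searching one strand is enough for them; a non-palindromic custom pattern would also need its reverse complement searched.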
Alternative barcode sets may be used by setting the barcode_set argument of processVCF to one of the following values. The first number indicates the length of the barcodes in base pairs; the second indicates the number of errors correctable while still being able to identify the original barcode. These are provided by the freebarcodes package, detailed at the publication below and available from the subsequently listed github repository. The original barcode set provided with mpradesigntools is available as the twelvemers barcode set. See the README on github for a listing of the number of barcodes available per set. The freebarcodes sets only meet the traditional MPRA barcode requirements to varying degrees. Under those requirements, a barcode:
- contains all four nucleotides
- doesn't contain runs of 4 or more of the same nucleotide
- doesn't contain miR seed sequences
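The three requirements listed above can be checked on a candidate barcode set with a few lines of base R. This is a sketch under stated assumptions: meets_requirements is a hypothetical helper, the barcodes are made up, and the seed string passed in is a placeholder rather than a real miR seed list.

```r
# Hypothetical helper: TRUE for barcodes that contain all four nucleotides,
# have no run of 4 or more identical bases, and contain none of the
# supplied seed sequences.
meets_requirements <- function(barcodes, seeds = character(0)) {
  all_four <- grepl("A", barcodes, fixed = TRUE) &
    grepl("C", barcodes, fixed = TRUE) &
    grepl("G", barcodes, fixed = TRUE) &
    grepl("T", barcodes, fixed = TRUE)
  long_run <- grepl("A{4,}|C{4,}|G{4,}|T{4,}", barcodes)
  seed_hit <- rep(FALSE, length(barcodes))
  for (s in seeds) {
    seed_hit <- seed_hit | grepl(s, barcodes, fixed = TRUE)
  }
  all_four & !long_run & !seed_hit
}

# Made-up barcodes: passes, has an AAAA run, lacks T, contains the placeholder seed.
barcodes <- c("ACGTACGTACGT", "ACGAAAATCGTC", "ACGACGACGACG", "ACGTGGCATTCA")
ok <- meets_requirements(barcodes, seeds = "GGCATT")  # placeholder seed
print(barcodes[ok])
```

A vector filtered this way could then be supplied as a custom barcode_set.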
Alternatively, barcode_set can be a character vector containing a custom set of all barcodes you'd like to use.
A list of two data_frames. The first, named 'result', is a data_frame containing the labeled MPRA sequences. The second, named 'failed', is a data_frame listing the SNPs for which MPRA sequences could not be generated and the reason why.
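A call and a first look at the returned list might look like the sketch below. The VCF path, the context ranges, nper, and both primer strings are placeholders chosen for illustration, not recommended values; the call is guarded so the snippet does nothing if the package or the file is unavailable.

```r
# Hypothetical input path; replace with a real VCF of SNPs.
vcf_path <- "my_snps.vcf"

if (requireNamespace("mpradesigntools", quietly = TRUE) && file.exists(vcf_path)) {
  designed <- mpradesigntools::processVCF(
    vcf = vcf_path,
    nper = 5,                      # barcoded sequences per allele per SNP
    upstreamContextRange = 75,     # placeholder context sizes
    downstreamContextRange = 75,
    fwprimer = "ACTGGCCAG",        # placeholder primer
    revprimer = "CTCGGCGGCC"       # placeholder primer
  )

  # 'result' holds the labeled MPRA sequences, 'failed' the SNPs that
  # could not be processed along with the reason.
  print(head(designed$result))
  print(designed$failed)
}
```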