processVCF: Process VCF into MPRA sequences
In andrewGhazi/mpradesigntools: tools for designing MPRA experiments


processVCF takes a VCF of SNPs (preferably from dbSNP) and turns them into a set of labeled MPRA sequences barcoded with inert n-mers.

processVCF(
  vcf,
  nper,
  upstreamContextRange,
  downstreamContextRange,
  fwprimer,
  revprimer,
  enzyme1 = "GGTACC",
  enzyme2 = "TCTAGA",
  enzyme3 = "GGCCNNNNNGGCC",
  filterPatterns = "AATAAA",
  alter_aberrant = FALSE,
  extra_elements = FALSE,
  max_construct_size = NULL,
  barcode_set = "twelvemers",
  ensure_all_4_nuc = TRUE,
  flip_RV = TRUE,
  outPath = NULL
)

`vcf`	the path to the input VCF
`nper`	the number of barcoded sequences to be generated per allele per SNP
`upstreamContextRange`	the amount of sequence context to acquire upstream of the SNP
`downstreamContextRange`	the amount of sequence context to acquire downstream of the SNP
`fwprimer`	a string containing the forward PCR primer to be used
`revprimer`	a string containing the reverse PCR primer to be used
`enzyme1`	a string containing the pattern for the first restriction enzyme. Defaults to KpnI.
`enzyme2`	a string containing the pattern for the second restriction enzyme. Defaults to XbaI.
`enzyme3`	a string containing the pattern for the third restriction enzyme. Defaults to SfiI.
`filterPatterns`	a character vector of patterns to filter out of the barcode pool (along with their reverse complements)
`alter_aberrant`	under development - logical indicating whether to randomly alter aberrant digestion sites across barcodes
`extra_elements`	under development - logical indicating whether to include the extra TG / GGC as shown on the sequence diagram on the shiny app
`max_construct_size`	under development - integer indicating the maximum construct size to generate. If provided, constructs that end up longer than this have sequence context evenly removed from both sides until sufficiently short.
`barcode_set`	string - indicating the barcode set to use. Alternatively a vector containing custom barcodes. See below for details.
`ensure_all_4_nuc`	logical - if true, barcodes are filtered to only those containing all four nucleotides.
`flip_RV`	logical - if true, take the reverse complement of any alleles with "RV" in the INFO field. This is to account for SNPs that are encoded in terms of the reverse strand alleles in dbSNP.
`outPath`	an optional path stating where to write a .tsv of the results

The "filterPatterns" argument is used to remove barcodes containing patterns that may perform badly in an MPRA setting. For example, the default, 'AATAAA', corresponds to a sequence required for cleavage and polyadenylation of pre-mRNAs in eukaryotic cells.
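
The filtering idea can be sketched in base R as follows. This is a minimal illustration, not the package's internal code: the helper names `rev_comp` and `filter_barcodes` and the example barcodes are assumptions made up for this sketch.

```r
# Reverse complement of one or more DNA strings (A<->T, C<->G), base R only.
rev_comp <- function(x) {
  complemented <- chartr("ACGT", "TGCA", x)
  vapply(strsplit(complemented, ""),
         function(chars) paste(rev(chars), collapse = ""),
         character(1))
}

# Drop barcodes that contain any pattern or the reverse complement of a pattern.
# (Hypothetical helper; processVCF may implement this differently.)
filter_barcodes <- function(barcodes, patterns = "AATAAA") {
  all_patterns <- unique(c(patterns, rev_comp(patterns)))
  hits <- lapply(all_patterns, function(p) grepl(p, barcodes, fixed = TRUE))
  has_hit <- Reduce(`|`, hits)
  barcodes[!has_hit]
}

# Made-up example barcodes: the second contains AATAAA,
# the third contains its reverse complement TTTATT.
barcodes <- c("ACGTACGTACGT", "GCAATAAAGCTC", "CGTTTATTGACG")
kept <- filter_barcodes(barcodes)
print(kept)
```

Only the first barcode survives here, because the other two contain the pattern on one strand or the other.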

The upstreamContextRange and downstreamContextRange arguments are handled intuitively for minus strand SNPs (i.e. those that have the MPRAREV tag). So for a minus strand SNP you get the complement of downstreamContextRange - SNP - upstreamContextRange as the genomic context.

The three enzyme arguments may contain ambiguous nucleotides by including an N character at the appropriate base (for example the 5 N's in the SfiI default).
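
One way to picture how an N behaves in a recognition site is to translate the site into a regular expression. The sketch below is an assumption about how such matching could be done, not a description of the package internals; `site_to_regex` and `has_site` are hypothetical helpers, and the test sequences are invented. It scans only the strand it is given.

```r
# Convert a recognition site with N wildcards into a regular expression,
# e.g. "GGCCNNNNNGGCC" becomes "GGCC[ACGT][ACGT][ACGT][ACGT][ACGT]GGCC".
site_to_regex <- function(site) {
  gsub("N", "[ACGT]", site, fixed = TRUE)
}

# TRUE for each sequence that contains the site on the given strand.
has_site <- function(sequences, site) {
  grepl(site_to_regex(site), sequences)
}

sfi1 <- "GGCCNNNNNGGCC"  # the enzyme3 default from the usage above

# Invented sequences: the first has five bases between the GGCC halves,
# the second has only four.
seqs <- c("TTGGCCACGTAGGCCTT", "TTGGCCACGTGGCCTT")
site_found <- has_site(seqs, sfi1)
print(site_found)
```

The first sequence matches and the second does not, since the spacer between the two GGCC halves must be exactly five bases long.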

The sequence for enzyme3 does not show up in the output sequences. However, it is necessary to check for its presence in the output sequences because it is used when preparing the plasmid library. Aberrant enzyme3 sites could cause the library preparation to fail.

Alternative barcode sets may be used by setting the barcode_set argument of processVCF to one of the following values. The first number indicates the length of the barcodes in base pairs; the second indicates the number of errors that can be corrected while still identifying the original barcode. These sets are provided by the freebarcodes package, detailed in the publication below and available from the subsequently listed GitHub repository. The original barcode set provided with mpradesigntools is available as the twelvemers barcode set. See the README on GitHub for a listing of the number of barcodes available per set. The freebarcodes sets meet the traditional MPRA barcode requirements only to varying degrees:

contains all four nucleotides
doesn't contain runs of 4 or more of the same nucleotide
doesn't contain miR seed sequences
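
When supplying a custom barcode vector, it may help to check it against these three requirements first. The following base R sketch is one possible way to do that; the helper names are hypothetical, and the seed sequence used in the example is an arbitrary placeholder rather than a curated miR seed list, which you would need to supply yourself.

```r
# TRUE where a barcode contains all of A, C, G and T.
has_all_four <- function(barcodes) {
  vapply(strsplit(barcodes, ""),
         function(chars) all(c("A", "C", "G", "T") %in% chars),
         logical(1))
}

# TRUE where a barcode contains a run of `run_length` or more identical bases.
has_long_run <- function(barcodes, run_length = 4) {
  pattern <- sprintf("(.)\\1{%d,}", run_length - 1)
  grepl(pattern, barcodes, perl = TRUE)
}

# TRUE where a barcode contains any of the supplied seed sequences.
has_seed <- function(barcodes, seeds) {
  if (length(seeds) == 0) return(rep(FALSE, length(barcodes)))
  Reduce(`|`, lapply(seeds, function(s) grepl(s, barcodes, fixed = TRUE)))
}

# Invented barcodes and a placeholder seed (NOT a real miR seed list).
barcodes <- c("ACGTACGTACGT", "AAAACGTCGTCG", "ACACACACACAC", "ACGGATTACATG")
placeholder_seeds <- c("GATTACA")

passes <- has_all_four(barcodes) &
  !has_long_run(barcodes) &
  !has_seed(barcodes, placeholder_seeds)
print(data.frame(barcode = barcodes, passes = passes))
```

In this made-up set only the first barcode passes: the second has a run of four A's, the third lacks G and T, and the fourth contains the placeholder seed.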

Alternatively, barcode_set can be a character vector containing a custom set of all barcodes you'd like to use.

A list of two data_frames. The first, named 'result', is a data_frame containing the labeled MPRA sequences. The second, named 'failed', is a data_frame listing the SNPs for which MPRA sequences could not be generated, along with the reason why.
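
A call and a look at the two returned elements might be sketched as below. Every input value here is a placeholder chosen for illustration: the VCF path, the primer strings, the nper value and the context ranges (assumed to be in base pairs) are not recommendations from the package. The call is guarded so the snippet does nothing if the package or the file is unavailable.

```r
# Placeholder inputs; replace with your own values.
vcf_path    <- "my_snps.vcf"   # hypothetical path to a dbSNP-style VCF
fw_primer   <- "ACTGGCCAG"     # placeholder forward primer
rev_primer  <- "CTCGGCGGCC"    # placeholder reverse primer

if (requireNamespace("mpradesigntools", quietly = TRUE) && file.exists(vcf_path)) {
  out <- mpradesigntools::processVCF(
    vcf = vcf_path,
    nper = 5,
    upstreamContextRange = 75,
    downstreamContextRange = 75,
    fwprimer = fw_primer,
    revprimer = rev_primer
  )

  # 'result' holds the labeled MPRA sequences.
  print(head(out$result))

  # 'failed' lists SNPs without generated sequences and the reason why.
  print(out$failed)
}
```

Checking the 'failed' element before ordering oligos is a sensible habit, since SNPs listed there are absent from 'result'.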

andrewGhazi/mpradesigntools documentation built on Dec. 21, 2020, 3:18 p.m.