processVCF: Process VCF into MPRA sequences

Description

processVCF takes a VCF of SNPs (preferably from dbSNP) and turns them into a set of labeled MPRA sequences barcoded with inert n-mers.

Usage

processVCF(
  vcf,
  nper,
  upstreamContextRange,
  downstreamContextRange,
  fwprimer,
  revprimer,
  enzyme1 = "GGTACC",
  enzyme2 = "TCTAGA",
  enzyme3 = "GGCCNNNNNGGCC",
  filterPatterns = "AATAAA",
  alter_aberrant = FALSE,
  extra_elements = FALSE,
  max_construct_size = NULL,
  barcode_set = "twelvemers",
  ensure_all_4_nuc = TRUE,
  flip_RV = TRUE,
  outPath = NULL
)
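
A call might look like the following sketch. The VCF path, primer sequences, and context sizes are placeholders chosen for illustration, not recommended values. The call is guarded so that the snippet does nothing unless mpradesigntools is installed and the file exists.

```r
# Placeholder inputs: replace with your own VCF path and primers.
vcf_path <- "my_snps.vcf"   # hypothetical file name
fw <- "ACTGGCCAG"           # hypothetical forward primer
rv <- "CTCGGCGGCC"          # hypothetical reverse primer

# Only the required arguments are set; the rest keep their defaults.
args <- list(
  vcf = vcf_path,
  nper = 5,
  upstreamContextRange = 75,
  downstreamContextRange = 75,
  fwprimer = fw,
  revprimer = rv
)

if (requireNamespace("mpradesigntools", quietly = TRUE) && file.exists(vcf_path)) {
  out <- do.call(mpradesigntools::processVCF, args)
  str(out, max.level = 1)   # inspect the top level of the returned list
}
```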

Arguments

vcf

the path to the input VCF

nper

the number of barcoded sequences to be generated per allele per SNP

upstreamContextRange

the amount of sequence context to acquire upstream of the SNP

downstreamContextRange

the amount of sequence context to acquire downstream of the SNP

fwprimer

a string containing the forward PCR primer to be used

revprimer

a string containing the reverse PCR primer to be used

enzyme1

a string containing the pattern for the first restriction enzyme. Defaults to KpnI.

enzyme2

a string containing the pattern for the second restriction enzyme. Defaults to XbaI.

enzyme3

a string containing the pattern for the third restriction enzyme. Defaults to SfiI.

filterPatterns

a character vector of patterns to filter out of the barcode pool (along with their reverse complements)

alter_aberrant

under development - logical indicating whether to randomly alter aberrant digestion sites across barcodes

extra_elements

under development - logical indicating whether to include the extra TG / GGC as shown on the sequence diagram on the shiny app

max_construct_size

under development - integer indicating the maximum construct size to generate. If provided, constructs that end up longer than this have sequence context evenly removed from both sides until sufficiently short.

barcode_set

a string indicating the barcode set to use, or alternatively a character vector containing custom barcodes. See Details below.

ensure_all_4_nuc

logical - if true, barcodes are filtered to only those containing all four nucleotides.

flip_RV

logical - if true, take the reverse complement of any alleles with "RV" in the INFO field. This is to account for SNPs that are encoded in terms of the reverse strand alleles in dbSNP.

outPath

an optional path stating where to write a .tsv of the results
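
The even trimming described for max_construct_size can be pictured with a small base R sketch. This is not the package's code: trim_context is a hypothetical helper, and giving the extra base to the 3' side when the excess is odd is an assumption, since the argument is still under development.

```r
# Hypothetical helper: remove `n_remove` bases of context, split evenly
# between the two ends. Assumption: an odd excess drops one more base
# from the 3' end than from the 5' end.
trim_context <- function(context, n_remove) {
  left  <- n_remove %/% 2
  right <- n_remove - left
  substr(context, left + 1, nchar(context) - right)
}

trimmed_even <- trim_context("AAACCCGGGTTT", 4)
trimmed_odd  <- trim_context("AAACCCGGGTTT", 3)
```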

Details

The "filterPatterns" argument is used to remove barcodes containing patterns that may perform badly in an MPRA setting. For example, the default, 'AATAAA', corresponds to a sequence required for cleavage and polyadenylation of pre-mRNAs in eukaryotic cells.
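
The kind of filtering that filterPatterns describes can be sketched in base R as follows. rev_comp and filter_barcodes are hypothetical helpers written for this illustration, not functions from mpradesigntools, and the barcodes are made up.

```r
# Reverse complement of DNA strings (A/C/G/T only).
rev_comp <- function(x) {
  comp <- chartr("ACGT", "TGCA", x)
  vapply(strsplit(comp, ""), function(ch) paste(rev(ch), collapse = ""), character(1))
}

# Drop barcodes containing any pattern or the reverse complement of any pattern.
filter_barcodes <- function(barcodes, patterns) {
  all_patterns <- unique(c(patterns, rev_comp(patterns)))
  hit <- vapply(
    barcodes,
    function(bc) any(vapply(all_patterns, grepl, logical(1), x = bc, fixed = TRUE)),
    logical(1)
  )
  barcodes[!hit]
}

# Made-up barcodes: the second contains AATAAA, the third its reverse complement TTTATT.
barcodes <- c("ACGTACGTACGT", "GGAATAAACCGT", "CCTTTATTGGCA", "TGCATGCATGCA")
kept <- filter_barcodes(barcodes, "AATAAA")
```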

The upstreamContextRange and downstreamContextRange arguments are handled intuitively for minus strand SNPs (i.e. those that have the MPRAREV tag). So for a minus strand SNP you get the complement of downstreamContextRange - SNP - upstreamContextRange as the genomic context.
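
One reading of this in genomic coordinates is sketched below. context_window is a hypothetical helper, and the assumption that upstream of a minus strand SNP lies at higher coordinates is an interpretation of the paragraph above rather than something taken from the package source. The sketch covers only the window boundaries, not the strand conversion.

```r
# Hypothetical helper: 1-based, inclusive genomic window around a SNP at `pos`.
# Assumption: for a minus strand SNP, upstream lies at higher coordinates,
# so the two ranges swap sides.
context_window <- function(pos, upstream, downstream, minus_strand = FALSE) {
  if (minus_strand) {
    c(start = pos - downstream, end = pos + upstream)
  } else {
    c(start = pos - upstream, end = pos + downstream)
  }
}

plus_win  <- context_window(1000, upstream = 75, downstream = 50)
minus_win <- context_window(1000, upstream = 75, downstream = 50, minus_strand = TRUE)
```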

The three enzyme arguments may contain ambiguous nucleotides by including an N character at the appropriate base (for example the 5 N's in the SfiI default).

The sequence for enzyme3 does not appear in the output sequences. It is still necessary to check the output sequences for its presence, because enzyme3 is used when preparing the plasmid library and aberrant enzyme3 sites could cause the library preparation to fail.
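
The N handling and the site check described above can be sketched in base R. site_to_regex and has_site are hypothetical helpers, not part of mpradesigntools, and the test sequences are made up. Only the given strand is searched, which is enough for the three default sites because each is its own reverse complement.

```r
# Turn an enzyme pattern into a regular expression: each N matches any base.
site_to_regex <- function(site) gsub("N", "[ACGT]", site, fixed = TRUE)

# TRUE if the sequence contains at least one match to the enzyme pattern.
has_site <- function(seq, site) grepl(site_to_regex(site), seq)

sfi <- "GGCCNNNNNGGCC"
with_site    <- has_site("TTGGCCACGTAGGCCAA", sfi)  # five bases between the GGCC halves
without_site <- has_site("TTGGCCACGTGGCCAA", sfi)   # only four bases between them
kpn_site     <- has_site("AAGGTACCTT", "GGTACC")
```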

Alternative barcode sets may be used by setting the barcode_set argument of processVCF to one of the following values. The first number indicates the length of the barcodes in base pairs; the second indicates the number of errors that can be corrected while still identifying the original barcode. These sets are provided by the freebarcodes package, detailed at the publication below and available from the subsequently listed GitHub repository. The original barcode set provided with mpradesigntools is available as the twelvemers barcode set. See the README on GitHub for a listing of the number of barcodes available per set. The freebarcodes sets meet the traditional MPRA barcode requirements only to varying degrees.

Alternatively, barcode_set can be a character vector containing a custom set of all barcodes you'd like to use.
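
A custom vector could be prepared along these lines. The candidate barcodes are made up, and has_all_four is a hypothetical helper that mirrors the check described for ensure_all_4_nuc. It is not the package's own implementation.

```r
# TRUE for barcodes that contain each of A, C, G and T at least once.
has_all_four <- function(barcodes) {
  vapply(
    strsplit(barcodes, ""),
    function(ch) all(c("A", "C", "G", "T") %in% ch),
    logical(1)
  )
}

# Made-up candidates: the second lacks G and T, the fourth lacks C.
candidates <- c("ACGTACGTACGT", "AACCAACCAACC", "TGCATTGACCGA", "GGGGTTTTAAAA")
custom_barcodes <- candidates[has_all_four(candidates)]

# The vector can then be supplied as barcode_set = custom_barcodes.
```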

Value

A list of two data_frames. The first, named 'result', contains the labeled MPRA sequences. The second, named 'failed', lists the SNPs for which MPRA sequences could not be generated, along with the reason why.
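
Working with the returned list might look like the sketch below. The object out is a hand-built stand-in for the return value, and its column names and contents are hypothetical. Only the element names 'result' and 'failed' come from this page.

```r
# Stand-in for the value returned by processVCF(); columns are hypothetical.
out <- list(
  result = data.frame(ID = c("snp1", "snp1"), sequence = c("ACGT", "ACGA")),
  failed = data.frame(ID = "snp2", reason = "placeholder reason")
)

n_sequences <- nrow(out$result)
n_failed    <- nrow(out$failed)

# Surface failures rather than silently ignoring them.
if (n_failed > 0) {
  message(n_failed, " SNP(s) could not be converted; see out$failed")
}
```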


andrewGhazi/mpradesigntools documentation built on Dec. 21, 2020, 3:18 p.m.