assayize: Prepare SNPs for assay orders (calculate SNP flanking...


View source: R/functions.R

Description

This function takes, as input, data frames of:

  1. all the known variation in different (typically short) contigs,

  2. the SNPs you want to turn into assays,

  3. the sequences of all the contigs (or at least the ones that you want to turn into assays).

It returns information needed for screening variation and ordering SNP assays.

Usage

assayize(
  V,
  targets,
  consSeq = NULL,
  reqDist = 20,
  reqDistFromEnd = 40,
  GCmax = 0.65,
  allVar = TRUE
)

Arguments

V

A data frame of all the variants detected in sequences from individuals. Typically this will come from a VCF file (possibly just the first few columns). It can have any number of columns, but it must have the following:

CHROM

The name of the contiguous piece of DNA that the variant is found in. This will typically be a RAD locus identifier, etc.

POS

The position, starting from 1, of this particular variant within the CHROM DNA.

REF

The nucleotide base of the reference sequence at each of the positions. Must be uppercase A, C, G, or T.

ALT

The alternate base(s) at each of the positions. Must be uppercase A, C, G, or T. If more than one alternate base is present, they must be listed comma-separated, and there can be no more than 3 alternate bases. (A "-" or "." is perhaps also possible, though in those cases you probably don't want to be developing a SNP assay anyway.)

Any columns other than CHROM, POS, REF, and ALT will be ignored, with one exception: if consSeq is NULL, then V must also have a LENGTH column that gives the number of bases in the contig.

targets

A data frame of the variants that are targets for assay development. It can have any additional columns you like, but it must have CHROM and POS, concordant with V, and it cannot have a column named LENGTH. The extra columns in this data frame will all be represented in the output.

consSeq

A data frame of consensus sequences. Must have one column CHROM, exactly concordant with the naming conventions of V and targets, and one column Seq which holds the consensus sequences as strings. The sequence should consist of capital A, C, G, or T. There can be more columns, but they will be ignored. If this is NULL, then the function returns just the flanking information summaries, without the sequence from which to build assays. However, in that case, there must be a LENGTH column in V.

reqDist

The required minimum number of bases between a target SNP and the nearest flanking SNP for a target to be designable (according to whatever design criterion applies).

reqDistFromEnd

The required minimum number of bases between the SNP site and the start or end of the contig for the SNP to be designable as an assay.

GCmax

The maximum GC content of the reference contig allowable for an assay to still be considered designable.

allVar

Logical indicating whether all variable sites within the CHROMs containing the targets should be returned, or just the targets themselves.
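
As a rough illustration of the input requirements above, the following sketch builds toy versions of V, targets, and consSeq and checks them against the documented constraints. The contig names, positions, sequences, and the extra marker column are all made up, and the checks only approximate what the text describes; they are not the package's own validation.

```r
# Hypothetical toy inputs; names, positions and sequences are invented.
V <- data.frame(
  CHROM = c("contig_1", "contig_1", "contig_2"),
  POS   = c(3L, 7L, 5L),
  REF   = c("G", "G", "A"),
  ALT   = c("A", "A,T", "C"),
  stringsAsFactors = FALSE
)
targets <- data.frame(
  CHROM  = "contig_1",
  POS    = 3L,
  marker = "snp_A",   # hypothetical extra column; extras are carried to the output
  stringsAsFactors = FALSE
)
consSeq <- data.frame(
  CHROM = c("contig_1", "contig_2"),
  Seq   = c("ACGTACGTAC", "TTGCATTGCA"),
  stringsAsFactors = FALSE
)

# Checks that mirror the argument descriptions
v_cols_ok  <- all(c("CHROM", "POS", "REF", "ALT") %in% names(V))
ref_ok     <- all(V$REF %in% c("A", "C", "G", "T"))
# one to three comma-separated alternate bases ("-" and "." are not covered here)
alt_ok     <- all(grepl("^[ACGT](,[ACGT]){0,2}$", V$ALT))
targets_ok <- all(c("CHROM", "POS") %in% names(targets)) &&
  !("LENGTH" %in% names(targets))
# assumption: "concordant" means each target matches a row of V on CHROM and POS
targets_in_V <- all(paste(targets$CHROM, targets$POS) %in% paste(V$CHROM, V$POS))
seq_ok     <- all(grepl("^[ACGT]+$", consSeq$Seq))
chrom_ok   <- all(targets$CHROM %in% consSeq$CHROM)

# If consSeq were NULL, V would need a numeric LENGTH column.
# Here it is derived from the toy sequences purely for illustration.
V$LENGTH <- nchar(consSeq$Seq)[match(V$CHROM, consSeq$CHROM)]
```

With inputs shaped like these, the call would take the form assayize(V, targets, consSeq), or assayize(V, targets) when relying on the LENGTH column instead of sequences.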

Value

Returns a data frame with all the columns in targets along with additional columns appended to it. Namely, you can expect the following columns.

CHROM

The contig identifier

POS

Position of the focal SNP within the contig

LENGTH

Length of the contig.

REF

The reference base at the site

ALT

The alternative base(s) at the site

LeftFlank

Number of SNP-free bases to the "left" of the focal SNP

RightFlank

Number of SNP-free bases to the "right" of the focal SNP

GC_content

Fraction of sites in the reference sequence that are G's or C's. Does not appear if consSeq is NULL

Designable

Logical indicating whether this focal SNP is designable given the criteria specified in reqDist, reqDistFromEnd, and, if consSeq is not NULL, also given GCmax.

BasesPresent

All the bases (or other variation) observed at the SNP in sorted order

IUPAC

The IUPAC code that describes the variation observed at the focal SNP

Seq

The sequence of the contig as prepared for assay order. Variation at non-focal SNPs is given by IUPAC codes while variation at the focal SNP is specified like so: [T/G]. This column is not present if consSeq is NULL.

Any other columns that were present in targets

Any extra columns in targets that do not duplicate the columns above are placed after CHROM and POS.
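
To make the LeftFlank, RightFlank, GC_content, and Designable columns described above more concrete, here is a minimal sketch of how such values could be computed. The helper names (snp_free_flanks, gc_fraction, is_designable) are hypothetical, and the boundary handling and the use of >= and <= are assumptions; the package may treat contig ends and threshold ties differently.

```r
# SNP-free bases on each side of every variant in one contig.
# Assumption: the contig ends bound the flanks of the first and last variant.
snp_free_flanks <- function(pos, len) {
  pos <- sort(pos)
  left_neighbor  <- c(0, head(pos, -1))        # position just before the contig start
  right_neighbor <- c(tail(pos, -1), len + 1)  # position just past the contig end
  data.frame(
    POS        = pos,
    LeftFlank  = pos - left_neighbor - 1,
    RightFlank = right_neighbor - pos - 1
  )
}

# Fraction of bases in a sequence string that are G or C.
gc_fraction <- function(seq) {
  bases <- strsplit(seq, "")[[1]]
  mean(bases %in% c("G", "C"))
}

# Combine the criteria; GC_content is skipped when NULL (as when consSeq is NULL).
is_designable <- function(LeftFlank, RightFlank, POS, LENGTH, GC_content = NULL,
                          reqDist = 20, reqDistFromEnd = 40, GCmax = 0.65) {
  ok <- LeftFlank >= reqDist & RightFlank >= reqDist &
    (POS - 1) >= reqDistFromEnd & (LENGTH - POS) >= reqDistFromEnd
  if (!is.null(GC_content)) ok <- ok & GC_content <= GCmax
  ok
}

# Toy contig of 120 bases with variants at positions 45, 80 and 100
flanks <- snp_free_flanks(pos = c(45, 80, 100), len = 120)
flanks$Designable <- is_designable(flanks$LeftFlank, flanks$RightFlank,
                                   flanks$POS, LENGTH = 120, GC_content = 0.5)
gc_example <- gc_fraction("ACGTGGCC")
```

In this toy contig only the variant at position 45 passes, because the variants at 80 and 100 sit fewer than 20 bases apart.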

If allVar is set to FALSE, then only rows corresponding to SNPs in targets will be returned. If allVar is TRUE, then all the SNPs in V that fall in the contigs containing the SNPs in targets will be returned (one row for each). In that case, the extra columns carried over from targets will be NA.
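
The IUPAC and Seq columns described above can be illustrated with a short sketch. The helpers iupac_code and mark_sequence are hypothetical names, not functions from the package, and the order of bases inside the brackets (reference first) is an assumption based on the [T/G] example. The lookup table itself uses the standard IUPAC nucleotide ambiguity codes.

```r
# Map a set of observed bases to its IUPAC ambiguity code.
iupac_code <- function(bases) {
  lookup <- c(A = "A", C = "C", G = "G", T = "T",
              AG = "R", CT = "Y", CG = "S", AT = "W", GT = "K", AC = "M",
              CGT = "B", AGT = "D", ACT = "H", ACG = "V", ACGT = "N")
  key <- paste(sort(unique(bases)), collapse = "")
  unname(lookup[key])   # NA for anything that is not a set of A, C, G, T
}

# Write the focal SNP as [REF/ALT] and non-focal variants as IUPAC codes.
mark_sequence <- function(seq, focal_pos, focal_ref, focal_alt,
                          other_pos = integer(0), other_codes = character(0)) {
  chars <- strsplit(seq, "")[[1]]
  chars[other_pos] <- other_codes
  chars[focal_pos] <- paste0("[", focal_ref, "/", focal_alt, "]")
  paste(chars, collapse = "")
}

# Toy contig: focal T/G SNP at position 4, a non-focal G/A SNP at position 7
marked <- mark_sequence("ACGTACGT",
                        focal_pos = 4, focal_ref = "T", focal_alt = "G",
                        other_pos = 7, other_codes = iupac_code(c("G", "A")))
```

Here marked holds "ACG[T/G]ACRT": the focal site is bracketed and the neighboring G/A site is collapsed to R.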

Examples

library(dplyr)

# get the vcf file:
vcf <- grab_vcf(system.file("textdata", "vcf.txt.gz", package = "snps2assays"))

# get the fasta file
fasta <- grab_fasta(system.file("textdata", "fasta.txt.gz", package = "snps2assays"))

# get our data frame of target SNPs
data(example_target_snps)

# now, assayize them!
assays <- assayize(vcf, example_target_snps, fasta)
assays

# here we run it without the consensus sequences.  The length of each contig is part
# of the RAD locus name (the 6th "_"-separated field), so we can use that:
vcf2 <- vcf %>%
 mutate(LENGTH = stringr::str_split(CHROM, "_") %>%
   lapply("[", 6) %>%
   unlist %>%
   as.numeric # Note that if LENGTH is not numeric assayize will throw an error!
   )

# now that vcf2 has a LENGTH column, we can assayize it
# and for fun, let's require more distance around the SNP
# and only return the target SNPs
assays2 <- assayize(vcf2, example_target_snps, reqDist = 25, allVar = FALSE)
assays2

eriqande/snps2assays documentation built on Oct. 9, 2020, 5:22 p.m.