There is one main function in this package assayize() that takes data frame input of the necessary ingredients (all variants, variants you want to target, and consensus sequences) and then spits out information useful for ordering assays and knowing whether they have a good chance of being designable.

There are also a few helper functions to get from common formats to the necessary data frames. For example, from VCF to a data frame of all the variants, or from FASTA to a data frame of consensus sequences. We demonstrate those in a series of workflows here.

Let's get going by loading the tidyverse and then loading the package

#library(tidyverse)
library(magrittr)
library(snps2assays)

Some Handy Functions and Example Data Sets

grab_vcf()

Let's say that you have all your variants in a big gzipped (or not) VCF file. Here are the top lines of the example VCF file that gets installed with the snps2assays package. They will run off the page, but you get the idea:

# print out the first two lines of data (after all the headers)
con <- gzfile(system.file("textdata", "vcf.txt.gz", package = "snps2assays"))
readLines(con, n = 20) %>%
  cat(sep = "\n")
close(con)

To read a VCF file (it can be gzipped, but it can't be a BCF file!) you can use the grab_vcf function which will slurp the whole thing into a tbl_df data frame.

grab_vcf(system.file("textdata", "vcf.txt.gz", package = "snps2assays"))

Note that this function searches for the header by trying to find a line that starts with "#CHROM", so, if your VCF files don't roll that way, you'll have problems.

Target SNPs

To go along with that example VCF file we have 25 example target SNPs:

example_target_snps

They might be a little truncated in the view above.

grab_fasta()

Let's look at the example fasta file that comes with the package and that corresponds to the sequences in the example VCF file.

# print out the first 6 lines of the fasta file
con <- gzfile(system.file("textdata", "fasta.txt.gz", package = "snps2assays"))
readLines(con, n = 10) %>%
  cat(sep = "\n")
close(con)

That is what the fasta file looks like. Here is how we can read it into a data frame with grab_fasta.

grab_fasta(system.file("textdata", "fasta.txt.gz", package = "snps2assays"))

The resulting data frame has columns CHROM (should correspond to the CHROM's in the VCF file), and Seq (the consensus sequence).

Putting it All Together

Doing it if you have the sequences

Now, we see how to put those together and toss everything into assayize().

# get the vcf file:
vcf <- grab_vcf(system.file("textdata", "vcf.txt.gz", package = "snps2assays"))

# get the fasta file
fasta <- grab_fasta(system.file("textdata", "fasta.txt.gz", package = "snps2assays"))

# get our data frame of target SNPs
data(example_target_snps)

# now, assayize them!
assays <- assayize(vcf, example_target_snps, fasta)
assays

Doing it with just the VCF information and lengths of contigs

If we don't have (or don't want to hassle getting) the consensus sequences, we can still run assayize with just the info you might have in the VCF file (though you can't check GC Content or output the sequences for assay design). However, you need to know the length of each contig. In our example data, that information is part of the RAD locus name (it is the 6th underscore-separated field), so we can use that.

Let's make a new V data frame that has a LENGTH column:

vcf2 <- vcf %>%
  dplyr::mutate(LENGTH = stringr::str_split(CHROM, "_") %>%
           lapply("[", 6) %>%
           unlist %>%
           as.numeric # Note that if LENGTH is not numeric, assayize will throw an error!
  )

Now that vcf2 has a LENGTH column, we can assayize it. And for fun, let's require more distance around the SNP and only return the target SNPs:

assays2 <- assayize(vcf2, example_target_snps, reqDist = 25, allVar = FALSE)
assays2


eriqande/snps2assays documentation built on Oct. 9, 2020, 5:22 p.m.