There is one main function in this package assayize()
that takes data frame input
of the necessary ingredients (all variants, variants you want to target, and
consensus sequences) and then spits out information useful for ordering assays
and knowing whether they have a good chance of being designable.
There are also a few helper functions to get from common formats to the necessary data frames. For example, from VCF to a data frame of all the variants, or from FASTA to a data frame of consensus sequences. We demonstrate those in a series of workflows here.
Let's get going by loading the tidyverse and then loading the package
#library(tidyverse) library(magrittr) library(snps2assays)
Let's say that you have all your variants in a big gzipped (or not) VCF file. Here are the
top lines of the example VCF file that gets installed with the snps2assays
package. They will
run off the page, but you get the idea:
# print out the first two lines of data (after all the headers) con <- gzfile(system.file("textdata", "vcf.txt.gz", package = "snps2assays")) readLines(con, n = 20) %>% cat(sep = "\n") close(con)
To read a VCF file (it can be gzipped, but it can't be a BCF file!) you can use the grab_vcf
function
which will slurp the whole thing into a tbl_df
data frame.
grab_vcf(system.file("textdata", "vcf.txt.gz", package = "snps2assays"))
Note that this function searches for the header by trying to find a line that starts with "#CHROM", so, if your VCF files don't roll that way, you'll have problems.
To go along with that example VCF file we have 25 example target SNPs:
example_target_snps
They might be a little truncated in the view above.
Let's look at the example fasta file that comes with the package and that corresponds to the sequences in the example VCF file.
# print out the first 6 lines of the fasta file con <- gzfile(system.file("textdata", "fasta.txt.gz", package = "snps2assays")) readLines(con, n = 10) %>% cat(sep = "\n") close(con)
That is what the fasta file looks like. Here is how we can read it into a data frame
with grab_fasta
.
grab_fasta(system.file("textdata", "fasta.txt.gz", package = "snps2assays"))
The resulting data frame has columns CHROM
(should correspond to the CHROM
's in the VCF file),
and Seq
(the consensus sequence).
Now, we see how to put those together and toss everything into assayize()
.
# get the vcf file: vcf <- grab_vcf(system.file("textdata", "vcf.txt.gz", package = "snps2assays")) # get the fasta file fasta <- grab_fasta(system.file("textdata", "fasta.txt.gz", package = "snps2assays")) # get our data frame of target SNPs data(example_target_snps) # now, assayize them! assays <- assayize(vcf, example_target_snps, fasta) assays
If we don't have (or don't want to hassle getting) the consensus sequences, we can
still run assayize
with just the info you might have in the VCF file (though you can't
check GC Content or output the sequences for assay design). However, you need to
know the length of each contig. In our example data, that information is part
of the RAD locus name (it is the 6th underscore-separated field), so we can use that.
Let's make a new V data frame that has a LENGTH
column:
vcf2 <- vcf %>% dplyr::mutate(LENGTH = stringr::str_split(CHROM, "_") %>% lapply("[", 6) %>% unlist %>% as.numeric # Note that if LENGTH is not numeric, assayize will throw an error! )
Now that vcf2
has a LENGTH
column, we can assayize it.
And for fun, let's require more distance around the SNP
and only return the target SNPs:
assays2 <- assayize(vcf2, example_target_snps, reqDist = 25, allVar = FALSE) assays2
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.