This function takes, as input, data frames of:
all the known variation in different (typically short) contigs,
the SNPs you want to turn into assays,
the sequences of all the contigs (or at least the ones that you want to turn into assays).
It returns information needed for screening variation and ordering SNP assays.
V
A data frame of all the variants detected in sequences from individuals. Typically this will be from a VCF file (maybe just the first few columns). It can have any number of columns, but it must have the following:
Any columns other than
targets
A data frame of the variants that are targets for assay development. It can also contain any other columns you like, but it must have CHROM and POS, concordant with
consSeq
A data frame of consensus sequences. Must have one column
reqDist
The required minimum number of bases between a target SNP and the nearest flanking SNP for a target to be designable (according to whomever's criterion).
reqDistFromEnd
The required number of bases between the SNP site and the start or end of the contig for the SNP to be designable as an assay.
GCmax
The maximum GC content of the reference contig allowable for an assay to still be considered designable.
allVar
Logical indicating whether all variable sites within the CHROMs containing the targets should be returned, or just the targets themselves.
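To make the roles of reqDist, reqDistFromEnd, and GCmax concrete, here is a minimal sketch of how a designability check along these lines could be written. It is not the package's implementation: the function name is_designable_sketch and its argument names are hypothetical, and the use of "at least" (>=) and "at most" (<=) comparisons is an assumption.

```r
# Hypothetical helper, NOT part of snps2assays: illustrates the three criteria.
# left_free / right_free: SNP-free bases on either side of the focal SNP
# pos / contig_len:       SNP position and contig length (1-based)
# gc:                     GC fraction of the contig, or NA if no sequence is available
is_designable_sketch <- function(left_free, right_free, pos, contig_len, gc,
                                 reqDist, reqDistFromEnd, GCmax) {
  # enough SNP-free sequence on both sides (assumed inclusive comparison)
  flank_ok <- left_free >= reqDist & right_free >= reqDist
  # far enough from both ends of the contig (assumed inclusive comparison)
  end_ok <- (pos - 1) >= reqDistFromEnd & (contig_len - pos) >= reqDistFromEnd
  # the GC criterion only applies when a GC fraction is available
  gc_ok <- is.na(gc) | gc <= GCmax
  flank_ok & end_ok & gc_ok
}

# A SNP at position 50 of a 100-base contig with 30 SNP-free bases on each side
is_designable_sketch(30, 30, 50, 100, gc = 0.5,
                     reqDist = 25, reqDistFromEnd = 40, GCmax = 0.65)
```

Passing gc = NA mimics the situation where consSeq is NULL, in which case only the two distance criteria are applied.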
Returns a data frame with all the columns in targets, along with additional columns appended to it. Namely, you can expect the following columns:
The contig identifier.
Position of the focal SNP within the contig.
Length of the contig.
The reference base at the site.
The alternative base(s) at the site.
Number of SNP-free bases to the "left" of the focal SNP.
Number of SNP-free bases to the "right" of the focal SNP.
Fraction of sites in the reference sequence that are G's or C's. Does not appear if consSeq is NULL.
Logical indicating whether this focal SNP is designable given the criteria specified in reqDist, reqDistFromEnd, and, if consSeq is not NULL, also given GCmax.
All the bases (or other variation) observed at the SNP, in sorted order.
The IUPAC code that describes the variation observed at the focal SNP.
The sequence of the contig as prepared for assay order. Variation at non-focal SNPs is given by IUPAC codes, while variation at the focal SNP is specified like so: [T/G]. This column is not present if consSeq is NULL.
Any extra columns in targets that do not duplicate the columns above are placed after CHROM and POS.
If allVar is set to FALSE, then only rows corresponding to SNPs in targets will be returned. If allVar is TRUE, then all the SNPs in V that fall in the contigs within which the SNPs in targets fall will be returned (one row for each). In that case, the extra columns carried over from targets will be NA.
library(dplyr)

# get the vcf file:
vcf <- grab_vcf(system.file("textdata", "vcf.txt.gz", package = "snps2assays"))

# get the fasta file
fasta <- grab_fasta(system.file("textdata", "fasta.txt.gz", package = "snps2assays"))

# get our data frame of target SNPs
data(example_target_snps)

# now, assayize them!
assays <- assayize(vcf, example_target_snps, fasta)
assays

# here we run it without the consensus sequences. The length of each contig is part
# of the RAD locus name (the 6th "_"-separated field), so we can use that:
vcf2 <- vcf %>%
  mutate(LENGTH = stringr::str_split(CHROM, "_") %>%
           lapply("[", 6) %>%
           unlist %>%
           as.numeric  # Note that if LENGTH is not numeric assayize will throw an error!
  )

# now that vcf2 has a LENGTH column, we can assayize it
# and for fun, let's require more distance around the SNP
# and only return the target SNPs
assays2 <- assayize(vcf2, example_target_snps, reqDist = 25, allVar = FALSE)
assays2
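
The assay-ready sequence described above (IUPAC codes at non-focal SNPs, the focal SNP written like [T/G]) can be illustrated with a small base-R sketch. This is only an illustration under stated assumptions, not the code used by assayize: the helper names iupac_code and mark_assay_seq are hypothetical, the toy contig is made up, and the order of the bases inside the brackets (reference first, alternate second) is assumed.

```r
# Standard IUPAC nucleotide ambiguity codes, keyed by the sorted set of bases
iupac_lookup <- c(A = "A", C = "C", G = "G", T = "T",
                  AC = "M", AG = "R", AT = "W", CG = "S", CT = "Y", GT = "K",
                  ACG = "V", ACT = "H", AGT = "D", CGT = "B", ACGT = "N")

# Hypothetical helper: IUPAC code for a character vector of observed bases
iupac_code <- function(alleles) {
  key <- paste(sort(unique(toupper(alleles))), collapse = "")
  unname(iupac_lookup[key])
}

# Hypothetical helper: write non-focal SNPs as IUPAC codes and the focal SNP
# as [ref/alt]. `flank_pos` and `flank_alleles` describe the non-focal SNPs.
mark_assay_seq <- function(contig, pos, ref, alt,
                           flank_pos = integer(0), flank_alleles = list()) {
  for (i in seq_along(flank_pos)) {
    # replace the single base at each non-focal SNP with its IUPAC code
    substr(contig, flank_pos[i], flank_pos[i]) <- iupac_code(flank_alleles[[i]])
  }
  paste0(substr(contig, 1, pos - 1),
         "[", ref, "/", alt, "]",
         substr(contig, pos + 1, nchar(contig)))
}

# Toy contig: focal T/G SNP at position 5, a flanking T/C SNP at position 8
mark_assay_seq("ACGTTCGTAC", pos = 5, ref = "T", alt = "G",
               flank_pos = 8, flank_alleles = list(c("T", "C")))
# [1] "ACGT[T/G]CGYAC"
```

Replacing flanking sites before inserting the brackets keeps the position arithmetic simple, since the bracketed focal SNP changes the length of the string.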