Nothing
#' @title Convert vcfR to DNAbin
#' @name vcfR2DNAbin
#'
#' @rdname vcfR2DNAbin
#' @aliases vcfR2DNAbin
#'
#' @description
#' Convert objects of class vcfR to objects of class ape::DNAbin
#'
#' @param x an object of class chromR or vcfR
#' @param extract.indels logical indicating to remove indels (TRUE) or to include them while retaining alignment
#' @param consensus logical, indicates whether an IUPAC ambiguity code should be used for diploid heterozygotes
#' @param extract.haps logical specifying whether to separate each genotype into alleles based on a delimiting character
# @param gt.split character to delimit alleles within genotypes
#' @param unphased_as_NA logical indicating how to handle alleles in unphased genotypes
#' @param asterisk_as_del logical indicating that the asterisk allele should be converted to a deletion (TRUE) or NA (FALSE)
#' @param ref.seq reference sequence (DNAbin) for the region being converted
#' @param start.pos chromosomal position for the start of the ref.seq
#' @param verbose logical specifying whether to produce verbose output
#'
#' @details
#' Objects of class \strong{DNAbin}, from the package ape, store nucleotide sequence information.
#' Typically, nucleotide sequence information contains all the nucleotides within a region, for example, a gene.
#' Because most sites are typically invariant, this results in a large amount of redundant data.
#' This is why files in the vcf format only contain information on variant sites, it results in a smaller file.
#' Nucleotide sequences can be generated which only contain variant sites.
#' However, some applications require the invariant sites.
#' For example, inference of phylogeny based on maximum likelihood or Bayesian methods requires invariant sites.
#' The function vcfR2DNAbin therefore includes a number of options in attempt to accomodate various scenarios.
#'
#'
#' The presence of indels (insertions or deletions)in a sequence typically presents a data analysis problem.
#' Mutation models typically do not accomodate this data well.
#' For now, the only option is for indels to be omitted from the conversion of vcfR to DNAbin objects.
#' The option \strong{extract.indels} was included to remind us of this, and to provide a placeholder in case we wish to address this in the future.
#'
#'
#' The \strong{ploidy} of the samples is inferred from the first non-missing genotype.
# The option \code{gt.split} is used to split this genotype into alleles and these are counted.
# Values for \code{gt.split} are typically '|' for phased data or '/' for unphased data.
# Note that this option is an exact match and not used in a regular expression, as the 'sep' parameter in \code{\link{vcfR2genind}} is used.
#' All samples and all variants within each sample are assumed to be of the same ploid.
#'
#'
#' Conversion of \strong{haploid data} is fairly straight forward.
#' The options \code{consensus} and \code{extract.haps} are not relevant here.
#' When vcfR2DNAbin encounters missing data in the vcf data (NA) it is coded as an ambiguous nucleotide (n) in the DNAbin object.
#' When no reference sequence is provided (option \code{ref.seq}), a DNAbin object consisting only of variant sites is created.
#' When a reference sequence and a starting position are provided the entire sequence, including invariant sites, is returned.
#' The reference sequence is used as a starting point and variable sitees are added to this.
#' Because the data in the vcfR object will be using a chromosomal coordinate system, we need to tell the function where on this chromosome the reference sequence begins.
#'
#'
#' Conversion of \strong{diploid data} presents a number of scenarios.
#' When the option \code{consensus} is TRUE and \code{extract.haps} is FALSE, each genotype is split into two alleles and the two alleles are converted into their IUPAC ambiguity code.
#' This results in one sequence for each diploid sample.
#' This may be an appropriate path when you have unphased data.
#' Note that functions called downstream of this choice may handle IUPAC ambiguity codes in unexpected manners.
#' When extract.haps is set to TRUE, each genotype is split into two alleles.
#' These alleles are inserted into two sequences.
#' This results in two sequences per diploid sample.
#' Note that this really only makes sense if you have phased data.
#' The options ref.seq and start.pos are used as in halpoid data.
#'
#'
#' When a variant overlaps a deletion it may be encoded by an \strong{asterisk allele (*)}.
#' The GATK site covers this in a post on \href{https://gatk.broadinstitute.org/hc/en-us/articles/360035531912-Spanning-or-overlapping-deletions-allele-}{Spanning or overlapping deletions} ].
#' This is handled in vcfR by allowing the user to decide how it is handled with the paramenter \code{asterisk_as_del}.
#' When \code{asterisk_as_del} is TRUE this allele is converted into a deletion ('-').
#' When \code{asterisk_as_del} is FALSE the asterisk allele is converted to NA.
#' If \code{extract.indels} is set to FALSE it should override this decision.
#'
#'
#' Conversion of \strong{polyploid data} is currently not supported.
#' However, I have made some attempts at accomodating polyploid data.
#' If you have polyploid data and are interested in giving this a try, feel free.
#' But be prepared to scrutinize the output to make sure it appears reasonable.
#'
#'
#' Creation of DNAbin objects from large chromosomal regions may result in objects which occupy large amounts of memory.
#' If in doubt, begin by subsetting your data and the scale up to ensure you do not run out of memory.
#'
#'
#'
#'
#' @seealso
#' \href{https://cran.r-project.org/package=ape}{ape}
#'
#'
#' @examples
#' library(ape)
#' data(vcfR_test)
#'
#' # Create an example reference sequence.
#' nucs <- c('a','c','g','t')
#' set.seed(9)
#' myRef <- as.DNAbin(matrix(nucs[round(runif(n=20, min=0.5, max=4.5))], nrow=1))
#'
#' # Recode the POS data for a smaller example.
#' set.seed(99)
#' vcfR_test@fix[,'POS'] <- sort(sample(10:20, size=length(getPOS(vcfR_test))))
#'
#' # Just vcfR
#' myDNA <- vcfR2DNAbin(vcfR_test)
#' seg.sites(myDNA)
#' image(myDNA)
#'
#' # ref.seq, no start.pos
#' myDNA <- vcfR2DNAbin(vcfR_test, ref.seq = myRef)
#' seg.sites(myDNA)
#' image(myDNA)
#'
#' # ref.seq, start.pos = 4.
#' # Note that this is the same as the previous example but the variants are shifted.
#' myDNA <- vcfR2DNAbin(vcfR_test, ref.seq = myRef, start.pos = 4)
#' seg.sites(myDNA)
#' image(myDNA)
#'
#' # ref.seq, no start.pos, unphased_as_NA = FALSE
#' myDNA <- vcfR2DNAbin(vcfR_test, unphased_as_NA = FALSE, ref.seq = myRef)
#' seg.sites(myDNA)
#' image(myDNA)
#'
#'
#'
#' @export
vcfR2DNAbin <- function( x,
extract.indels = TRUE,
consensus = FALSE,
extract.haps = TRUE,
unphased_as_NA = TRUE,
asterisk_as_del = FALSE,
ref.seq = NULL,
start.pos = NULL,
verbose = TRUE )
{
# Sanitize input.
#if( class(x) == 'chromR' ){ x <- x@vcf }
if( inherits(x, 'chromR') ){
x <- x@vcf
}
#if( class(x) != 'vcfR' ){
if( !inherits(x, 'vcfR') ){
stop( "Expecting an object of class chromR or vcfR" )
}
if( consensus == TRUE & extract.haps == TRUE){
stop("consensus and extract_haps both set to TRUE. These options are incompatible. A haplotype should not be ambiguous.")
}
#if( !is.null(start.pos) & class(start.pos) == "character" ){
if( !is.null(start.pos) & inherits(start.pos, "character") ){
start.pos <- as.integer(start.pos)
}
if( extract.indels == FALSE & consensus == TRUE ){
msg <- "invalid selection: extract.indels set to FALSE and consensus set to TRUE."
msg <- c(msg, "There is no IUPAC ambiguity code for indels")
stop(msg)
}
# Check and sanitize ref.seq.
#if( class(ref.seq) != 'DNAbin' & !is.null(ref.seq) ){
if( !inherits(ref.seq, 'DNAbin') & !is.null(ref.seq) ){
stop( paste("expecting ref.seq to be of class DNAbin but it is of class", class(ref.seq)) )
}
if( is.list(ref.seq) ){
#ref.seq <- as.matrix(ref.seq)
ref.seq <- ref.seq[[1]]
}
if( is.matrix(ref.seq) ){
ref.seq <- ref.seq[1,]
ref.seq <- ref.seq[1:ncol(ref.seq)]
}
# If vector
# dna <- as.matrix(t(dna))
# Check start.pos
if( is.null(start.pos) & !is.null(ref.seq) ){
if( verbose == TRUE ){
warning("start.pos == NULL, this means that I do not know where the variants are located in the ref.seq. I'll try start.pos == 1, but results may be unexpected")
}
start.pos <- 1
}
# Extract indels.
# Currently the only option is TRUE.
if( extract.indels == TRUE ){
x <- extract.indels(x)
if( verbose == TRUE ){
message(paste("After extracting indels,", nrow(x), "variants remain."))
}
} else {
# stop("extract.indels == FALSE is not currently implemented.")
# Make alleles at each locus the same length
# https://stackoverflow.com/a/36136878
equal_allele_len <- function(x){
alleles <- c(myRef[x], unlist(strsplit(myAlt[x], split = ",")))
alleles[is.na(alleles)] <- 'n'
alleles <- format(alleles, width=max(nchar(alleles)))
alleles <- gsub("\\s", "-", alleles)
myRef[x] <<- alleles[1]
myAlt[x] <<- paste(alleles[2:length(alleles)], collapse=",")
invisible()
}
myRef <- getREF(x)
myAlt <- getALT(x)
invisible(lapply(1:length(myRef), equal_allele_len))
x@fix[,'REF'] <- myRef
x@fix[,'ALT'] <- myAlt
}
# Save POS in case we need it.
# i.e., for inserting variants into a matrix.
# pos <- as.numeric(x@fix[,'POS'])
pos <- getPOS(x)
# If we think we have variants we should extract them.
# Our GT matrix may contain zero rows, all NA or data.
# Check for zero rows.
if( nrow(x@fix) == 0 ){
# Create an empty matrix.
x <- x@gt[ 0, -1 ]
} else if( sum(!is.na(x@gt[,-1])) == 0 ){
# Check for all NA.
# Case of zero rows will sum to zero here.
# Create an empty matrix.
x <- x@gt[ 0, -1 ]
} else {
# If x is still of class vcfR, we should process it.
# if( class(x) == "vcfR" ){
# first.gt <- x@gt[ ,-1 ][ !is.na(x@gt[,-1]) ][1]
if( consensus == TRUE & extract.haps == FALSE ){
# x <- extract.gt( x, return.alleles = TRUE, allele.sep = gt.split )
# x <- alleles2consensus( x, sep = gt.split )
x <- extract.gt( x, return.alleles = TRUE )
x <- alleles2consensus( x )
} else {
# x <- extract.haps( x, gt.split = gt.split, verbose = verbose )
x <- extract.haps( x, unphased_as_NA = unphased_as_NA, verbose = verbose )
}
}
if( asterisk_as_del == TRUE){
x[ x == "*" & !is.na(x) ] <- '-'
} else {
x[ x == "*" & !is.na(x) ] <- 'n'
}
# Data could be haploid, diploid or higher ploid.
# x should be a matrix of variants by here.
# Return full sequence when ref.seq is not NULL
if( is.null(ref.seq) == FALSE ){
# Create a matrix of nucleotides.
# The number of columns should match the number
# of columns in x (i.e., number of haplotypes).
# The number of rows should match the reference
# sequence length.
# The matrix will be initialized with the
# reference and will have no variants.
variants <- x
x <- matrix( as.character(ref.seq),
nrow = length(ref.seq),
ncol = length(colnames(x)),
byrow = FALSE
)
colnames(x) <- colnames(variants)
# Populate matrix of reference sequences with variants.
# We need to subset the variant data to the region of interest.
# First we remove variants above the region.
# Then we remove variants below this region.
# Then we rescale the region to be one-based.
# variants <- variants[ pos < start.pos + dim(ref.seq)[2], , drop = FALSE]
# pos <- pos[ pos < start.pos + dim(ref.seq)[2] ]
variants <- variants[ pos < start.pos + length(ref.seq), , drop = FALSE]
pos <- pos[ pos < start.pos + length(ref.seq) ]
variants <- variants[ pos >= start.pos, , drop = FALSE]
pos <- pos[ pos >= start.pos ]
pos <- pos - start.pos + 1
x[pos,] <- variants
}
# Convert NA to n
x[ is.na(x) ] <- 'n'
# DNAbin characters must be lower case.
# tolower requires dim(X) to be positive.
if( nrow(x) > 0 ){
# x <- apply( x, MARGIN=2, tolower )
x <- tolower( x )
}
# Convert matrix to DNAbin
if( extract.indels == FALSE ){
# Indel strings need to be split into characters
x <- apply(x, MARGIN=2, function(x){ unlist(strsplit(x,"")) })
}
x <- ape::as.DNAbin(t(x))
return(x)
}
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.