dart2nexus | R Documentation |
This function was developed to convert Dart sequencing data into phased allele
sequences. As such, the formatting of the data is expected to be what is
provided by the dartR package. That is, the data used as input are generally
generated using dartR::gl.read.dart() and processed (filtered) in dartR.
Custom data can be used with this function as long as they are formatted in a
compatible way. At a minimum, the last three characters of locNames(gl)
must contain the reference and alternative bases of the SNP separated by a
forward slash '/', and the genlight object must have a loc.metrics element
with at least the following headings:
CloneID
AlleleSequence
TrimmedSequence
SnpPosition
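As a rough illustration of these minimum requirements, the sketch below checks a vector of locus names and a loc.metrics data frame using base R only. The helper check_dart_format() and the toy locus names are hypothetical and are not part of dart2nexus; with a real genlight object one would pass locNames(gl) and its loc.metrics element instead.

```r
# Hypothetical helper (not part of dart2nexus): check the minimum format
# described above using only base R.
check_dart_format <- function(loc_names, loc_metrics) {
  required <- c("CloneID", "AlleleSequence", "TrimmedSequence", "SnpPosition")
  # The last three characters must look like "A/G": base, slash, base
  last3 <- substr(loc_names, nchar(loc_names) - 2, nchar(loc_names))
  bad_names <- loc_names[!grepl("^[ACGT]/[ACGT]$", last3)]
  missing_cols <- setdiff(required, names(loc_metrics))
  list(ok = length(bad_names) == 0 && length(missing_cols) == 0,
       bad_names = bad_names,
       missing_cols = missing_cols)
}

# Toy data; the locus names below are made up for illustration
loc_names <- c("100001-12-A/G", "100002-40-C/T", "100003-7-AG")
loc_metrics <- data.frame(CloneID = c(100001, 100002, 100003),
                          AlleleSequence = c("ACGT", "CCGT", "TTGA"),
                          SnpPosition = c(12, 40, 7))
res <- check_dart_format(loc_names, loc_metrics)
```

Here res$ok is FALSE because one locus name lacks the slash and the TrimmedSequence column is missing.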
dart2nexus(
gl,
dir.in = NULL,
min.nSNPs = 3,
minAbund = NULL,
minLen = 77,
truncQ = 20,
minQ = 25,
dir.out = NULL,
singleAllele = TRUE,
dada = TRUE,
nCPUs = "auto",
nex.out = "phasedAln.nex"
)
gl |
The genlight object with the processed data |
dir.in |
Character vector with the path to the directory where the fastq
and targets.csv files are located. If |
min.nSNPs |
Integer indicating the minimum number of SNPs that a locus has to have to be retained |
minAbund |
Either NULL (default) or the minimum number of identical reads that an allele from the raw sequences needs to have to be retained |
minLen |
Minimum length of reads to keep when applying the filter |
truncQ |
Truncate reads at the first instance of a quality score less
than or equal to truncQ when conducting quality filtering. See
|
dir.out |
Character vector with the name of the directory in which to save the results |
singleAllele |
Whether only one random allele should be selected for each sample (TRUE), or both (FALSE) |
dada |
Logical. Should the dada analysis be conducted? (default: TRUE) |
nCPUs |
Integer for the number of CPUs to use (for parallel computation) or "auto" to automatically call all available CPUs. If 1, no parallel computation. |
nex.out |
The name of the nexus file. Default: "phasedAln.nex" |
Here, I define a 'locus' as one segment (read) from the sequencing data. Loci are generally identified by CloneID in Dart data. One locus can contain multiple SNPs, and for phylogenetic analyses often only loci with multiple SNPs are of interest because they contain more information (see Trucchi et al. 2014).
The output of this function is a nexus alignment with the concatenated sequences of the alleles. Once generated, these data can be used for phylogenetic analyses in software such as BEAST (ref). At the bottom of the nexus file there is a series of charset statements, so that the alignment can be partitioned by locus.
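To illustrate how charset statements relate to a concatenated alignment, the sketch below derives start and end positions from per-locus lengths. The locus names and lengths are hypothetical, and the layout follows the general NEXUS sets-block convention; the exact text written by dart2nexus may differ.

```r
# Hypothetical locus names and aligned lengths (in bp); not from real data
locus_len <- c(locus100001 = 69L, locus100002 = 77L, locus100003 = 54L)

# Each locus occupies a contiguous block of the concatenated alignment
end_pos <- cumsum(locus_len)
start_pos <- end_pos - locus_len + 1L

# One charset statement per locus, wrapped in a sets block
charsets <- sprintf("charset %s = %d-%d;", names(locus_len), start_pos, end_pos)
sets_block <- c("begin sets;", paste0("  ", charsets), "end;")
writeLines(sets_block)
```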
dart2nexus initially uses information from the genotypes to create all
possible allele sequences. If, within any given read, no more than one SNP is
heterozygous, there are at most two possible alleles. If instead there is
more than one heterozygous SNP, raw sequence data need to be provided to
resolve the phase of the alleles. If raw sequences are not available, IUPAC
ambiguity codes will be used.
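The logic of this first step can be sketched in base R as follows. This is a simplified illustration, not the code used by dart2nexus: the helper build_alleles() is hypothetical, SNP positions are assumed to be 1-based, and genotypes are assumed to be coded as 0 (homozygous reference), 1 (heterozygous) and 2 (homozygous alternative).

```r
# IUPAC codes for the two-base ambiguities
iupac <- c(AC = "M", AG = "R", AT = "W", CG = "S", CT = "Y", GT = "K")

# Hypothetical helper: build the allele sequences for one read.
# ref_seq: reference sequence; pos: 1-based SNP positions (assumption);
# ref, alt: bases at each SNP; geno: 0 = hom ref, 1 = het, 2 = hom alt
build_alleles <- function(ref_seq, pos, ref, alt, geno) {
  a1 <- a2 <- strsplit(ref_seq, "")[[1]]
  n_het <- sum(geno == 1)
  for (i in seq_along(pos)) {
    if (geno[i] == 2) {
      # Homozygous alternative: both alleles carry the alternative base
      a1[pos[i]] <- a2[pos[i]] <- alt[i]
    } else if (geno[i] == 1 && n_het == 1) {
      # A single heterozygous SNP: the phase is unambiguous
      a2[pos[i]] <- alt[i]
    } else if (geno[i] == 1) {
      # Several heterozygous SNPs and no reads: use an ambiguity code
      key <- paste(sort(c(ref[i], alt[i])), collapse = "")
      a1[pos[i]] <- a2[pos[i]] <- iupac[[key]]
    }
  }
  c(allele1 = paste(a1, collapse = ""), allele2 = paste(a2, collapse = ""))
}

# One heterozygous SNP: two distinct, fully resolved alleles
one_het <- build_alleles("ACGTACGT", pos = c(2, 6), ref = c("C", "C"),
                         alt = c("T", "G"), geno = c(1, 2))
# Two heterozygous SNPs: phase unknown, so ambiguity codes are inserted
two_het <- build_alleles("ACGTACGT", pos = c(2, 6), ref = c("C", "C"),
                         alt = c("T", "G"), geno = c(1, 1))
```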
Where raw sequence data are available, these are read and processed using a
combination of R packages. Most importantly, filtering of the sequences is
done using dada2::fastqFilter. Even when the sequences are provided, there
might be situations where multiple reads are possible candidates for the
alleles. This may happen when spurious reads are retained after filtering.
There is no clear way to replicate Dart sequence processing, and adjusting
the settings may help in removing these spurious reads. In the limited testing
I have conducted, the default settings seem to work adequately in most cases,
but there is no guarantee. When there are multiple candidate sequences, an
assumption is made that the two most abundant sequences are the correct ones.
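The "two most abundant sequences" assumption can be illustrated with a few lines of base R. The reads below are made up, and this is only a sketch of the idea rather than the code used internally.

```r
# Hypothetical filtered reads for one sample at one locus
reads <- c("ACGTTA", "ACGTTA", "ACGTTA", "ACGCTA", "ACGCTA", "ACGTTG")

# Count identical reads and sort from most to least abundant
abund <- sort(table(reads), decreasing = TRUE)

# Keep at most the two most abundant sequences as the candidate alleles
candidates <- names(abund)[seq_len(min(2, length(abund)))]
```

In this toy case the single-copy read "ACGTTG" is treated as spurious and dropped.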
After filtering the sequences with dada2::fastqFilter, it is possible to
remove all the (unique) reads whose abundance is below a minimum threshold,
set with minAbund.
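A minimal sketch of what an abundance threshold does, assuming reads are compared as exact strings. The helper filter_by_abundance() is hypothetical and only mirrors the documented behaviour of minAbund (NULL means no abundance filtering).

```r
# Hypothetical helper: drop unique reads seen fewer than minAbund times
filter_by_abundance <- function(reads, minAbund = NULL) {
  abund <- table(reads)
  # NULL (the default) keeps every unique read
  if (is.null(minAbund)) return(abund)
  abund[abund >= minAbund]
}

# Made-up reads for one sample at one locus
reads <- c("ACGTTA", "ACGTTA", "ACGTTA", "ACGCTA", "ACGCTA", "ACGTTG")
kept_all <- filter_by_abundance(reads)
kept_min2 <- filter_by_abundance(reads, minAbund = 2)
```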
It is also possible to include an additional step using dada2::dada (see
?dada2::dada for more information) after filtering. This is achieved by
setting dada=TRUE. In the testing done so far, this does not seem to help
much: while it may help resolve a few loci, it seems to leave many more loci
with missing data, and it should be considered somewhat experimental. I found
that using minAbund often gives the same improvement without leaving other
loci with missing data.
If sequences are provided, dart2nexus also expects to find, in the same
location, the .csv files that map the fastq files of the sequences to the
sample names that are listed in gl$other$loc.metrics. In the .csv, the fastq
file names are generally identified as targetid. The sample IDs are typically
in a column named genotype. These files also contain the barcode sequence.
The same sample may have sequences in multiple targetid.
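The mapping described above can be pictured with a small base R sketch. The .csv content is a made-up stand-in that uses only the three column names mentioned here; real files contain more columns, and how a targetid translates into a fastq file name on disk is not covered.

```r
# Toy stand-in for a targets .csv; values are invented for illustration
csv_text <- "targetid,genotype,barcode
1001,sampleA,ACGTAC
1002,sampleB,TTGCAA
1003,sampleA,GGATCC"
targets <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# One sample can map to several targetid values, so group them by sample
fastq_by_sample <- split(targets$targetid, targets$genotype)
```

Here sampleA maps to two targetid values, so the reads from both would need to be pooled for that sample.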
Writes a nexus file in dir.out with the name given by nex.out and returns a
list with the following elements:
freqSNPs A table with the frequencies of the number of SNPs in each locus
SampleMultAlleles The name of the samples with multiple (i.e. >2) possible alleles and their number
Allele1 The concatenated sequence of the first allele
Allele2 The concatenated sequence of the second allele
Trucchi, E., P. Gratton, J. D. Whittington, R. Cristofari, Y. Le Maho, N. C. Stenseth and C. Le Bohec, 2014: King penguin demography since the last glaciation inferred from genome-wide data. Proceedings of the Royal Society B: Biological Sciences, 281, 20140528.