dart2nexus: Convert SNPs data from a genlight object (from Dart...

dart2nexus

R Documentation

Convert SNPs data from a genlight object (from Dart sequencing) to phased alleles

Description

This function was developed to convert Dart sequencing data into phased allele sequences. As such, the data are expected to be formatted as provided by the dartR package. That is, the input data are generally generated using dartR::gl.read.dart() and processed (filtered) in dartR. Custom data can be used with this function as long as they are formatted in a compatible way. At a minimum, the last three characters of each element of locNames(gl) must hold the base and alternative SNP alleles separated by a forward slash '/' (for example, 'A/G'), and the genlight object must have a loc.metrics element with at least the following headings:

CloneID
AlleleSequence
TrimmedSequence
SnpPosition
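
The minimum format described above can be checked before calling dart2nexus. The following is a minimal sketch in base R; `check_dart_format` is a hypothetical helper, not part of amplicR, and the toy vectors stand in for what would normally be `locNames(gl)` and the loc.metrics data frame of a real genlight object. It assumes the SNP alleles are single unambiguous bases (A, C, G or T).

```r
# Hypothetical helper (not part of amplicR): checks the minimum input format.
check_dart_format <- function(loc_names, loc_metrics) {
  required <- c("CloneID", "AlleleSequence", "TrimmedSequence", "SnpPosition")
  # Required headings that are absent from loc.metrics
  missing_cols <- setdiff(required, names(loc_metrics))
  # The last three characters of each locus name should look like "A/G"
  last3 <- substr(loc_names, nchar(loc_names) - 2, nchar(loc_names))
  bad_names <- loc_names[!grepl("^[ACGT]/[ACGT]$", last3)]
  list(missing_cols = missing_cols, bad_names = bad_names)
}

# Toy data (assumed values, for illustration only)
loc_names <- c("100001-12-A/G", "100002-30-C/T", "100003-5-AG")
loc_metrics <- data.frame(
  CloneID = c(100001, 100002, 100003),
  AlleleSequence = c("ACGT", "ACGT", "ACGT"),
  SnpPosition = c(12, 30, 5)
)

res <- check_dart_format(loc_names, loc_metrics)
res$missing_cols  # headings that still need to be added
res$bad_names     # locus names that do not end in base/alternative
```

An empty `missing_cols` and an empty `bad_names` would suggest that custom data meet the minimum requirements listed here; it does not guarantee that the rest of the object is compatible.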

Usage

dart2nexus(
  gl,
  dir.in = NULL,
  min.nSNPs = 3,
  minAbund = NULL,
  minLen = 77,
  truncQ = 20,
  minQ = 25,
  dir.out = NULL,
  singleAllele = TRUE,
  dada = TRUE,
  nCPUs = "auto",
  nex.out = "phasedAln.nex"
)

Arguments

`gl`	The genlight object with the processed data
`dir.in`	Character vector with the path to the directory where the fastq and targets.csv files are located. If `NULL`, an interactive pop-up window will be used to select the directory. If `NA`, no sequences are used.
`min.nSNPs`	Integer indicating the minimum number of SNPs that a locus has to have to be retained
`minAbund`	Either NULL (default) or the minimum number of identical reads that an allele from the raw sequences needs to have to be retained
`minLen`	Minimum length of reads to keep when applying the filter
`truncQ`	Truncate reads at the first instance of a quality score less than or equal to truncQ when conducting quality filtering. See `fastqFilter` for details
`minQ`	Minimum quality score used when conducting quality filtering. See `fastqFilter` for details
`dir.out`	Character vector with the name of the directory where to save the results
`singleAllele`	Whether only one random allele should be selected for each sample (TRUE), or both (FALSE)
`dada`	Logical. Should the dada analysis be conducted? (default `TRUE`)
`nCPUs`	Integer for the number of CPUs to use (for parallel computation) or "auto" to automatically call all available CPUs. If 1, no parallel computation.
`nex.out`	The name of the nexus file. Default: "phasedAln.nex"

Details

Here, I define a 'locus' as one segment (read) from the sequencing data. Loci are generally identified by CloneID in Dart data. One locus can contain multiple SNPs and, for phylogenetic analyses, often only loci with multiple SNPs are of interest because they contain more information (see Trucchi et al. 2014).

The output of this function is a nexus alignment with the concatenated sequences of the alleles. Once generated, these data can be used for phylogenetic analyses in software such as BEAST (ref). At the bottom of the nexus file there is a series of charset statements, so that the alignment can be partitioned by locus.
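
To illustrate how charset statements partition a concatenated alignment, here is a minimal sketch in base R. The locus names and lengths are made up, and the exact layout written by dart2nexus is not shown in this page, so the block below should be read as an example of the general nexus convention rather than the function's actual output.

```r
# Hypothetical locus lengths (in bases), named by locus
locus_len <- c(L100001 = 69L, L100002 = 69L, L100003 = 50L)

# End and start coordinates of each locus in the concatenated alignment
ends <- cumsum(locus_len)
starts <- ends - locus_len + 1L

# One charset statement per locus, wrapped in a nexus sets block
charsets <- sprintf("charset %s = %d-%d;", names(locus_len), starts, ends)
sets_block <- c("begin sets;", paste0("  ", charsets), "end;")
cat(sets_block, sep = "\n")
```

Each charset covers a contiguous range of alignment columns, so downstream software can assign a separate partition to each locus.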

dart2nexus initially uses information from the genotypes to create all possible allele sequences. If, within any given read, no more than one SNP is heterozygous, there are at most two possible alleles. If instead there is more than one heterozygous SNP, raw sequence data need to be provided to resolve the phase of the alleles. If raw sequences are not available, IUPAC ambiguity codes will be used.
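
The reasoning above can be sketched for the SNP positions of a single locus. This is a simplified illustration in base R, not the code used by dart2nexus: `snp_alleles` is a hypothetical helper, it works on the SNP bases only rather than on the full read, and it assumes genotypes are coded as the number of alternative alleles (0 = homozygous base allele, 1 = heterozygous, 2 = homozygous alternative).

```r
# IUPAC ambiguity codes for two-base combinations (bases sorted alphabetically)
iupac <- c(AC = "M", AG = "R", AT = "W", CG = "S", CT = "Y", GT = "K")

# Hypothetical helper: ref and alt are the bases of each SNP in one locus,
# geno is the genotype of one sample at those SNPs (0, 1 or 2)
snp_alleles <- function(ref, alt, geno) {
  het <- geno == 1
  n_het <- sum(het)
  # With at most one heterozygous SNP the two alleles are unambiguous
  a1 <- ifelse(geno == 2, alt, ref)
  a2 <- ifelse(geno == 0, ref, alt)
  if (n_het > 1) {
    # Phase is unknown without reads: use an ambiguity code at each het site
    key <- vapply(which(het),
                  function(i) paste(sort(c(ref[i], alt[i])), collapse = ""),
                  character(1))
    a1[het] <- a2[het] <- unname(iupac[key])
  }
  list(n_het = n_het,
       allele1 = paste(a1, collapse = ""),
       allele2 = paste(a2, collapse = ""))
}

ref <- c("A", "C", "G")
alt <- c("G", "T", "A")

one_het <- snp_alleles(ref, alt, geno = c(1, 0, 2))  # phase is known
two_het <- snp_alleles(ref, alt, geno = c(1, 1, 0))  # phase is ambiguous
```

With one heterozygous SNP the two alleles differ at that site only. With two heterozygous SNPs there are four possible haplotypes, which is why reads are needed to decide which two are actually present.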

Where raw sequence data are available, these are read and processed using a combination of R packages. Most importantly, filtering of the sequences is done using dada2::fastqFilter. Even when the sequences are provided, there might be situations where multiple reads are possible candidates for the alleles. This may happen when spurious reads are retained after filtering. There is no clear way to replicate Dart sequence processing, and adjusting the settings may help in removing these spurious reads. In the limited testing I have conducted, the default settings seem to work adequately in most cases, but there is no guarantee. When there are multiple candidate sequences, an assumption is made that the two most abundant sequences are the correct ones.

After filtering the sequences with dada2::fastqFilter, it is possible to remove all the (unique) reads that do not reach a minimum abundance threshold, set with minAbund.
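
The two abundance-based steps described above (dropping unique reads below minAbund, then taking the two most abundant candidates) can be sketched as follows. This is a minimal illustration in base R with made-up reads; the variable names are hypothetical and the block is not the code used inside dart2nexus.

```r
# Toy reads for one sample at one locus (assumed data)
reads <- c(rep("ACGTA", 40), rep("ACGTG", 35), rep("ACTTA", 3), rep("CCGTA", 1))

# Abundance of each unique read, most abundant first
abund <- sort(table(reads), decreasing = TRUE)

# Drop unique reads that occur fewer than min_abund times
min_abund <- 2
kept <- abund[abund >= min_abund]

# Assume the two most abundant remaining reads are the true alleles
candidates <- names(kept)[seq_len(min(2, length(kept)))]
candidates
```

In this toy case the singleton read is removed by the threshold, while the low-abundance read "ACTTA" survives it and is only excluded by the two-most-abundant rule.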

It is also possible to include an additional step using dada2::dada (see ?dada2::dada for more information) after filtering. This is achieved by setting dada=TRUE. In the testing done, this doesn't seem to help much: while it may help resolve a few loci, it seems to leave many more loci with missing data, and it should be considered somewhat experimental. I found that using minAbund often gives the same improvement without leaving other loci with missing data.

If sequences are provided, dart2nexus also expects to find, in the same location, the .csv files that map the fastq files of the sequences to the sample names that are listed in gl$other$loc.metrics. In the .csv, the fastq file names are generally identified as targetid. The sample IDs are typically in a column named genotype. These files also contain the barcode sequence. The same sample may have sequences in multiple targetid.
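
As an illustration of the mapping described above, the sketch below reads a small targets table and groups the fastq files by sample. The column names targetid and genotype come from the description; the barcode column name, the toy values and the assumption that each file is named after its targetid with a ".fastq" extension are all hypothetical.

```r
# Toy targets table (assumed content; real files come from the sequencing provider)
csv_text <- "targetid,genotype,barcode
1001,SampleA,ACGTAC
1002,SampleB,TGCATG
1003,SampleA,GGATCC"

targets <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Assumed naming scheme: one fastq file per targetid
fastq_files <- paste0(targets$targetid, ".fastq")

# One entry per sample, listing every fastq file that holds its reads
files_by_sample <- split(fastq_files, targets$genotype)
files_by_sample
```

Because the same sample can appear under several targetid values, reads for a sample may need to be pooled across more than one file, as for SampleA here.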

Value

Writes a nexus file in dir.out with the name given by nex.out and returns a list with the following elements:

freqSNPs A table with the frequencies of the number of SNPs in each locus
SampleMultAlleles The name of the samples with multiple (i.e. >2) possible alleles and their number
Allele1 The concatenated sequence of the first allele
Allele2 The concatenated sequence of the second allele

References

Trucchi, E., P. Gratton, J. D. Whittington, R. Cristofari, Y. Le Maho, N. C. Stenseth and C. Le Bohec, 2014: King penguin demography since the last glaciation inferred from genome-wide data. Proceedings of the Royal Society B: Biological Sciences, 281, 20140528.

carlopacioni/amplicR documentation built on Aug. 19, 2023, 7:59 p.m.