read_dart: Read and tidy DArT output files.

read_dart R Documentation

Read and tidy DArT output files.

Description

Used internally in radiator and might be of interest for users. The function generates a GDS object/file and, optionally, a tidy dataset from DArT files.

Usage

read_dart(
  data,
  strata,
  filename = NULL,
  tidy.dart = FALSE,
  verbose = FALSE,
  parallel.core = parallel::detectCores() - 1,
  ...
)

Arguments

data

One of the DArT output files. radiator recognizes 6 formats used by DArT:

  1. 1row: Genotypes are in 1 row and coded (0, 1, 2, -). 0 for 2 reference alleles REF/REF, 1 for 2 alternate alleles ALT/ALT, 2 for heterozygote REF/ALT, - for missing.

  2. 2rows: No genotypes. It's absence/presence, 0/1, of the REF and ALT alleles. Sometimes called binary format.

  3. counts: No genotypes. It's counts/read depth for the REF and ALT alleles. Sometimes just called count data.

  4. silico.dart: SilicoDArT data. No genotypes, no REF or ALT alleles. It's a file coded as absence/presence, 0/1, for the presence of sequence in the clone id.

  5. silico.dart.counts: SilicoDArT data. No genotypes, no REF or ALT alleles. It's a file coded as absence/presence, with counts for the presence of sequence in the clone id.

  6. dart.vcf: For DArT VCFs, please use read_vcf.

Depending on the number of markers, these formats will be recoded similarly to VCF files (dosage of alternate allele, see details).
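As a rough illustration of the recoding mentioned above, the sketch below converts 1row codes to the dosage of the alternate allele using base R only. The helper name recode_1row is hypothetical and is not part of radiator; the recoding done internally by read_dart may differ.

```r
# Hypothetical helper (not part of radiator): recode DArT 1row genotypes
# to the dosage of the alternate allele.
# 1row coding: 0 = REF/REF, 1 = ALT/ALT, 2 = REF/ALT, "-" = missing.
recode_1row <- function(x) {
  dosage <- c("0" = 0L, "1" = 2L, "2" = 1L)
  # Codes absent from the lookup table (e.g. "-") become NA.
  unname(dosage[as.character(x)])
}

recode_1row(c("0", "1", "2", "-"))
# 0 2 1 NA
```

Note that the 1row code for ALT/ALT is 1, while its dosage is 2, so the two codings cannot be used interchangeably.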

The function can import .csv or .tsv files.

If you encounter a problem, send me your data so that I can update the function.

strata

A tab-delimited file or object with 3 columns. The column headers are: TARGET_ID, INDIVIDUALS and STRATA. Note: the column STRATA refers to any grouping of individuals. You need to make sure that the column TARGET_ID matches the id used by DArT. With the counts format, the TARGET_ID is a series of integers. With 1row and 2rows, the TARGET_ID is actually the sample name submitted to DArT. The columns INDIVIDUALS and STRATA will be kept in the tidy data. Only individuals in the strata file are kept in the tidy data, i.e. the strata is also used as a whitelist of individuals/strata. SilicoDArT data is currently used to detect sex markers, so the STRATA column should be filled with sex information: M or F.

See the example on how to extract the TARGET_ID of your DArT file.

example.dart.strata.tsv.
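A minimal sketch of a strata file with the three expected columns, written as a tab-delimited file with base R. The sample names, groupings and file name below are made up for illustration; the TARGET_ID values must be replaced with the ids found in your own DArT file.

```r
# Build a strata table with the three expected columns.
# All values are placeholders, not real DArT ids.
strata <- data.frame(
  TARGET_ID   = c("1", "2", "3"),
  INDIVIDUALS = c("ind-01", "ind-02", "ind-03"),
  STRATA      = c("pop1", "pop1", "pop2"),
  stringsAsFactors = FALSE
)

# Write it as a tab-delimited file (here in a temporary directory).
strata.file <- file.path(tempdir(), "example.strata.tsv")
write.table(strata, file = strata.file, sep = "\t",
            quote = FALSE, row.names = FALSE)

# Read it back to check the header.
strata.check <- read.delim(strata.file, stringsAsFactors = FALSE)
names(strata.check)
```

The path stored in strata.file could then be supplied to the strata argument.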

filename

(optional) The function uses write.fst to write the tidy data frame in the working directory. The file extension appended to the filename provided is .rad. With the default, filename = NULL, the tidy data frame is kept in the global environment only (i.e. not written in the working directory).

tidy.dart

(logical, optional) Generate a tidy dataset. Default: tidy.dart = FALSE.

verbose

(optional, logical) When verbose = TRUE the function is a little more chatty during execution. Default: verbose = FALSE.

parallel.core

(optional) The number of cores used for parallel execution during import. Default: parallel.core = parallel::detectCores() - 1.
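The default expression evaluates to 0 on a single-core machine, and parallel::detectCores() may return NA when the number of cores is unknown. The sketch below shows one way to compute a safe value before passing it to parallel.core. The helper safe_cores is hypothetical and not part of radiator, and read_dart may already handle these cases internally.

```r
# Hypothetical helper (not part of radiator): number of cores to use,
# leaving one core free but never going below 1.
safe_cores <- function(n = parallel::detectCores()) {
  if (is.na(n)) return(1L)       # number of cores unknown
  max(1L, as.integer(n) - 1L)    # at least 1 core
}

safe_cores()
```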

...

(optional) To pass further arguments for fine-tuning the function.

Value

A radiator GDS file and a tidy data frame whose columns depend on the DArT file.

silico.dart: A tibble with 5 columns: CLONE_ID, SEQUENCE, VALUE, INDIVIDUALS, STRATA. This object is also saved in the directory (file ending with .rad).

Common to 1row, 2rows and counts: A GDS file is automatically generated. To get a tidy tibble, the argument tidy.dart = TRUE must be used. The tidy tibble columns are:

  1. VARIANT_ID: generated by radiator and corresponds to the marker as an integer.

  2. MARKERS: generated by radiator and corresponds to CHROM + LOCUS + POS separated by 2 underscores.

  3. CHROM: the chromosome info, for de novo: CHROM_1.

  4. LOCUS: the locus info.

  5. POS: the SNP id on the LOCUS.

  6. COL: the position of the SNP on the short read.

  7. REF: the reference allele.

  8. ALT: the alternate allele.

  9. INDIVIDUALS: the sample name.

  10. STRATA/POP_ID: the population id of the sample.

  11. GT_BIN: the genotype coded as the count/dosage of the alternate allele: 0, 1, 2, NA.

  12. REP_AVG: the reproducibility average, output specific to DArT.
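To make the MARKERS column concrete, the sketch below assembles it from CHROM, LOCUS and POS with 2 underscores as separator, using base R. The values are invented for illustration, and the code only mimics the naming convention described above, not radiator's internal implementation.

```r
# Made-up marker metadata for illustration.
markers.meta <- data.frame(
  CHROM = c("CHROM_1", "CHROM_1"),
  LOCUS = c("1001", "1002"),
  POS   = c("12", "45"),
  stringsAsFactors = FALSE
)

# MARKERS = CHROM + LOCUS + POS separated by 2 underscores.
markers.meta$MARKERS <- paste(
  markers.meta$CHROM, markers.meta$LOCUS, markers.meta$POS,
  sep = "__"
)

markers.meta$MARKERS
# "CHROM_1__1001__12" "CHROM_1__1002__45"
```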

Other columns potentially in the tidy tibble:

  1. GT: the genotype in 6 digit format à la genepop.

  2. GT_VCF: the genotype in VCF format 0/0, 0/1, 1/1, ./..

  3. GT_VCF_NUC: the genotype in VCF format, but keeping the nucleotide information. A/A, A/T, T/T, ./.

  4. AVG_COUNT_REF: the coverage for the reference allele, output specific to DArT.

  5. AVG_COUNT_SNP: the coverage for the alternate allele, output specific to DArT.

  6. READ_DEPTH: the number of reads used for the genotype (count data).

  7. ALLELE_REF_DEPTH: the number of reads of the reference allele (count data).

  8. ALLELE_ALT_DEPTH: the number of reads of the alternate allele (count data).
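The relationship between GT_BIN and GT_VCF can be sketched in a few lines of base R. The helper gt_bin_to_vcf is hypothetical and not part of radiator; it only illustrates how the dosage of the alternate allele maps to the VCF-style genotype strings listed above.

```r
# Hypothetical helper (not part of radiator): convert the dosage of the
# alternate allele (GT_BIN: 0, 1, 2, NA) to VCF-style genotypes.
gt_bin_to_vcf <- function(gt.bin) {
  vcf <- c("0/0", "0/1", "1/1")[gt.bin + 1L]
  vcf[is.na(vcf)] <- "./."   # missing genotypes
  vcf
}

gt_bin_to_vcf(c(0L, 1L, 2L, NA))
# "0/0" "0/1" "1/1" "./."
```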

Written in the working directory:

  • The radiator GDS file

  • The DArT metadata information

  • The tidy DArT data

  • The strata file associated with this tidy dataset

  • The allele dictionary is a tibble with columns: MARKERS, CHROM, LOCUS, POS, REF, ALT.

Advanced mode

The dots-dots-dots (...) argument allows passing several arguments for fine-tuning the function:

  1. whitelist.markers: detailed in filter_whitelist. Default: whitelist.markers = NULL.

  2. missing.memory: (optional, path) This argument allows erasing genotypes that have bad statistics. It's the path to a .rad file that contains 3 columns: MARKERS, INDIVIDUALS, ERASE. The file is produced by several radiator functions. For DArT data, filter_rad generates the file. Default: missing.memory = NULL. Currently not used.

  3. path.folder: (optional, path) To write output in a specific folder. Default: path.folder = NULL. The working directory is used.

  4. pop.levels: detailed in tidy_genomic_data.

Author(s)

Thierry Gosselin thierrygosselin@icloud.com

See Also

extract_dart_target_id

Examples

## Not run: 
clownfish.dart.tidy <- radiator::read_dart(
    data = "clownfish.dart.csv",
    strata = "clownfish.strata.tsv"
    )

## End(Not run)

thierrygosselin/radiator documentation built on Nov. 7, 2024, 1:30 p.m.