read_dart | R Documentation |
Used internally in radiator and might be of interest to users. The function generates a GDS object/file and, optionally, a tidy dataset from DArT files. Read the Details section to see what the import does and why it's better than dartR.
read_dart(
data,
strata,
filename = NULL,
tidy.dart = FALSE,
calibrate.alleles = TRUE,
verbose = TRUE,
parallel.core = parallel::detectCores() - 1,
...
)
data |
(file) 6 file formats used by DArT are recognized by radiator.
Don't modify the DArT file, to do this, use the
If you encounter a problem, send me your data so that I can update the function. |
strata |
A tab-delimited file or object with a minimum of 3 column headers:
|
filename |
(optional) The function uses |
tidy.dart |
(logical, optional) Generate a tidy dataset.
Default: |
calibrate.alleles |
(optional, logical)
Default: |
verbose |
(optional, logical) When |
parallel.core |
(optional) The number of cores used for parallel
execution during import.
Default: |
... |
(optional) To pass further arguments for fine-tuning the function. |
More details on what is happening under the hood when you import the DArT file in R:
The DArT file is imported
DArT files should not be modified.
Many import problems originate from file modifications, so a couple of common checks are done.
The format (1row, 2rows, counts, silico) is interpreted.
The number of target ids is checked.
The strata file is imported:
This is the file that needs modifications.
This is where you change bad sample names.
Remove target ids (blacklist samples that you no longer want).
Change the order of the populations/sampling sites.
Alternatively, you can use the pop.levels argument (see dots-dots-dots ... in the advanced mode section below).
The target ids between the DArT and strata files are verified and the files are merged.
The data is inspected for duplicated names
DArT has changed the column names in their files over the years, so radiator tidies them:
column names in camel case are changed to snake case
ALLELE_SEQUENCE is changed to SEQUENCE
TRIMMED_SEQUENCE is changed to SEQUENCE
CLUSTER_CONSENSUS_SEQUENCE is changed to SEQUENCE
Genomic metadata are named and/or renamed based on the Variant Call Format Specification: CHROM, LOCUS, POS, COL, REF, ALT
With this function you have the option to tidy the DArT file:
What's that? See the explanation in R for Data Science.
It takes longer and you need more memory, but if you can afford it, it's better for inspection and visualisation.
Alternatively, you can wait until the data is filtered and then generate a tidy dataset with tidy_genomic_data.
REF and ALT alleles are re-calibrated with calibrate_alleles:
This is not optional
It takes longer than just reading the file like other software and packages, but it's better.
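Two of the steps above can be sketched in base R: editing the strata (rather than the DArT file) and tidying DArT column headers. This is only an illustration, not radiator's internal code. The strata column names `TARGET_ID`, `INDIVIDUALS` and `STRATA`, the toy values, the helper `tidy_dart_names()` and the upper-case output are assumptions made for the sketch; check the `strata` argument documentation for the headers read_dart actually requires.

```r
# --- Part 1: modify the strata, never the DArT file -------------------------
# Toy strata; column names and values are assumptions for this sketch.
strata <- data.frame(
  TARGET_ID = c("1001", "1002", "1003", "1004"),
  INDIVIDUALS = c("north 01", "north_02", "south_01", "south_02"),
  STRATA = c("north", "north", "south", "south"),
  stringsAsFactors = FALSE
)

# Change a bad sample name (here: a space inside the name).
strata$INDIVIDUALS <- gsub(" ", "_", strata$INDIVIDUALS, fixed = TRUE)

# Remove target ids of samples that are no longer wanted.
blacklist <- c("1003")
strata <- strata[!strata$TARGET_ID %in% blacklist, ]

# Change the order of the populations by sorting the rows.
# Whether row order alone drives population order in radiator is an
# assumption; the documented alternative is the pop.levels argument.
strata$STRATA <- factor(strata$STRATA, levels = c("south", "north"))
strata <- strata[order(strata$STRATA), ]

# Same kind of check as the duplicated-names inspection described above.
stopifnot(anyDuplicated(strata$INDIVIDUALS) == 0)

# Write a tab-delimited strata file (temporary directory for the example).
strata_file <- file.path(tempdir(), "example.strata.tsv")
write.table(strata, strata_file, sep = "\t", quote = FALSE, row.names = FALSE)

# --- Part 2: tidy DArT column headers ---------------------------------------
# Hypothetical helper: camel case -> snake case (upper-cased here), then the
# three sequence columns named in the text are renamed to SEQUENCE.
tidy_dart_names <- function(x) {
  x <- gsub("([a-z0-9])([A-Z])", "\\1_\\2", x)
  x <- toupper(x)
  seq_cols <- c("ALLELE_SEQUENCE", "TRIMMED_SEQUENCE",
                "CLUSTER_CONSENSUS_SEQUENCE")
  x[x %in% seq_cols] <- "SEQUENCE"
  x
}

tidy_dart_names(c("AlleleSequence", "RepAvg", "AvgCountRef"))
# "SEQUENCE" "REP_AVG" "AVG_COUNT_REF"
```

The resulting file path (`strata_file`) is what would be passed to the `strata` argument, while the DArT file itself stays untouched.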
A radiator GDS file and tidy dataframe with several columns depending on DArT file:
silico.dart:
A tibble with 5 columns: CLONE_ID, SEQUENCE, VALUE, INDIVIDUALS, STRATA.
This object is also saved in the directory (file ending with .rad).
Common to 1row, 2rows and counts: A GDS file is automatically generated.
To have a tidy tibble, the argument tidy.dart = TRUE must be used.
VARIANT_ID: generated by radiator and corresponds to the markers as integers.
MARKERS: generated by radiator and corresponds to CHROM + LOCUS + POS separated by 2 underscores.
CHROM: the chromosome info, for de novo: CHROM_1.
LOCUS: the locus info.
POS: the SNP id on the LOCUS.
COL: the position of the SNP on the short read.
REF: the reference allele.
ALT: the alternate allele.
INDIVIDUALS: the sample name.
STRATA/POP_ID: populations id of the sample.
GT_BIN: the genotype based on the number of alternate alleles in the genotype
(the count/dosage of the alternate allele): 0, 1, 2, NA.
REP_AVG: the reproducibility average, output specific of DArT.
Other columns potentially in the tidy tibble:
GT: the genotype in 6 digit format à la genepop.
GT_VCF: the genotype in VCF format: 0/0, 0/1, 1/1, ./.
GT_VCF_NUC: the genotype in VCF format, but keeping the nucleotide information: A/A, A/T, T/T, ./.
AVG_COUNT_REF: the coverage for the reference allele, output specific of DArT.
AVG_COUNT_SNP: the coverage for the alternate allele, output specific of DArT.
READ_DEPTH: the number of reads used for the genotype (count data).
ALLELE_REF_DEPTH: the number of reads of the reference allele (count data).
ALLELE_ALT_DEPTH: the number of reads of the alternate allele (count data).
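How some of these columns relate to each other can be sketched in base R. The example below builds MARKERS from CHROM, LOCUS and POS, and derives GT_BIN and GT_VCF_NUC from GT_VCF. The helpers `gt_vcf_to_bin()` and `gt_vcf_to_nuc()` and the toy values are hypothetical and written for illustration; they are not radiator functions and say nothing about how radiator computes these columns internally.

```r
# Toy genomic metadata (hypothetical values).
markers <- data.frame(
  CHROM = c("CHROM_1", "CHROM_1"),
  LOCUS = c("1001", "1002"),
  POS = c("12", "45"),
  REF = c("A", "C"),
  ALT = c("T", "G"),
  stringsAsFactors = FALSE
)

# MARKERS: CHROM + LOCUS + POS separated by 2 underscores.
markers$MARKERS <- paste(markers$CHROM, markers$LOCUS, markers$POS, sep = "__")
markers$MARKERS[1]
# "CHROM_1__1001__12"

# The four GT_VCF codes listed above.
gt_vcf <- c("0/0", "0/1", "1/1", "./.")

# GT_BIN: count (dosage) of the alternate allele, NA when missing.
gt_vcf_to_bin <- function(gt) {
  vapply(strsplit(gt, "/", fixed = TRUE), function(alleles) {
    if (any(alleles == ".")) NA_integer_ else sum(alleles == "1")
  }, integer(1), USE.NAMES = FALSE)
}

# GT_VCF_NUC: same layout as GT_VCF, with 0/1 replaced by the REF/ALT bases.
gt_vcf_to_nuc <- function(gt, ref, alt) {
  vapply(strsplit(gt, "/", fixed = TRUE), function(alleles) {
    if (any(alleles == ".")) return("./.")
    paste(ifelse(alleles == "0", ref, alt), collapse = "/")
  }, character(1), USE.NAMES = FALSE)
}

gt_vcf_to_bin(gt_vcf)
# 0 1 2 NA
gt_vcf_to_nuc(gt_vcf, ref = "A", alt = "T")
# "A/A" "A/T" "T/T" "./."
```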
Written in the working directory:
The radiator GDS file
The DArT metadata information
The tidy DArT data
The strata file associated with this tidy dataset
The allele dictionary is a tibble with columns: MARKERS, CHROM, LOCUS, POS, REF, ALT.
dots-dots-dots ... allows passing several arguments for fine-tuning the function:
whitelist.markers: detailed in filter_whitelist.
Default: whitelist.markers = NULL.
missing.memory: (optional, path) This argument allows erasing genotypes that have bad statistics.
It's the path to a .rad file that contains 3 columns: MARKERS, INDIVIDUALS, ERASE. The file is produced by several radiator functions. For DArT data, filter_rad generates the file.
Default: missing.memory = NULL. Currently not used.
path.folder: (optional, path) To write output in a specific folder.
Default: path.folder = NULL. The working directory is used.
pop.levels: detailed in tidy_genomic_data.
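As a rough sketch of how these extra arguments might be supplied, the call below collects the documented arguments in a list and passes `path.folder` and `pop.levels` through `...`. The file names are the ones used in the Examples section; the folder name and the population labels `"north"` and `"south"` are hypothetical. Whether `path.folder` must already exist is not stated here, so that is left as an open assumption. The call only runs when radiator is installed and both input files are present.

```r
# Arguments documented on this page; values are placeholders for illustration.
dart.args <- list(
  data = "clownfish.dart.csv",
  strata = "clownfish.strata.tsv",
  tidy.dart = TRUE,                     # also generate the tidy dataset
  path.folder = "read_dart_output",     # hypothetical output folder (via ...)
  pop.levels = c("north", "south")      # hypothetical population order (via ...)
)

# Guarded call: skipped unless the package and the input files are available.
inputs.ready <- requireNamespace("radiator", quietly = TRUE) &&
  file.exists(dart.args$data) &&
  file.exists(dart.args$strata)

if (inputs.ready) {
  clownfish.dart.tidy <- do.call(radiator::read_dart, dart.args)
}
```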
Thierry Gosselin thierrygosselin@icloud.com
extract_dart_target_id
## Not run:
clownfish.dart.tidy <- radiator::read_dart(
data = "clownfish.dart.csv",
strata = "clownfish.strata.tsv"
)
## End(Not run)