tidy_genomic_data | R Documentation |
Transform a genomic data set produced by a massively parallel sequencing pipeline (e.g. GBS/RADseq, SNP chip, DArT, etc.) into a tidy format. Blacklists and whitelists, along with several filtering options, are available to prune the dataset. Several arguments are available to make your data population-wise and to easily rename the pop id. Used internally in radiator and assigner and might be of interest to users.
tidy_genomic_data(
data,
strata = NULL,
filename = NULL,
parallel.core = parallel::detectCores() - 1,
verbose = TRUE,
...
)
data |
14 options for input (diploid data only): VCFs (SNPs or Haplotypes,
to make the vcf population ready),
plink (tped, bed), stacks haplotype file, genind (library(adegenet)),
genlight (library(adegenet)), gtypes (library(strataG)), genepop, DArT,
and a data frame in long/tidy or wide format. To verify that radiator detects
your file format, use detect_genomic_format.
DArT and VCF data: radiator was not meant to generate alleles and genotypes if you are using a VCF file with no genotypes (only genotype likelihoods: GL or PL). Neither can radiator magically generate a genind object from a SilicoDArT dataset. Please look at the first few lines of your dataset to understand its limits before asking radiator to convert or filter it. |
strata |
(optional)
The strata file is a tab-delimited file with a minimum of 2 column headers:
|
filename |
(optional) The function uses |
parallel.core |
(optional) The number of cores used for parallel
execution during import.
Default: parallel.core = parallel::detectCores() - 1 |
verbose |
(optional, logical) When |
... |
(optional) To pass further arguments for fine-tuning the function. |
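As a rough sketch of how the parallel.core default shown in the usage resolves on a given machine, the snippet below computes the same expression by hand. The variable name n.cores is hypothetical, and the guard against NA or values below 1 is an assumption added for illustration, not something radiator is documented to do.

```r
# Hypothetical variable: mirror the documented default,
# parallel::detectCores() - 1, but never drop below one core.
n.cores <- parallel::detectCores() - 1
if (is.na(n.cores) || n.cores < 1) {
  n.cores <- 1  # detectCores() can return NA or 1 on some systems
}
n.cores
```

The resulting value could then be passed explicitly, e.g. parallel.core = n.cores.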
The output in your global environment is a tidy data frame.
If filename is provided, the tidy data frame is also written in the working directory with file extension .rad.
The file is written with the Lightning Fast Serialization of Data Frames for R package.
To read the file back in R use read.fst.
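A minimal sketch of reading a .rad file back, assuming the fst package is installed. Since no real radiator output is at hand here, the example first writes a small stand-in data frame with fst::write_fst; the column names and values are illustrative assumptions, not actual radiator output.

```r
library(fst)

# Stand-in for a tidy data frame (columns and values are illustrative only)
tidy.example <- data.frame(
  MARKERS = c("locus_1", "locus_1", "locus_2", "locus_2"),
  INDIVIDUALS = c("ind_1", "ind_2", "ind_1", "ind_2"),
  GT = c("001001", "001002", "002002", "001002"),
  stringsAsFactors = FALSE
)

# Write the data frame with fst, using the .rad extension
rad.file <- file.path(tempdir(), "example.rad")
fst::write_fst(tidy.example, rad.file)

# Read the file back into R as a data frame
tidy.back <- fst::read.fst(rad.file)
head(tidy.back)
```

For a file written by tidy_genomic_data, only the last step would be needed, with rad.file pointing at the file in the working directory.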
VCF files must end with .vcf: documented in tidy_vcf.
PLINK files must end with .tped or .bed: documented in tidy_plink.
genind object from adegenet: documented in tidy_genind.
genlight object from adegenet: documented in tidy_genlight.
gtypes object from strataG: documented in tidy_gtypes.
dart data from DArT: documented in read_dart.
genepop file must end with .gen: documented in tidy_genepop.
fstat file must end with .dat: documented in tidy_fstat.
haplotype file created in STACKS (e.g. data = "batch_1.haplotypes.tsv"). To make the haplotype file population ready, you need the strata argument.
Data frames: documented in tidy_wide.
dots-dots-dots ... allows passing several arguments to fine-tune the function:
vcf.metadata (optional, logical or string). Default: vcf.metadata = TRUE. Documented in tidy_vcf.
vcf.stats (optional, logical). Default: vcf.stats = TRUE. Documented in tidy_vcf.
whitelist.markers (optional, path or object) To keep only markers in a whitelist. Default: whitelist.markers = NULL. Documented in read_whitelist.
blacklist.id (optional) Default: blacklist.id = NULL. Ideally, managed in the strata file. Documented in read_strata and read_blacklist_id.
filter.common.markers (optional, logical). Default: filter.common.markers = TRUE. Documented in filter_common_markers.
filter.monomorphic (optional, logical) Should the monomorphic markers present in the dataset be filtered out? Default: filter.monomorphic = TRUE. Documented in filter_monomorphic.
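The arguments listed above are passed by name after the documented ones. A hedged sketch follows, reusing the file names from the examples on this page (they are assumed to exist in the working directory) and overriding two of the defaults. The call is skipped when radiator or the input files are missing.

```r
vcf.file <- "populations.snps.vcf"     # assumed to exist locally
strata.file <- "strata.treefrog.tsv"   # assumed to exist locally

if (requireNamespace("radiator", quietly = TRUE) &&
    file.exists(vcf.file) && file.exists(strata.file)) {
  tidy.vcf <- radiator::tidy_genomic_data(
    data = vcf.file,
    strata = strata.file,
    filter.monomorphic = FALSE,      # documented default is TRUE
    filter.common.markers = FALSE    # documented default is TRUE
  )
}
```

Turning both filters off keeps monomorphic markers and markers not shared among populations in the tidy output, which may be useful when filtering is handled in a later step.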
Thierry Gosselin thierrygosselin@icloud.com
detect_genomic_format and genomic_converter
## Not run:
# To verify that radiator detects your file as the correct format:
radiator::detect_genomic_format(data = "populations.snps.vcf")
# Using a VCF file as input
require(SeqArray)
tidy.vcf <- tidy_genomic_data(
  data = "populations.snps.vcf", strata = "strata.treefrog.tsv",
  whitelist.markers = "whitelist.vcf.txt")
## End(Not run)