tidy_genomic_data: Transform common genomic dataset formats into a tidy data frame
In thierrygosselin/radiator: RADseq Data Exploration, Manipulation and Visualization using R

tidy_genomic_data

R Documentation

Description

Transform genomic datasets produced by massively parallel sequencing pipelines (e.g. GBS/RADseq, SNP chip, DArT) into a tidy format. Blacklists, whitelists and several filtering options are available to prune the dataset. Several arguments are available to make your data population-wise and to rename the pop id easily. Used internally in radiator and assigner, and might be of interest to users.

Usage

tidy_genomic_data(
  data,
  strata = NULL,
  filename = NULL,
  parallel.core = parallel::detectCores() - 1,
  verbose = TRUE,
  ...
)
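
The default `parallel.core = parallel::detectCores() - 1` shown above evaluates to 0 on a single-core machine, and `parallel::detectCores()` can return `NA` when the core count is unknown. How radiator handles such values is not stated here, so the following is only a minimal sketch of a defensive value to pass; `safe_cores` is a hypothetical helper, not part of radiator.

```r
# Hypothetical helper (not part of radiator): always returns at least 1 core.
safe_cores <- function(n = parallel::detectCores()) {
  # detectCores() may return NA when the number of cores cannot be determined
  if (is.na(n)) return(1L)
  # Leave one core free, but never go below 1
  max(1L, as.integer(n) - 1L)
}

n.cores <- safe_cores()
```

The result can then be supplied as `parallel.core = n.cores`.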

Arguments

`data`	14 options for input (diploid data only): VCFs (SNPs or haplotypes, to make the VCF population-ready), PLINK (tped, bed), stacks haplotype file, genind (library(adegenet)), genlight (library(adegenet)), gtypes (library(strataG)), genepop, DArT, and a data frame in long/tidy or wide format. To verify that radiator detects your file format, use `detect_genomic_format` (see example below). The formats are documented in the section Input genomic datasets of `tidy_genomic_data`. DArT and VCF data: radiator cannot generate alleles and genotypes from a VCF file that has no genotypes (only genotype likelihoods: GL or PL), nor can it generate a genind object from a SilicoDArT dataset. Look at the first few lines of your dataset to understand its limits before asking radiator to convert or filter it.
`strata`	(optional) The strata file is a tab-delimited file with at least 2 column headers: `INDIVIDUALS` and `STRATA`. Documented in `read_strata`. DArT data: a third column, `TARGET_ID`, is required. Documented in `read_dart`. The strata read function can also be used to blacklist individuals. Default: `strata = NULL`.
`filename`	(optional) The function uses `write.fst` to write the tidy data frame in the working directory. The file extension appended to the `filename` provided is `.rad`. With the default, `filename = NULL`, the tidy data frame is in the global environment only (i.e. not written in the working directory).
`parallel.core`	(optional) The number of cores used for parallel execution during import. Default: `parallel.core = parallel::detectCores() - 1`.
`verbose`	(optional, logical) When `verbose = TRUE` the function is a little more chatty during execution. Default: `verbose = TRUE`.
`...`	(optional) To pass further arguments for fine-tuning the function.
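
As an illustration of the `strata` argument described above, the following sketch writes a tab-delimited file with the two required headers, `INDIVIDUALS` and `STRATA`, using base R only. The sample IDs, population labels and file name are made up for illustration.

```r
# Toy strata table: sample IDs and population labels are made up.
strata <- data.frame(
  INDIVIDUALS = c("ind_01", "ind_02", "ind_03", "ind_04"),
  STRATA = c("pop_A", "pop_A", "pop_B", "pop_B"),
  stringsAsFactors = FALSE
)

# Write a tab-delimited file with a header row and no row names
strata.file <- file.path(tempdir(), "strata.example.tsv")
write.table(strata, file = strata.file, sep = "\t", quote = FALSE, row.names = FALSE)

# Read it back to confirm the two required headers are present
check <- read.delim(strata.file, stringsAsFactors = FALSE)
stopifnot(all(c("INDIVIDUALS", "STRATA") %in% names(check)))
```

The resulting path could then be passed as `strata = strata.file`. For DArT data, a `TARGET_ID` column would have to be added as well.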

Value

The output in your global environment is a tidy data frame. If filename is provided, the tidy data frame is also written to the working directory with the file extension .rad. The file is written with the fst package (Lightning Fast Serialization of Data Frames for R). To read the file back into R, use read.fst.
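
The write and read-back round trip described above can be sketched with the fst package directly. This assumes fst is installed; the toy data frame and its column names are a made-up stand-in, not actual radiator output. The sketch uses `write_fst` and `read_fst`, the underscore-named counterparts of `write.fst` and `read.fst` in fst.

```r
# Assumes the fst package is installed.
library(fst)

# Made-up stand-in for a tidy dataset (column names are illustrative only)
toy <- data.frame(
  MARKERS = c("m1", "m1", "m2", "m2"),
  INDIVIDUALS = c("ind_01", "ind_02", "ind_01", "ind_02"),
  GT = c("001001", "001002", "002002", "001002"),
  stringsAsFactors = FALSE
)

# Write with a .rad extension, then read the file back as a data frame
rad.file <- file.path(tempdir(), "toy.rad")
write_fst(toy, rad.file)
toy.back <- read_fst(rad.file)
```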

Input genomic datasets

VCF files must end with .vcf: documented in tidy_vcf.
PLINK files must end with .tped or .bed: documented in tidy_plink.
genind object from adegenet: documented in tidy_genind.
genlight object from adegenet: documented in tidy_genlight.
gtypes object from strataG: documented in tidy_gtypes.
DArT data: documented in read_dart.
genepop files must end with .gen: documented in tidy_genepop.
fstat files must end with .dat: documented in tidy_fstat.
haplotype file created in STACKS (e.g. data = "batch_1.haplotypes.tsv"): to make the haplotype file population-ready, you need the strata argument.
Data frames: documented in tidy_wide.
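
The extension rules in the list above can be summarised as a small lookup. This is only a sketch: `format_doc` is a hypothetical helper, not part of radiator, and it does not replace `detect_genomic_format`. It returns the name of the function that documents a file-based format, or `NA` when the extension is not one of those listed.

```r
# Hypothetical helper (not part of radiator): maps a file extension to the
# function that documents the corresponding format, following the list above.
format_doc <- function(path) {
  ext <- tools::file_ext(path)
  docs <- c(
    vcf = "tidy_vcf",
    tped = "tidy_plink",
    bed = "tidy_plink",
    gen = "tidy_genepop",
    dat = "tidy_fstat"
  )
  # Unknown extensions (e.g. a STACKS haplotypes .tsv) give NA
  if (ext %in% names(docs)) docs[[ext]] else NA_character_
}

format_doc("populations.snps.vcf")
format_doc("batch_1.haplotypes.tsv")
```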

Advanced mode

The dots argument (...) allows passing several arguments for fine-tuning the function:

vcf.metadata (optional, logical or string). Default: vcf.metadata = TRUE. Documented in tidy_vcf.
vcf.stats (optional, logical). Default: vcf.stats = TRUE. Documented in tidy_vcf.
whitelist.markers (optional, path or object) To keep only markers in a whitelist. Default: whitelist.markers = NULL. Documented in read_whitelist.
blacklist.id (optional) Default: blacklist.id = NULL. Ideally, managed in the strata file. Documented in read_strata and read_blacklist_id.
filter.common.markers (optional, logical). Default: filter.common.markers = TRUE. Documented in filter_common_markers.
filter.monomorphic (optional, logical) Should the monomorphic markers present in the dataset be filtered out? Default: filter.monomorphic = TRUE. Documented in filter_monomorphic.
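
A call that passes some of the arguments listed above through `...` might look like the following sketch. The file names are placeholders reused from the Examples section, and the call is guarded so that it only runs when radiator is installed and the input files exist. The choice of `vcf.stats = FALSE` is arbitrary and only shows how a default is overridden.

```r
# Arguments collected in a list; file names are placeholders.
tidy.args <- list(
  data = "populations.snps.vcf",
  strata = "strata.treefrog.tsv",
  whitelist.markers = "whitelist.vcf.txt",
  filter.common.markers = TRUE,
  filter.monomorphic = TRUE,
  vcf.stats = FALSE
)

# Only run when radiator is installed and all input files are present
inputs <- unlist(tidy.args[c("data", "strata", "whitelist.markers")])
if (requireNamespace("radiator", quietly = TRUE) && all(file.exists(inputs))) {
  tidy.vcf <- do.call(radiator::tidy_genomic_data, tidy.args)
}
```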

Author(s)

Thierry Gosselin thierrygosselin@icloud.com

Examples

## Not run: 
# To verify that radiator detects your file as the correct format:
radiator::detect_genomic_format(data = "populations.snps.vcf")

# Using a VCF file as input (requires the SeqArray package)
library(SeqArray)
tidy.vcf <- radiator::tidy_genomic_data(
   data = "populations.snps.vcf", strata = "strata.treefrog.tsv",
   whitelist.markers = "whitelist.vcf.txt")

## End(Not run)

thierrygosselin/radiator documentation built on July 4, 2025, 7:52 a.m.