# Read VCF with SeqArray--------------------------------------------------------
# write a SeqArray object from a tidy data frame
#' @name read_vcf
#' @title Read VCF files and write a GDS file
#' @description The function reads VCF files for radiator and
#' generate a connection \href{https://github.com/zhengxwen/SeqArray}{SeqArray}
#' GDS object/file of class \code{SeqVarGDSClass} (Zheng et al. 2017).
#' The Genomic Data Structure (GDS) file format is detailed in
#' \href{https://github.com/zhengxwen/gdsfmt}{gdsfmt}.
#'
#' The function as an advance mode that allows various filtering arguments to
#' be used. Handy to prune the dataset based on various QCs but also to
#' remove artifacts, duplicated information and thus lower the overall noise in
#' the VCF.
#'
#' Used internally in \href{https://github.com/thierrygosselin/radiator}{radiator}
#' and might be of interest for users.
#' @param data (path, character) The VCF markers are biallelic SNPs or haplotypes.
#' To make the VCF population-ready, you have to use \code{strata} argument.
#'
#' \itemize{
#' \item \strong{GATK, platypus, samtools and ipyrad VCFs:}
#' Some VCFs have an \code{ID} column filled with \code{.},
#' the LOCUS information is all contained along the linkage group in the
#' \code{CHROM} column. Consequently, the short read locus information is unknown.
#' To make it work with
#' \href{https://github.com/thierrygosselin/radiator}{radiator},
#' the \code{ID} column is filled with the \code{POS} column info.
#' \item \strong{stacks VCFs:} with \emph{de novo} approaches, the CHROM column is
#' filled with "1", the LOCUS column correspond to the CHROM section in stacks VCF and
#' the COL column is POS -1. With a reference genome, the ID column in stacks VCF is
#' separated into "LOCUS", "COL", "STRANDS".
#' \item \strong{DArT VCFs:} CHROM with \code{.} are replaced by \code{denovo}.
#' \code{POS} with \code{NA} are replaced by {50}.
#' \code{COL}: the position of the SNP on the read is extracted from the
#' \code{LOCUS} as any other DArT data.
#' \code{LOCUS}: are the first group of digits the rest of the info is discarded
#' and \code{LOCUS} is then joined with \code{POS} by a \code{_}.
#' \code{POS} is replaced by \code{COL}, (col info is duplicated).
#' This is necessary to make the lines in the VCF unique.
#' }
#' @param filename (optional) The file name of the Genomic Data Structure (GDS) file.
#' radiator will append \code{.gds.rad} to the filename.
#' If the filename chosen exists in the working directory,
#' the default \code{radiator_datetime.gds} is chosen.
#' Default: \code{filename = NULL}.
#' @param vcf.stats (optional, logical) Generates individuals and markers
#' statistics helpful for filtering.
#' These are very fast to generate and because computational
#' cost is minimal, even for huge VCFs, the default is \code{vcf.stats = TRUE}.
#' Individual's missing genotype proportion, averaged heterozygosity,
#' total coverage, mean genotype coverage and marker's metadata
#' along count for ref and alt alleles and mean
#' coverage is generated and written in the working directory.
#' Default: \code{vcf.stats = TRUE}.
#' @inheritParams tidy_genomic_data
#' @inheritParams radiator_common_arguments
#' @details
#' A vcf file of 35 GB with ~4 millions SNPs take about ~7 min with 8 CPU.
#' A vcf file of 21 GB with ~2 millions SNPs take about ~5 min with 7 CPU.
#'
#' After the file is generated, you can close your computer and
#' come back to it a months later and it's now a matter of sec to open a connection.
#' See example below.
#'
#'
#' \strong{markers heterozygosity and inbreeding coefficient}:
#' The heterozygosity statistics (het obs and Fis) presented in this function
#' are calculated globally across strata, without the possibility to filter out
#' markers. We think this filtering for these statistics are best
#' addressed in \code{\link{filter_het}}, \code{\link{filter_fis}} and
#' \code{\link{filter_hwe}}, after outlier individuals and
#' markers (for other stats) are blacklisted. Why ? The reason is that an excess of
#' heterozygotes and negative Fis values are not a bad thing \emph{per se}.
#' Genomic regions under balancing selection may contain such markers and
#' statistics.
#' @section VCF file format:
#'
#' \strong{PLINK:} radiator fills the \code{LOCUS} column of PLINK VCFs with
#' a unique integer based on the \code{CHROM} column
#' (\code{as.integer(factor(x = CHROM))}).
#' The \code{COL} column is filled with 1L for lack of bettern info on this.
#' Not what you need ? Open an issue on GitHub for a request.
#'
#' \strong{ipyrad:} the pattern \code{locus_} in the \code{CHROM} column
#' is removed and used. The \code{COL} column is filled with the same value as
#' \code{POS}.
#'
#' \strong{GATK:} Some VCF have an \code{ID} column filled with \code{.},
#' the LOCUS information is all contained along the linkage group in the
#' \code{CHROM} column. To make it work with
#' \href{https://github.com/thierrygosselin/radiator}{radiator},
#' the \code{ID} column is filled with the \code{POS} column info.
#'
#' \strong{platypus:} Some VCF files don't have an ID filed with values,
#' here the same thing is done as GATK VCF files above.
#'
#' \strong{freebayes:} Some VCF files don't have an ID filed with values,
#' here the same thing is done as GATK VCF files above.
#' \strong{samtools:} Some VCF files don't have an ID filed with values,
#' here the same thing is done as GATK VCF files above.
#' \strong{stacks:} with \emph{de novo} approaches, the CHROM column is
#' filled with "1", the LOCUS column correspond to the CHROM section in stacks VCF and
#' the COL column is POS -1. With a reference genome, the ID column in stacks VCF is
#' separated into "LOCUS", "COL", "STRANDS".
#'
#' \emph{stacks problem: } current version as some intrinsic problem with
#' missing allele depth info, during the tidying process a message will
#' highlight the number of genotypes impacted by the problem. When possible, the
#' problem is corrected by adding the read depth info into the allele depth field.
#' @section Advance mode:
#'
#' \emph{dots-dots-dots ...} allows to pass several arguments for fine-tuning the function:
#' \enumerate{
#' \item \code{whitelist.markers}: detailed in \code{\link[radiator]{filter_whitelist}}.
#' \item \code{blacklist.id}: detailed in \code{\link[radiator]{tidy_genomic_data}}.
#' \item \code{pop.select}: detailed in \code{\link[radiator]{tidy_genomic_data}}.
#' \item \code{pop.levels}: detailed in \code{\link[radiator]{tidy_genomic_data}}.
#' \item \code{pop.labels}: detailed in \code{\link[radiator]{tidy_genomic_data}}.
#' \item \code{filter.strands}: (optional, character) Filter duplicate SNPs
#' found on different strands (+/-), options are:
#' \code{"keep.both", "best.stats", "blacklist"}. \code{"keep.both"}: does nothing
#' and duplicated markers are kept (not recommended, but here for testing purposes),
#' \code{"best.stats"}: will keep only one, based on the best statistics
#' (MAC and missingness). \code{"blacklist"}: discard all duplicated markers.
#' Default (\code{filter.strands = "blacklist"}).
#' \item \code{filter.common.markers}: (logical) Default: \code{filter.common.markers = TRUE}.
#' Documented in \code{\link{filter_common_markers}}.
#' \item \code{filter.ma}: (integer) Default: \code{filter.ma = NULL}.
#' To blacklist markers below a specific Minor Allele Count, Frequency or Depth (calculated overall/global).
#' Documented in \code{\link{filter_ma}}.
#'
#' \item \code{filter.coverage}: (optional, string)
#' Default: \code{filter.coverage = NULL}. To blacklist markers based on mean coverage.
#' Documented in \code{\link{filter_coverage}}.
#'
#' \item \code{filter.genotyping}: (integer) To blacklist markers with too
#' many missing data. e.g. \code{filter.genotyping = 0.1}, will only keep
#' markers with missing rate <= to 10 percent.
#' Documented in \code{\link{filter_genotyping}}.
#' \item \code{filter.snp.position.read}: 3 options available, \code{"outliers", "q75", "iqr"}.
#' This argument will blacklist markers based on it's position on the read.
#' \code{filter.snp.position.read = "outliers"}, will remove markers based
#' on outlier statistics of the position on the reads. e.g. if majority of SNPs
#' are found between 10 and 90 pb, and very few above and below, those markers are
#' discarded. Use this function argument with dataset with problematic assembly,
#' or \emph{de novo} assembly with undocumented or poorly selected
#' mismatch threshold.
#' \item \code{filter.short.ld}: Described in \code{\link[radiator]{filter_ld}}
#' \item \code{filter.long.ld}: Described in \code{\link[radiator]{filter_ld}}
#' \item \code{long.ld.missing}: Described in \code{\link[radiator]{filter_ld}}
#' \item \code{ld.method}: (optional, character) The values available are
#' \code{"composite"}, for LD composite measure, \code{"r"} for R coefficient
#' (by EM algorithm assuming HWE, it could be negative), \code{"r2"} for r^2,
#' \code{"dprime"} for D',
#' \code{"corr"} for correlation coefficient. The method corr and composite are
#' equivalent when SNPs are coded based on the presence of the alternate allele
#' (\code{0, 1, 2}).
#' Default: \code{ld.method = "r2"}.
#' \item \code{filter.individuals.missing}: (double) Use this argument to
#' blacklist individuals with too many missing data.
#' e.g. \code{filter.individuals.missing = 0.7}, will remove individuals with >
#' 0.7 or 70% missing genotypes. This can help discover more polymorphic markers
#' with some dataset.
#' \item \code{markers.info}: (character) With default: \code{markers.info = NULL},
#' all the variable in the vcf INFO field are imported.
#' To import only DP (the SNP total read depth) and AF (the SNP allele frequency),
#' use \code{markers.info = c("DP", "AF")}.
#' Using, \code{markers.info = character(0)} will not import INFO variables.
#' \item \code{path.folder}: to write ouput in a specific path
#' (used internally in radiator). Default: \code{path.folder = getwd()}.
#' If the supplied directory doesn't exist, it's created.
#' \item \code{random.seed}: (integer, optional) For reproducibility, set an integer
#' that will be used inside codes that uses randomness. With default,
#' a random number is generated, printed and written in the directory.
#' Default: \code{random.seed = NULL}.
#' \item \code{subsample.markers.stats}: By default, when no filters are
#' requested and that the number of markers is > 200K,
#' 0.2 of markers are randomly selected to generate the
#' statistics (individuals and markers). This is an all-around and
#' reliable number.
#' In doubt, overwrite this value by using 1 (all markers selected) and
#' expect a small computational cost.
#' }
#' @return
#' The function returns a GDS object.
#' @export
#' @rdname read_vcf
#' @author Thierry Gosselin \email{thierrygosselin@@icloud.com}
#' @references Zheng X, Gogarten S, Lawrence M, Stilp A, Conomos M, Weir BS,
#' Laurie C, Levine D (2017). SeqArray -- A storage-efficient high-performance
#' data format for WGS variant calls.
#' Bioinformatics.
#'
#' @references Danecek P, Auton A, Abecasis G et al. (2011)
#' The variant call format and VCFtools.
#' Bioinformatics, 27, 2156-2158.
#' @seealso
#' \code{\link{filter_ld}}
#' \code{\link[radiator]{tidy_genomic_data}}
#' \code{\link[radiator]{tidy_vcf}}
#' @examples
#' \dontrun{
#' # require(SeqArray)
#' # with built-in defaults:
#'
#' data <- radiator::read_vcf(data = "populations.snps.vcf")
#'
#' # Because no strata file is provided, the individuals are assumed to be in
#' # the same group.
#'
#' # If you need to filter the VCF I recommend using ?radiator::filter_rad
#'
#' # But if you're certain about the filters thresholds, additional arguments are available:
#'
#' data <- radiator::read_vcf(
#' data = "populations.snps.vcf",
#' strata = "strata_salamander.tsv",
#' path.folder = "salamander",
#' filter.individuals.missing = "outliers",
#' filter.common.markers = TRUE,
#' filter.strands = "blacklist",
#' filter.ma = 4,
#' filter.genotyping = 0.3,
#' filter.snp.position.read = "outliers",
#' filter.short.ld = "mac",
#' filter.long.ld = NULL,
#' verbose = TRUE
#' )
#'
#' # In a new R session, to re-open the GDS file/connection:
#'
#' data <- radiator::read_rad(data = "radiator_20200911@0748.gds")
#' }
read_vcf <- function(
data,
strata = NULL,
filename = NULL,
vcf.stats = TRUE,
parallel.core = parallel::detectCores() - 1,
verbose = TRUE,
...
) {
# #Test
# data = "populations.snps.vcf"
# strata <- NULL
# filename <- NULL
# parallel.core <- parallel::detectCores() - 1
# verbose <- TRUE
# path.folder = NULL
# internal <- FALSE
# random.seed <- NULL
# markers.info = NULL
# vcf.stats <- TRUE
# vcf.metadata = TRUE
# filter.strands = "blacklist"
# filter.coverage = NULL
# filter.genotyping <- NULL
# filter.snp.position.read <- NULL
# filter.ma <- NULL
# filter.common.markers = FALSE
# filter.monomorphic <- FALSE
# filter.short.ld <- NULL
# filter.long.ld <- NULL
# long.ld.missing <- TRUE
# ld.method <- "r2"
# blacklist.id = NULL
# pop.select = NULL
# pop.levels = NULL
# pop.labels = NULL
# whitelist.markers = NULL
# subsample.markers.stats = 0.2
# parameters <- NULL
# filter.individuals.missing <- NULL
# filter.individuals.coverage.total <- NULL
# filter.individuals.coverage.median <- NULL
# filter.individuals.coverage.iqr <- NULL
# filter.snp.number <- NULL
## With filters
# filter.individuals.missing <- "outliers"
# filter.individuals.coverage.total <- "outliers"
# filter.short.ld <- "mac"
# filter.long.ld <- 0.3
# filter.common.markers = TRUE
# filter.monomorphic <- TRUE
# filter.ma <- 4
# filter.coverage = c(5, 150)
# filter.genotyping <- 0.15
# filter.snp.position.read <- "outliers"
# filter.snp.number <- "outliers"
# Cleanup---------------------------------------------------------------------
radiator_function_header(f.name = "read_vcf", verbose = verbose)
file.date <- format(Sys.time(), "%Y%m%d@%H%M")
if (verbose) message("Execution date@time: ", file.date)
old.dir <- getwd()
opt.change <- getOption("width")
options(future.globals.maxSize = Inf)
options(width = 70)
timing <- radiator_tic()
on.exit(setwd(old.dir), add = TRUE)
on.exit(options(width = opt.change), add = TRUE)
on.exit(radiator_toc(timing, verbose = verbose), add = TRUE)
on.exit(radiator_function_header(f.name = "read_vcf", start = FALSE, verbose = verbose), add = TRUE)
res <- list()
# Required package -----------------------------------------------------------
radiator_packages_dep(package = "SeqArray", cran = FALSE, bioc = TRUE)
# Checking for missing and/or default arguments ------------------------------
if (missing(data)) rlang::abort("vcf file missing")
# Function call and dotslist -------------------------------------------------
rad.dots <- radiator_dots(
func.name = as.list(sys.call())[[1]],
fd = rlang::fn_fmls_names(),
args.list = as.list(environment()),
dotslist = rlang::dots_list(..., .homonyms = "error", .check_assign = TRUE),
keepers = c("whitelist.markers",
"filter.ma", "filter.snp.position.read", "filter.snp.number",
"filter.coverage", "filter.genotyping", "filter.short.ld",
"filter.long.ld", "long.ld.missing", "ld.method",
"filter.individuals.missing", "filter.individuals.coverage.total",
"filter.individuals.coverage.median", "filter.individuals.coverage.iqr",
"filter.common.markers", "filter.monomorphic",
"filter.strands",
"blacklist.id", "pop.select", "pop.levels", "pop.labels",
"path.folder",
"markers.info", "vcf.metadata",
"subsample.markers.stats", "parameters", "random.seed", "internal"),
verbose = FALSE
)
if (!is.null(filter.snp.position.read) ||
!is.null(filter.ma) ||
!is.null(filter.coverage) ||
!is.null(filter.genotyping) ||
!is.null(filter.short.ld) ||
!is.null(filter.long.ld) ||
!is.null(long.ld.missing) ||
!is.null(filter.individuals.missing) ||
!is.null(filter.individuals.coverage.total) ||
!is.null(filter.snp.number)
) subsample.markers.stats <- 1
if (!is.null(ld.method)) {
ld.method <- match.arg(ld.method, c("composite", "r", "r2", "dprime", "corr"))
}
if (!is.null(filter.snp.position.read)) {
filter.snp.position.read <- match.arg(
arg = filter.snp.position.read,
choices = c("outliers", "iqr", "q75"),
several.ok = TRUE)
}
if (is.logical(vcf.metadata)) {
if (vcf.metadata) {
overwrite.metadata <- NULL
} else {
overwrite.metadata <- "GT"
}
} else {#NULL or character
if (is.null(vcf.metadata)) {
overwrite.metadata <- NULL
vcf.metadata <- TRUE
} else {
overwrite.metadata <- vcf.metadata
if (!"GT" %in% overwrite.metadata) {
message("GT field always included in vcf.metadata")
overwrite.metadata <- c("GT", overwrite.metadata)
}
vcf.metadata <- TRUE
}
}
if (is.null(random.seed)) {
random.seed <- sample(x = 1:1000000, size = 1)
set.seed(random.seed)
} else {
set.seed(random.seed)
}
# LD
# currently: will only filter long ld if short ld as been taken care of first...
if (!is.null(filter.long.ld) && is.null(filter.short.ld)) {
message("\nfilter.short.ld argument set by default to: maf")
filter.short.ld <- "mac"
}
# Folders---------------------------------------------------------------------
wf <- path.folder <- generate_folder(
f = path.folder,
rad.folder = "read_vcf",
prefix_int = FALSE,
internal = internal,
file.date = file.date,
verbose = verbose)
radiator.folder <- generate_folder(
f = path.folder,
rad.folder = "import_gds",
prefix_int = TRUE,
internal = internal,
file.date = file.date,
verbose = verbose)
# write the dots file
write_rad(
data = rad.dots,
path = radiator.folder,
filename = stringi::stri_join("radiator_read_vcf_args_", file.date, ".tsv"),
tsv = TRUE,
internal = internal,
write.message = "Function call and arguments stored in: ",
verbose = verbose
)
if (is.null(filename)) {
ind.file <- stringi::stri_join("vcf_individuals_info_", file.date, ".tsv")
markers.file <- stringi::stri_join("vcf_markers_metadata_", file.date, ".tsv")
blacklist.markers <- stringi::stri_join("blacklist.markers_", file.date, ".tsv")
blacklist.id.filename <- stringi::stri_join("blacklist.individuals_", file.date, ".tsv")
} else {
ind.file <- stringi::stri_join(filename, "_vcf_individuals_info_", file.date, ".tsv")
markers.file <- stringi::stri_join(filename, "_vcf_markers_metadata_", file.date, ".tsv")
blacklist.markers <- stringi::stri_join(filename, "_blacklist.markers_", file.date, ".tsv")
blacklist.id.filename <- stringi::stri_join(filename, "_blacklist.individuals_", file.date, ".tsv")
}
filename <- generate_filename(
name.shortcut = filename,
path.folder = radiator.folder,
extension = "gds")
filename.short <- filename$filename.short
filename <- filename$filename
# Random seed ----------------------------------------------------------------
readr::write_lines(x = random.seed, file = file.path(radiator.folder, "random.seed"))
if (verbose) message("File written: random.seed (", random.seed,")")
# VCF: Read ------------------------------------------------------------------
# Get file size
big.vcf <- file.size(data)
if (big.vcf >= 20000000000) {
cli::cli_progress_step("Reading VCF... you have time for a break...")
} else if (big.vcf >= 1000000000) {
cli::cli_progress_step("Reading VCF... you might have time for an espresso!")
} else {
cli::cli_progress_step("Reading VCF")
}
parallel.temp <- parallel.core
# for small VCF SeqArray not that good with parallel...
if (big.vcf < 10000000) parallel.temp <- FALSE
big.vcf <- NULL
# Check for bad header generated by stacks
detect.source <- check_header_source_vcf(data)
stacks.2 <- detect.source$data.source
data.source <- stringi::stri_replace_all_regex(
str = detect.source$check.header$header$value
[detect.source$check.header$header$id == "source"],
pattern = "[\"]", replacement = "", vectorize_all = FALSE)
if (length(data.source) == 0) data.source <- "unknown"
check.header <- detect.source$check.header
dp <- "DP" %in% detect.source$check.header$format$ID # Check that DP is valid
if (!is.null(detect.source$markers.info)) markers.info <- detect.source$markers.info
overwrite.metadata <- NULL # by default
if (!is.null(detect.source$overwrite.metadata)) overwrite.metadata <- detect.source$overwrite.metadata
vcf_read_temp <- function(
data, filename, parallel.temp, check.header, markers.info, overwrite.metadata) {
SeqArray::seqVCF2GDS(
vcf.fn = data,
out.fn = filename,
parallel = parallel.temp,
storage.option = "ZIP_RA",
verbose = FALSE,
header = check.header,
info.import = markers.info, # characters, the variable name(s) in the INFO field for import; or NULL for all variables
fmt.import = overwrite.metadata# characters, the variable name(s) in the FORMAT field for import; or NULL for all variables#
)
}#End vcf_read_temp
safe_vcf_read <- purrr::safely(.f = vcf_read_temp)
data.safe <- safe_vcf_read(data, filename, parallel.temp, check.header, markers.info, overwrite.metadata)
if (is.null(data.safe$error)) {
gds <- SeqArray::seqOpen(gds.fn = data.safe$result, readonly = FALSE)
} else {
cli::cli_progress_step("Reading VCF in parallel resulted in an error")
os <- Sys.info()[['sysname']]
if (os == "Windows") {
cli::cli_progress_step("Common problem with Windows machines")
} else {
cli::cli_progress_step("Not normal with UNIX machines: check your SeqArray installation")
}
cli::cli_progress_step("Reading VCF with parallel turned OFF")
gds <- SeqArray::seqVCF2GDS(
vcf.fn = data,
out.fn = filename,
parallel = FALSE,
storage.option = "ZIP_RA",
verbose = FALSE,
header = check.header,
info.import = markers.info, # characters, the variable name(s) in the INFO field for import; or NULL for all variables
fmt.import = overwrite.metadata# characters, the variable name(s) in the FORMAT field for import; or NULL for all variables#
) %>%
SeqArray::seqOpen(gds.fn = ., readonly = FALSE)
}
#no longer used
vcf_read_temp <- data.safe <- safe_vcf_read <- overwrite.metadata <- NULL
parallel.temp <- check.header <- detect.source <- markers.info <- NULL
cli::cli_progress_done()
if (verbose) message("Analyzing VCF")
# VCF: Summary ----------------------------------------------------------------
# Generate radiator skeleton -------------------------------------------------
radiator.gds <- radiator_gds_skeleton(gds)
# VCF: source ----------------------------------------------------------------
update_radiator_gds(gds = gds, node.name = "data.source", value = data.source)
if (verbose) message("VCF source: ", data.source)
# VCF: bi- or multi-alllelic--------------------------------------------------
biallelic <- detect_biallelic_markers(data = gds, verbose = verbose)
# stacks haplotype VCF...
if (!biallelic && stacks.2) dp <- FALSE
# VCF clean sample id---------------------------------------------------------
individuals.vcf <- tibble::tibble(
INDIVIDUALS_VCF = SeqArray::seqGetData(gds, "sample.id")
) %>%
dplyr::mutate(INDIVIDUALS_CLEAN = radiator::clean_ind_names(INDIVIDUALS_VCF))
if (!identical(individuals.vcf$INDIVIDUALS_VCF, individuals.vcf$INDIVIDUALS_CLEAN)) {
if (verbose) message("Cleaning VCF's sample names")
clean.id.filename <- stringi::stri_join("cleaned.vcf.id.info_", file.date, ".tsv")
readr::write_tsv(
x = individuals.vcf,
file = file.path(radiator.folder, clean.id.filename)
)
if (verbose) message("File written: ", clean.id.filename)
update_radiator_gds(gds = gds, node.name = "id.clean", value = individuals.vcf)
}
# replace id in VCF
update_radiator_gds(
gds = gds,
radiator.gds = FALSE,
node.name = "sample.id",
value = individuals.vcf$INDIVIDUALS_CLEAN,
replace = TRUE)
individuals <- dplyr::select(individuals.vcf, INDIVIDUALS = INDIVIDUALS_CLEAN)
individuals.vcf <- NULL
# VCF sync id with STRATA------------------------------------------------------
strata <- radiator::read_strata(
strata = strata,
pop.levels = pop.levels,
pop.labels = pop.labels,
pop.select = pop.select,
blacklist.id = blacklist.id,
keep.two = FALSE) %$% strata
if (!is.null(strata)) {
id.levels <- individuals$INDIVIDUALS
individuals %<>%
dplyr::left_join(
join_strata(individuals, strata, verbose = verbose) %>%
dplyr::mutate(FILTERS = "whitelist")
, by = "INDIVIDUALS"
) %>%
dplyr::mutate(FILTERS = tidyr::replace_na(data = FILTERS, replace = "filter.stata"))
bl <- dplyr::filter(individuals, FILTERS != "whitelist")
if (nrow(bl) != 0) {
if (verbose) message(" Number of sample blacklisted by the stata: ", nrow(bl))
}
# renaming ?
if (rlang::has_name(individuals, "NEW_ID")) {
if (!identical(id.levels, individuals$INDIVIDUALS)) {
rlang::abort("Wrong id order, contact author")
}
# replace id in GDS
update_radiator_gds(
gds = gds,
radiator.gds = FALSE,
node.name = "sample.id",
value = individuals$NEW_ID,
replace = TRUE)
individuals %<>% dplyr::rename(INDIVIDUALS = NEW_ID, OLD_ID = INDIVIDUALS)
}
} else {
individuals %<>% dplyr::mutate(STRATA = "1pop", FILTERS = "whitelist")
}
strata <- generate_strata(data = dplyr::filter(individuals, FILTERS == "whitelist"), pop.id = FALSE)
# generate integers for individuals and strata -------------------------------
# this makes it easier to sort and play with data (integers takes less mem...)
if (!is.factor(individuals$STRATA)) individuals$STRATA <- factor(individuals$STRATA)
individuals %<>%
dplyr::mutate(
ID_SEQ = seq_len(length.out = dplyr::n()),
STRATA_SEQ = as.integer(factor(x = STRATA, levels = levels(STRATA)))
)
#Update GDS node
update_radiator_gds(gds = gds, node.name = "individuals.meta", value = individuals, sync = TRUE)
# VCF: Markers metadata ------------------------------------------------------
markers.meta <- extract_markers_metadata(gds = gds)
# VCF: reference genome or de novo -------------------------------------------
ref.genome <- detect_ref_genome(data = gds, verbose = verbose)
# Cleaning VCF stacks ipyrad plink etc ..-------------------------------------
message("\nCleaning VCF")
# Stacks specific adjustments
if (!ref.genome) {
if (stacks.2) {
markers.meta %<>%
dplyr::mutate(
LOCUS = CHROM,
CHROM = "1",
COL = as.integer(POS) - 1
)
} else {
markers.meta %<>% dplyr::mutate(CHROM = "1")
}
}
# ipyrad VCF
if (stringi::stri_detect_fixed(str = data.source, pattern = "ipyrad")) {
markers.meta %<>% dplyr::mutate(
LOCUS = stringi::stri_replace_all_fixed(
str = CHROM,
pattern = "locus_",
replacement = "",
vectorize_all = FALSE
),
COL = as.integer(POS))
}
# PLINK vcf
if (stringi::stri_detect_fixed(str = data.source, pattern = "PLINK")) {
markers.meta %<>% dplyr::mutate(
LOCUS = as.integer(factor(x = CHROM)),
COL = 1L)
}
# GATK, platypus, freebayes and samtools specific adjustment
# Locus with NA or . or ""
weird.locus <- length(unique(markers.meta$LOCUS)) <= 1
if (weird.locus && !stacks.2) {
message("LOCUS field empty... adding unique id instead")
markers.meta$LOCUS <- markers.meta$VARIANT_ID
}
# VCF: LOCUS cleaning and Strands detection ----------------------------------
if (isTRUE(unique(stringi::stri_detect_fixed(
str = markers.meta[1,3],
pattern = c("|", "-", ":", ">")
)))
) {
data.source <- "dart.vcf"
update_radiator_gds(gds = gds, node.name = "data.source", value = data.source)
}
if (data.source != "dart.vcf" && stringi::stri_detect_regex(str = markers.meta[1,3], pattern = "[^[:alnum:]]+")) {
locus.sep <- unique(stringi::stri_extract_all_regex(
str = markers.meta[1,3],
pattern = "[^a-zA-Z0-9-+-]",
omit_no_match = TRUE)[[1]])
markers.meta <- suppressWarnings(
tidyr::separate(
data = markers.meta,
col = LOCUS, into = c("LOCUS", "COL", "STRANDS"),
sep = locus.sep,
extra = "drop", fill = "warn",
remove = TRUE, convert = TRUE)
)
if (anyNA(markers.meta$STRANDS)) markers.meta$STRANDS <- NA_character_
locus.sep <- NULL
detect.strand <- any(stringi::stri_detect_fixed(str = unique(markers.meta$STRANDS), pattern = "+"))
if (anyNA(detect.strand)) detect.strand <- FALSE
} else {
detect.strand <- FALSE
}
# VCF: DART LOCUS cleaning----------------------------------------------------
if (data.source == "dart.vcf") {
markers.meta %<>%
dplyr::mutate(
CHROM = stringi::stri_replace_all_fixed(
str = CHROM, pattern = ".", replacement = "denovo", vectorize_all = FALSE),
POS = stringi::stri_replace_na(str = POS, replacement = "50"),
COL = stringi::stri_extract_first_regex(str = LOCUS, pattern = "[-][0-9]+[\\:]"),
COL = stringi::stri_replace_all_fixed(str = COL, pattern = c("-", ":"), replacement = c("", ""), vectorize_all = FALSE),
COL = as.integer(COL),
LOCUS = stringi::stri_extract_first_regex(str = LOCUS, pattern = "^[0-9]+"),
LOCUS = stringi::stri_join(LOCUS, POS, sep = "_"),
POS = COL
)
}
# Generate MARKERS column and fix types --------------------------------------
markers.meta %<>%
dplyr::mutate(
dplyr::across(
.cols = c(CHROM, LOCUS, POS),
.fns = radiator::clean_markers_names
)
) %>%
dplyr::mutate(
MARKERS = stringi::stri_join(CHROM, LOCUS, POS, sep = "__"),
REF = SeqArray::seqGetData(gdsfile = gds, var.name = "$ref"),
ALT = SeqArray::seqGetData(gdsfile = gds, var.name = "$alt")
)
# PLINK file with duplicate markers... sometimes tagged ISOFORMS...
dup.markers <- length(markers.meta$MARKERS) - length(unique(markers.meta$MARKERS))
if (dup.markers > 0) {
message("\nNumber of duplicate MARKERS id: ", dup.markers)
message("Adding integer to differentiate")
markers.meta %<>%
dplyr::arrange(MARKERS) %>%
dplyr::mutate(MARKERS_NEW = MARKERS) %>%
dplyr::group_by(MARKERS_NEW) %>%
dplyr::mutate(
MARKERS = stringi::stri_join(MARKERS, seq(1, n(), by = 1), sep = "_")
) %>%
dplyr::ungroup(.) %>%
dplyr::select(-MARKERS_NEW)
}
# strip markers meta, here we could think of using variant id but let's wait.
markers.meta %<>%
dplyr::mutate(
M_SEQ = as.integer(factor(x = MARKERS, levels = unique(MARKERS)))
)
# ADD MARKERS META to GDS
update_radiator_gds(gds = gds, node.name = "markers.meta", value = markers.meta)
# replace chromosome info in GDS
# Why ? well snp ld e.g. will otherwise be performed by chromosome and with de novo data = by locus...
update_radiator_gds(
gds = gds,
radiator.gds = FALSE,
node.name = "chromosome",
value = markers.meta$CHROM,
replace = TRUE
)
# Filters --------------------------------------------------------------------
# # radiator_parameters: generate --------------------------------------------
filters.parameters <- radiator_parameters(
generate = TRUE,
initiate = FALSE,
update = FALSE,
parameter.obj = parameters,
path.folder = radiator.folder,
file.date = file.date,
verbose = verbose,
internal = internal)
# radiator_parameters: initiate --------------------------------------------
# with original VCF's values
filters.parameters <- radiator_parameters(
generate = FALSE,
initiate = TRUE,
update = TRUE,
parameter.obj = filters.parameters,
data = gds,
filter.name = "vcf",
param.name = "original values in VCF + strata",
values = "",
path.folder = path.folder,
file.date = file.date,
internal = internal,
verbose = verbose
)
# Filter duplicated SNPs on different strands---------------------------------
if (detect.strand) {
blacklist.strands <- markers.meta %>%
dplyr::distinct(VARIANT_ID, MARKERS, CHROM, LOCUS, POS) %>%
dplyr::group_by(CHROM, POS) %>%
dplyr::mutate(n = n()) %>%
dplyr::ungroup(.) %>%
dplyr::filter(n > 1) %>%
dplyr::select(-n) %>%
dplyr::distinct(CHROM, POS, .keep_all = TRUE)
if (nrow(blacklist.strands) > 0) {
if (verbose) {
message("\nDetected ", nrow(blacklist.strands)," duplicate SNPs on different strands (+/-)")
message(" By default radiator prune those SNPs")
message(" To change this behavior, use argument: filter.strands")
}
# if (verbose) message("\nFiltering duplicated markers on different strands")
if (filter.strands == "keep.both") {
message("Keeping both duplicated markers")
}
if (filter.strands == "best.stats") {
# bk variant
variant.bk <- markers.meta$VARIANT_ID
sync_gds(gds = gds, variant.id = blacklist.strands$VARIANT_ID)
blacklist.strands <- SeqArray::seqAlleleCount(
gdsfile = gds,
ref.allele = NULL,
parallel = parallel.core) %>%
unlist(.) %>%
matrix(
data = .,
nrow = nrow(blacklist.strands), ncol = 2, byrow = TRUE,
dimnames = list(rownames = blacklist.strands$VARIANT_ID,
colnames = c("REF_COUNT", "ALT_COUNT"))) %>%
tibble::as_tibble(., rownames = "VARIANT_ID") %>%
tibble::add_column(.data = .,
MARKERS = blacklist.strands$MARKERS,
CHROM = blacklist.strands$CHROM,
POS = blacklist.strands$POS,
MISSING_PROP = SeqArray::seqMissing(
gdsfile = gds,
per.variant = TRUE,
parallel = parallel.core
),
.after = 1) %>%
dplyr::mutate(
MAC = dplyr::if_else(ALT_COUNT < REF_COUNT, ALT_COUNT, REF_COUNT),
ALT_COUNT = NULL, REF_COUNT = NULL) %>%
dplyr::group_by(CHROM, POS) %>%
dplyr::filter(MISSING_PROP == max(MISSING_PROP)) %>%
dplyr::filter(MAC == min(MAC)) %>%
dplyr::ungroup(.) %>%
dplyr::arrange(CHROM, POS) %>%
dplyr::distinct(CHROM, POS, .keep_all = TRUE) %>%
dplyr::select(MARKERS)
# come back to original variant
sync_gds(gds = gds, variant.id = variant.bk, verbose = FALSE)
}# End best stats
if (filter.strands == "blacklist") blacklist.strands %<>% dplyr::distinct(MARKERS)
if (filter.strands != "keep.both") {
# Folders
path.folder.strands <- generate_folder(
f = path.folder,
rad.folder = "filter_duplicated_markers",
internal = FALSE,
file.date = file.date,
verbose = verbose)
# update the blacklist
blacklist.strands %<>%
dplyr::mutate(FILTER = "filter.strands")
write_rad(
data = blacklist.strands,
path = path.folder.strands,
filename = stringi::stri_join("blacklist.duplicated.markers.strands_",
file.date, ".tsv"),
tsv = TRUE, internal = FALSE, verbose = verbose)
# bl.gds <- update_bl_markers(gds = gds, update = blacklist.strands)
# Update markers.meta
markers.meta %<>%
dplyr::mutate(
FILTERS = dplyr::if_else(
MARKERS %in% blacklist.strands$MARKERS, "filter.strands", FILTERS
)
)
# check <- markers.meta
write_rad(
data = dplyr::filter(markers.meta, FILTERS == "whitelist"),
path = path.folder.strands,
filename = stringi::stri_join("whitelist.duplicated.markers.strands_",
file.date, ".tsv"),
tsv = TRUE, internal = FALSE, verbose = verbose)
# Update GDS
update_radiator_gds(
gds = gds,
node.name = "markers.meta",
value = markers.meta,
sync = TRUE)
# Filter parameter file: update
filters.parameters <- radiator_parameters(
generate = FALSE,
initiate = FALSE,
update = TRUE,
parameter.obj = filters.parameters,
data = gds,
filter.name = "filter duplicated markers on different strands",
param.name = "filter.strands",
values = filter.strands,
path.folder = path.folder,
file.date = file.date,
internal = internal,
verbose = verbose)
# results
radiator_results_message(
rad.message = stringi::stri_join("\nFilter duplicated markers on different strands: ",
filter.strands),
filters.parameters,
internal,
verbose
)
}#End filter.strands
}
}#Here
blacklist.strands <- path.folder.strands <- NULL
# Filter the vcf's FILTER column ---------------------------------------------
# markers.meta$FILTER_VCF
filter.vcf <- tibble::tibble(
FILTER_VCF = SeqArray::seqGetData(
gds, "annotation/filter")
) %>%
dplyr::bind_cols(
markers.meta %>%
dplyr::filter(FILTERS == "whitelist") %>%
dplyr::select(VARIANT_ID)
)
filter.check.unique <- unique(filter.vcf$FILTER_VCF)
if (length(filter.check.unique) > 1) {
message("Filtering markers based on VCF FILTER column")
filter.vcf %<>% dplyr::filter(FILTER_VCF != "PASS")
markers.meta %<>%
dplyr::mutate(
FILTERS = dplyr::if_else(
VARIANT_ID %in% filter.vcf$VARIANT_ID,
"vcf filter column",
FILTERS
)
)
update_radiator_gds(
gds = gds,
node.name = "markers.meta",
value = markers.meta,
sync = TRUE)
# radiator_parameters: update
filters.parameters <- radiator_parameters(
generate = FALSE,
initiate = FALSE,
update = TRUE,
parameter.obj = filters.parameters,
data = gds,
filter.name = "vcf filter column",
param.name = "PASS or not",
path.folder = path.folder,
file.date = file.date,
verbose = verbose)
# results ------------------------------------------------------------------
radiator_results_message(
rad.message = stringi::stri_join("\nFilter vcf filter column: PASS or not"),
filters.parameters,
internal,
verbose
)
}
filter.check.unique <- filter.vcf <- markers.meta <- individuals <- NULL
# Filter_monomorphic----------------------------------------------------------
gds <- filter_monomorphic(
data = gds,
filter.monomorphic = filter.monomorphic,
parallel.core = parallel.core,
verbose = FALSE,
parameters = filters.parameters,
path.folder = wf,
internal = FALSE)
# Filter common markers ------------------------------------------------------
gds <- filter_common_markers(
data = gds,
filter.common.markers = filter.common.markers,
fig = TRUE,
parallel.core = parallel.core,
verbose = FALSE,
parameters = filters.parameters,
path.folder = wf,
internal = FALSE)
# Filter Whitelist------------------------------------------------------------
gds <- filter_whitelist(
data = gds,
whitelist.markers = whitelist.markers,
verbose = FALSE,
path.folder = wf,
parameters = filters.parameters,
biallelic = biallelic,
internal = FALSE)
# Filter_individuals----------------------------------------------------------
gds <- filter_individuals(
data = gds,
interactive.filter = FALSE,
filter.individuals.missing = filter.individuals.missing,
filter.individuals.heterozygosity = NULL,
filter.individuals.coverage.total = filter.individuals.coverage.total,
filter.individuals.coverage.median = filter.individuals.coverage.median,
filter.individuals.coverage.iqr = filter.individuals.coverage.iqr,
parallel.core = parallel.core,
verbose = FALSE,
path.folder = wf,
parameters = filters.parameters,
internal = FALSE
)
# Filter MAC----------------------------------------------------------------
gds <- filter_ma(
data = gds,
interactive.filter = FALSE,
ma.stats = ma.stats,
filter.ma = filter.ma,
filename = NULL,
parallel.core = parallel.core,
verbose = FALSE,
parameters = filters.parameters,
path.folder = wf,
internal = FALSE)
# Filter mean coverage------------------------------------------------------
gds <- filter_coverage(
data = gds,
interactive.filter = FALSE,
filter.coverage = filter.coverage,
filename = NULL,
parallel.core = parallel.core,
verbose = FALSE,
parameters = filters.parameters,
path.folder = wf,
internal = FALSE)
# Filter genotyping---------------------------------------------------------
gds <- filter_genotyping(
data = gds,
interactive.filter = FALSE,
filter.genotyping = filter.genotyping,
filename = NULL,
parallel.core = parallel.core,
verbose = FALSE,
parameters = filters.parameters,
path.folder = wf,
internal = FALSE)
# Filter SNP position on the read---------------------------------------------
gds <- filter_snp_position_read(
data = gds,
interactive.filter = FALSE,
filter.snp.position.read = filter.snp.position.read,
filename = NULL,
parallel.core = parallel.core,
verbose = FALSE,
parameters = filters.parameters,
path.folder = wf,
internal = FALSE)
# Filter number of SNPs per locus --------------------------------------------
gds <- filter_snp_number(
data = gds,
interactive.filter = FALSE,
filter.snp.number = filter.snp.number,
filename = NULL,
parallel.core = parallel.core,
verbose = FALSE,
parameters = filters.parameters,
path.folder = wf,
internal = FALSE)
# Filter Linkage disequilibrium --------------------------------------------
gds <- filter_ld(
data = gds,
interactive.filter = FALSE,
filter.short.ld = filter.short.ld,
filter.long.ld = filter.long.ld,
parallel.core = parallel.core,
filename = NULL,
verbose = FALSE,
long.ld.missing = long.ld.missing,
ld.method = ld.method,
parameters = filters.parameters,
path.folder = wf,
internal = FALSE)
# Final Sync GDS -----------------------------------------------------------
if (verbose) message("\nPreparing output files...")
markers.meta <- extract_markers_metadata(gds, whitelist = TRUE)
strata <- extract_individuals_metadata(
gds = gds,
ind.field.select = c("INDIVIDUALS", "STRATA"),
whitelist = TRUE)
sync_gds(gds = gds)
# generate a folder to put the stats...
path.folder <- generate_folder(
f = wf,
rad.folder = "filtered",
internal = FALSE,
file.date = file.date,
verbose = verbose)
# Whitelist
write_rad(data = markers.meta,
path = path.folder,
filename = "whitelist.markers.tsv",
tsv = TRUE,
write.message = "standard",
verbose = verbose)
# blacklist
bl <- extract_markers_metadata(gds, blacklist = TRUE)
if (nrow(bl) > 0) {
write_rad(data = bl,
path = path.folder,
filename = "blacklist.markers.tsv",
tsv = TRUE,
write.message = "standard",
verbose = verbose)
}
# writing the blacklist of id
blacklist.id <- extract_individuals_metadata(gds, blacklist = TRUE)
if (nrow(blacklist.id) > 0) {
write_rad(data = blacklist.id,
path = path.folder,
filename = "blacklist.id.tsv",
tsv = TRUE,
write.message = "standard",
verbose = verbose)
}
# Generate new strata
write_rad(data = strata,
path = path.folder,
filename = "strata.filtered.tsv",
tsv = TRUE,
write.message = "standard",
verbose = verbose)
#----------------------------------- VCF STATS ----------------------------
if (vcf.stats) {
message("\nVCF statistics per individuals and markers")
# check for problem with large vector size
n.markers <- length(markers.meta$VARIANT_ID)
n.ind <- as.numeric(length(strata$INDIVIDUALS))
n.snp <- as.numeric(n.markers) # numeric required it's not just a bk
large.data <- n.ind * n.snp
vector.size.limit <- 2^31
if (n.markers < 200000) subsample.markers.stats <- 1
if (large.data > vector.size.limit) {
if (subsample.markers.stats == 1) message("Large VCF: markers subsampling turned ON for coverage statistics")
# check subsample
if (!n.snp * subsample.markers.stats * n.ind < vector.size.limit) {
subsample.markers.stats <- 0.3
if (!n.snp * subsample.markers.stats * n.ind < vector.size.limit) {
subsample.markers.stats <- 0.2
if (!n.snp * subsample.markers.stats * n.ind < vector.size.limit) {
subsample.markers.stats <- 0.1
if (!n.snp * subsample.markers.stats * n.ind < vector.size.limit) {
subsample.markers.stats <- 0.05
}
}
}
}
}
large.data <- vector.size.limit <- NULL
# with huge dataset subsampling is faster and still very reliable for overall
# statistics like the ones generated here...
# SUBSAMPLE markers
if (subsample.markers.stats < 1) {
message("Using a subsample proportion of markers: ", subsample.markers.stats)
markers.subsampled <- dplyr::sample_frac(
tbl = markers.meta,
size = subsample.markers.stats)
variant.select <- markers.subsampled$VARIANT_ID
subsample.filename <- stringi::stri_join("markers.subsampled_", file.date, ".tsv")
dplyr::select(markers.subsampled, MARKERS) %>%
dplyr::mutate(RANDOM_SEED = random.seed) %>%
readr::write_tsv(
x = .,
file = file.path(path.folder, subsample.filename))
markers.subsampled <- NULL
} else {
variant.select <- NULL
}
# Individuals and markers stats --------------------------------------------
i.m.stats <- generate_stats(
gds = gds,
subsample = variant.select,
exhaustive = TRUE,
force.stats = TRUE,
path.folder = path.folder,
plot = TRUE,
digits = 6,
file.date = file.date,
parallel.core = parallel.core,
verbose = TRUE
)
# For summary at the end of the function:
ind.missing <- round(i.m.stats$i.stats$MEAN[i.m.stats$i.stats$GROUP == "missing genotypes"], 2)
if (dp) ind.cov.total <- round(i.m.stats$i.stats$MEAN[i.m.stats$i.stats$GROUP == "total coverage"], 0)
if (dp) ind.cov.mean <- round(i.m.stats$i.stats$MEAN[i.m.stats$i.stats$GROUP == "mean coverage"], 0)
markers.missing <- round(i.m.stats$m.stats$MEAN[i.m.stats$m.stats$GROUP == "missing genotypes"], 2)
if (dp) markers.cov <- round(i.m.stats$m.stats$MEAN[i.m.stats$m.stats$GROUP == "mean coverage"], 0) # same as above because NA...
}#End stats
#RESULTS --------------------------------------------------------------------
number.info <- SeqArray::seqGetFilter(gdsfile = gds)
if (verbose) {
cat("################################### SUMMARY ####################################\n")
if (vcf.stats) {
message("\nVCF summary\nMissing data: ")
message(" markers: ", markers.missing)
message(" individuals: ", ind.missing)
if (dp) message("\n\nCoverage info:")
if (dp) message(" individuals mean total coverage: ", ind.cov.total)
if (dp) message(" individuals mean genotype coverage: ", ind.cov.mean)
if (dp) message(" markers mean coverage: ", markers.cov)
}
}
message("\nVCF info:")
message("Number of chromosome/contig/scaffold: ", length(unique(markers.meta$CHROM)))
message("Number of locus: ", length(unique(markers.meta$LOCUS)))
message("Number of markers: ", length(number.info$variant.sel[number.info$variant.sel]))
summary_strata(strata)
i.m.stats <- markers.meta <- NULL
message("radiator Genomic Data Structure (GDS) file: ", folder_short(filename))
return(gds)
} # End read_vcf
# tidy_vcf ---------------------------------------------------------------------
#' @name tidy_vcf
#' @title Tidy vcf file
#' @description The function allows to tidy a VCF file.
#'
#' Used internally in
#' \href{https://github.com/thierrygosselin/radiator}{radiator}
#' and might be of interest for users.
#'
#' It is highly recommended to use \code{\link[radiator]{filter_rad}} to reduce
#' the number of markers. Advance options below are also available to
#' to manipulate and prune the dataset with blacklists and whitelists along
#' several other filtering options.
#'
#' @param data (VCF file, character string) The VCF SNPs are biallelic or haplotypes.
#' To make the VCF population-ready, the argument \code{strata} is required.
#'
#'
#' @inheritParams radiator_common_arguments
#' @inheritParams tidy_genomic_data
#' @param ... (optional) To pass further argument for fine-tuning the tidying
#' (read below).
#' @export
#' @rdname tidy_vcf
#' @return The output in your global environment is a tidy data frame, the GDS file
#' generated is in the working directory under the name given during function execution.
#' @section VCF file format:
#'
#' \strong{PLINK:} radiator fills the \code{LOCUS} column of PLINK VCFs with
#' a unique integer based on the \code{CHROM} column
#' (\code{as.integer(factor(x = CHROM))}).
#' The \code{COL} column is filled with 1L for lack of bettern info on this.
#' Not what you need ? Open an issue on GitHub for a request.
#'
#' \strong{ipyrad:} the pattern \code{locus_} in the \code{CHROM} column
#' is removed and used. The \code{COL} column is filled with the same value as
#' \code{POS}.
#'
#' \strong{GATK:} Some VCF have an \code{ID} column filled with \code{.},
#' the LOCUS information is all contained along the linkage group in the
#' \code{CHROM} column. To make it work with
#' \href{https://github.com/thierrygosselin/radiator}{radiator},
#' the \code{ID} column is filled with the \code{POS} column info.
#' GATK with a mix of multi- and bi-allelic dataset won't generate VCF stats.
#'
#' \strong{platypus:} Some VCF files don't have an ID filed with values,
#' here the same thing is done as GATK VCF files above.
#'
#' \strong{freebayes:} Some VCF files don't have an ID filed with values,
#' here the same thing is done as GATK VCF files above.
#'
#' \strong{stacks:} with \emph{de novo} approaches, the CHROM column is
#' filled with "1", the LOCUS column correspond to the CHROM section in stacks VCF and
#' the COL column is POS -1. With a reference genome, the ID column in stacks VCF is
#' separated into "LOCUS", "COL", "STRANDS".
#'
#' \emph{stacks problem: } current version as some intrinsic problem with
#' missing allele depth info, during the tidying process a message will
#' highlight the number of genotypes impacted by the problem. When possible, the
#' problem is corrected by adding the read depth info into the allele depth field.
#' @section Advance mode, using \emph{dots-dots-dots ...}:
#'
#' The arguments below are not available using code completion (e.g. with TAB),
#' consequently any misspelling will generate an error or be ignored.
#'
#' \emph{dots-dots-dots ...} arguments names and values are reported and written
#' in the working directory when \code{internal = FALSE} and \code{verbose = TRUE}.
#'
#' \strong{General arguments: }
#' \enumerate{
#' \item \code{path.folder}: to write ouput in a specific path
#' (used internally in radiator). Default: \code{path.folder = getwd()}.
#' If the supplied directory doesn't exist, it's created.
#' \item \code{internal: } (optional, character)
#' Default (\code{internal = FALSE}). A folder is generated to write the files.
#' \item \code{random.seed}: (integer, optional) For reproducibility, set an integer
#' that will be used inside codes that uses randomness. With default,
#' a random number is generated, printed and written in the directory.
#' Default: \code{random.seed = NULL}.
#' \item \code{parameters} It's a parameter file where radiator output results of
#' filtering. Used internally.
#' Default: \code{parameters = NULL}.
#' }
#' \strong{tidying arguments/behavior:}
#' \enumerate{
#' \item \code{tidy.vcf:} (optional, logical)
#' Default: \code{tidy.vcf = TRUE}. But you can always stop the process after
#' the creation of the GDS file (equivalent of running \code{\link{read_vcf}}).
#' \item \code{tidy.check:} (optional, logical)
#' Default: \code{tidy.check = TRUE}. By default, the number of markers just before
#' tidying is checked. Tidying a VCF file with more than 20000 markers is
#' sub-optimal:
#' \itemize{
#' \item a computer with lots of RAM is required
#' \item it's very slow to generate
#' \item it's very slow to run codes after
#' \item for most non model species this number of markers is not realistic...
#' }
#' Consequently, the function execution is suspended and user are asked if they
#' still want to continue with the tidying or stop and keep the GDS file/object.
#'
#' This behavior can be annoying, \emph{if the user knows what he's doing}, to turn off
#' use: \code{tidy.check = FALSE}.
#' \item \code{calibrate.alleles: } (optional, logical)
#' Default: \code{calibrate.alleles = FALSE}.
#' Documented in \code{\link[radiator]{calibrate_alleles}}.
#' \item \code{vcf.stats: } (optional, logical) Generates individuals and
#' markers statistics helpful for filtering.
#' These are very fast to generate and because computational
#' cost is minimal, even for huge VCFs, the default is \code{vcf.stats = TRUE}.
#' \item \code{vcf.metadata: } (optional, logical or character string)
#' With \code{vcf.metadata = FALSE}, only the genotypes are kept (GT field)
#' in the tidy dataset.
#' With \code{vcf.metadata = TRUE},
#' all the metadata contained in the \code{FORMAT} field will be kept in
#' the tidy data file. radiator is currently keeping and cleaning these metadata:
#' \code{"DP", "AD", "GL", "PL", "GQ", "HQ", "GOF", "NR", "NV", "CATG"}.
#' e.g. you only want AD and PL, \code{vcf.metadata = c("AD", "PL")}.
#' Need another metadata ? Submit a request on github...
#' Default: \code{vcf.metadata = TRUE}.
#' }
#'
#' \strong{Filtering arguments:}
#' \enumerate{
#' \item \code{blacklist.id: } (optional, character)
#' Default (\code{blacklist.id = NULL}).
#' Documented in \code{\link[radiator]{tidy_genomic_data}}.
#' \item \code{filter.strands}: (optional, character)
#' Default (\code{filter.strands = "blacklist"}).
#' documented in \code{\link[radiator]{read_vcf}}.
#' \item \code{whitelist.markers: }(optional, path)
#' Default: \code{whitelist.markers = NULL}.
#' Documented in \code{\link[radiator]{filter_whitelist}}.
#' \item \code{filter.individuals.missing}: (double)
#' Default: \code{filter.individuals.missing = NULL}.
#' Documented in \code{\link[radiator]{filter_individuals}}.
#' \item \code{filter.monomorphic}: (logical)
#' Default: \code{filter.monomorphic = TRUE}.
#' Documented in \code{\link[radiator]{filter_monomorphic}}.
#' Required package: \code{UpSetR}.
#' \item \code{filter.common.markers}: (logical)
#' Default: \code{filter.common.markers = TRUE}.
#' Documented in \code{\link[radiator]{filter_common_markers}}.
#' Required package: \code{UpSetR}.
#' \item \code{filter.ma}: (integer)
#' Default: \code{filter.ma = NULL}.
#' Documented in \code{\link[radiator]{filter_ma}}.
#' \item \code{filter.coverage}: (logical)
#' Default: \code{filter.coverage = NULL}.
#' Documented in \code{\link[radiator]{filter_coverage}}.
#' \item \code{filter.genotyping}: (integer)
#' Default: \code{filter.genotyping = NULL}.
#' Documented in \code{\link[radiator]{filter_genotyping}}.
#' \item \code{filter.snp.position.read: } (optional, character, integer)
#' Default: \code{filter.snp.position.read = NULL}.
#' Documented in \code{\link[radiator]{filter_snp_position_read}}.
#' \item \code{filter.snp.number: } (optional, character, integer)
#' Default: \code{filter.snp.number = NULL}.
#' Documented in \code{\link[radiator]{filter_snp_number}}.
#' \item \code{filter.short.ld}: (optional, character)
#' Default: \code{filter.short.ld = NULL}.
#' Documented in \code{\link[radiator]{filter_ld}}.
#' \item \code{filter.long.ld}: (optional, character)
#' Default: \code{filter.long.ld = NULL}.
#' Documented in \code{\link[radiator]{filter_ld}}.
#' Required package: \code{SNPRelate}.
#' \item \code{long.ld.missing}: Documented in \code{\link[radiator]{filter_ld}}.
#' Default: \code{long.ld.missing = FALSE}.
#' \item \code{ld.method}: Documented in \code{\link[radiator]{filter_ld}}.
#' Default: \code{ld.method = "r2"}.
#' }
#'
#'
#' @examples
#' \dontrun{
#' # very basic with built-in defaults (not recommended):
#' prep.data <- radiator::tidy_vcf(data = "populations.snps.vcf")
#'
#' # Using more arguments and filters (recommended):
#' tidy.data <- radiator::tidy_vcf(
#' data = "populations.snps.vcf",
#' strata = "strata_salamander.tsv",
#' filter.individuals.missing = "outlier",
#' filter.ma = 4,
#' filter.genotyping = 0.1,
#' filter.snp.position.read = "outliers",
#' filter.short.ld = "mac",
#' path.folder = "salamander/prep_data",
#' verbose = TRUE)
#' }
#' @seealso \code{\link[radiator]{read_vcf}},
#' \code{\link[radiator]{tidy_genomic_data}},
#' \code{\link[radiator]{filter_rad}}
#' @references Danecek P, Auton A, Abecasis G et al. (2011)
#' The variant call format and VCFtools.
#' Bioinformatics, 27, 2156-2158.
#' @author Thierry Gosselin \email{thierrygosselin@@icloud.com}
tidy_vcf <- function(
data,
strata = NULL,
filename = NULL,
parallel.core = parallel::detectCores() - 1,
verbose = FALSE,
...) {
# # test
# data = "populations.snps.vcf"
# strata = "spis-popmap-448samples.tsv"
# vcf.stats = TRUE
# vcf.metadata = TRUE
# parallel.core = 8
# verbose = TRUE
# blacklist.id = NULL
# whitelist.markers = NULL
# filename = NULL
# internal = FALSE
# path.folder = NULL
# filter.individuals.missing = "outlier"
# filter.common.markers = TRUE
# filter.monomorphic = TRUE
# filter.strands = FALSE
# filter.ma = 4
# filter.coverage = TRUE
# filter.genotyping = 10
# filter.snp.position.read = "outliers"
# filter.short.ld = "maf"
# filter.long.ld = 0.8
# gt.vcf.nuc = TRUE
# gt.vcf = TRUE
# gt = TRUE
# gt.bin = TRUE
# wide = FALSE
# calibrate.alleles = FALSE
# random.seed = NULL
# parameters = NULL
# subsample.markers.stats <- 1
# Cleanup---------------------------------------------------------------------
file.date <- format(Sys.time(), "%Y%m%d@%H%M")
if (verbose) message("Execution date@time: ", file.date)
old.dir <- getwd()
opt.change <- getOption("width")
options(width = 70)
timing <- radiator_tic()
res <- list()
#back to the original directory and options
on.exit(setwd(old.dir), add = TRUE)
on.exit(options(width = opt.change), add = TRUE)
on.exit(radiator_toc(timing), add = TRUE)
on.exit(radiator_function_header(f.name = "tidy_vcf", start = FALSE, verbose = verbose), add = TRUE)
# Required package -----------------------------------------------------------
radiator_packages_dep(package = "SeqArray", cran = FALSE, bioc = TRUE)
# Checking for missing and/or default arguments ------------------------------
if (missing(data)) rlang::abort("vcf file missing")
# Function call and dotslist -------------------------------------------------
rad.dots <- radiator_dots(
func.name = as.list(sys.call())[[1]],
fd = rlang::fn_fmls_names(),
args.list = as.list(environment()),
dotslist = rlang::dots_list(..., .homonyms = "error", .check_assign = TRUE),
keepers = c(
"blacklist.id",
"filter.individuals.missing",
"filter.individuals.coverage.total",
"filter.common.markers",
"filter.monomorphic",
"filter.ma", "ma.stats",
"filter.coverage",
"filter.genotyping",
"filter.snp.position.read",
"filter.snp.number",
"filter.short.ld", "filter.long.ld", "long.ld.missing", "ld.method",
"filter.strands",
"random.seed",
"path.folder",
"parameters",
"gt", "gt.bin", "gt.vcf", "gt.vcf.nuc",
"calibrate.alleles",
"vcf.metadata", "vcf.stats",
"whitelist.markers",
"internal",
"tidy.check", "tidy.vcf"
),
verbose = FALSE
)
if (!is.null(filter.snp.position.read) ||
!is.null(filter.ma) ||
!is.null(filter.coverage) ||
!is.null(filter.genotyping) ||
!is.null(filter.short.ld) ||
!is.null(filter.long.ld) ||
!is.null(long.ld.missing) ||
!is.null(filter.individuals.missing) ||
!is.null(filter.individuals.coverage.total) ||
!is.null(filter.snp.number)) subsample.markers.stats <- 1
if (!is.null(ld.method)) {
ld.method <- match.arg(ld.method, c("composite", "r", "r2", "dprime", "corr"))
}
if (!is.null(filter.snp.position.read)) {
filter.snp.position.read <- match.arg(
arg = filter.snp.position.read,
choices = c("outliers", "iqr", "q75"),
several.ok = TRUE)
}
if (is.logical(vcf.metadata)) {
if (vcf.metadata) {
overwrite.metadata <- NULL
} else {
overwrite.metadata <- "GT"
}
} else {#NULL or character
if (is.null(vcf.metadata)) {
overwrite.metadata <- NULL
vcf.metadata <- TRUE
} else {
overwrite.metadata <- vcf.metadata
if (!"GT" %in% overwrite.metadata) {
message("GT field always included in vcf.metadata")
overwrite.metadata <- c("GT", overwrite.metadata)
}
vcf.metadata <- TRUE
}
}
if (is.null(random.seed)) {
random.seed <- sample(x = 1:1000000, size = 1)
set.seed(random.seed)
} else {
set.seed(random.seed)
}
# LD
# currently: will only filter long ld if short ld as been taken care of first...
if (!is.null(filter.long.ld) && is.null(filter.short.ld)) {
message("\nfilter.short.ld argument set by default to: maf")
filter.short.ld <- "mac"
}
if (is.null(tidy.vcf)) tidy.vcf <- TRUE
if (is.null(tidy.check)) tidy.check <- TRUE
# Folders---------------------------------------------------------------------
wf <- path.folder <- generate_folder(
f = path.folder,
rad.folder = "tidy_vcf",
prefix_int = FALSE,
internal = internal,
file.date = file.date,
verbose = verbose)
# import VCF -----------------------------------------------------------------
data <- radiator::read_vcf(
data = data,
strata = strata,
filename = filename,
verbose = FALSE,
parallel.core = parallel.core,
internal = TRUE,
vcf.stats = vcf.stats,
vcf.metadata = vcf.metadata,
path.folder = path.folder,
random.seed = random.seed,
parameters = parameters,
blacklist.id = blacklist.id,
filter.strands = filter.strands,
filter.monomorphic = filter.monomorphic,
filter.common.markers = filter.common.markers,
whitelist.markers = whitelist.markers,
filter.individuals.missing = filter.individuals.missing,
filter.individuals.coverage.total = filter.individuals.coverage.total,
filter.ma = filter.ma,
filter.coverage = filter.coverage,
filter.genotyping = filter.genotyping,
filter.snp.position.read = filter.snp.position.read,
filter.snp.number = filter.snp.number,
filter.short.ld = filter.short.ld,
filter.long.ld = filter.long.ld,
long.ld.missing = long.ld.missing,
ld.method = ld.method
)
# tidy_vcf folder ------------------------------------------------------------
tidy.folder <- generate_folder(
f = path.folder,
rad.folder = "tidy_vcf",
prefix_int = TRUE,
internal = FALSE,
file.date = file.date,
verbose = verbose)
# write the dots file: after the GDS import...
write_rad(
data = rad.dots,
path = tidy.folder,
filename = stringi::stri_join("radiator_tidy_vcf_args_", file.date, ".tsv"),
tsv = TRUE,
internal = internal,
verbose = verbose
)
# Random seed ----------------------------------------------------------------
readr::write_lines(x = random.seed, file = file.path(tidy.folder, "random.seed"))
if (verbose) message("File written: random.seed (", random.seed,")")
# Tidy the data --------------------------------------------------------------
if (!tidy.vcf) return(data)
# Get info markers and individuals -----------------------------------------
markers.meta <- extract_markers_metadata(gds = data, markers.meta.select = "M_SEQ", whitelist = TRUE)
n.markers <- nrow(markers.meta)
check <- extract_markers_metadata(gds = data, whitelist = TRUE)
# STRATEGY -----------------------------------------------------------------
if (tidy.check && n.markers > 20000) {
cat("\n\n################################## IMPORTANT ###################################\n")
message("Tidying vcf with ", n.markers, " SNPs is not optimal:")
message(" 1. a computer with lots of RAM is required")
message(" 2. it's very slow to generate")
message(" 3. it's very slow to run codes after")
message(" 4. for most non model species this number of markers is not realistic...")
message("\nRecommendations:")
message(" 1. use advance features available in this function (read doc)")
message(" 2. filter your dataset. e.g. with filter_rad")
message("\nIdeally target a maximum of ~ 10 000 - 20 0000 unlinked SNPs\n")
if (n.markers > 20000) tidy.vcf <- FALSE
tidy.vcf <- radiator_question(
x = "\nContinue tidying the VCF (y/n) ?",
answer.opt = c("Y", "N", "Yes", "No", "YES", "NO", "yes", "no", "y", "n"))
if (any(c("y", "Y", "Yes", "YES", "yes") %in% tidy.vcf)) {
message("Tidying the large vcf...")
} else {
message("\nKeeping the GDS object/file")
return(data)
}
}
# Print genotypes tidying
gt.bin <- TRUE
message("\nGenotypes formats generated with ", n.markers, " SNPs: ")
message(" GT_BIN (the dosage of ALT allele: 0, 1, 2 NA): ", gt.bin)
message(" GT_VCF (the genotype coding VCFs: 0/0, 0/1, 1/1, ./.): ", gt.vcf)
message(" GT_VCF_NUC (the genotype coding in VCFs, but with nucleotides: A/C, ./.): ", gt.vcf.nuc)
message(" GT (the genotype coding 'a la genepop': 001002, 001001, 000000): ", gt)
if (!is.null(blacklist.id)) calibrate.alleles <- TRUE
tidy.data <- gds2tidy(
gds = data,
markers.meta = markers.meta,
strip.rad = TRUE,
calibrate.alleles = FALSE # not done here
)
# bi- or multi-alllelic VCF ------------------------------------------------
biallelic <- detect_biallelic_markers(data = data)
if (is.logical(vcf.metadata)) {
overwrite.metadata <- "GT"
if (vcf.metadata) overwrite.metadata <- NULL
} else {#NULL or character
if (is.null(vcf.metadata)) {
overwrite.metadata <- NULL
vcf.metadata <- TRUE
} else {
overwrite.metadata <- vcf.metadata
if (!"GT" %in% overwrite.metadata) {
message("GT field always included in vcf.metadata")
overwrite.metadata <- c("GT", overwrite.metadata)
}
vcf.metadata <- TRUE
}
}
# stacks VCF haplotypes
if (!biallelic) {
data.source <- radiator::extract_data_source(gds = data)
if (stringi::stri_detect_fixed(str = data.source, pattern = "Stacks")) {
overwrite.metadata <- "GT"
}
}
# vcf.metadata ---------------------------------------------------------------
if (vcf.metadata) {
if (verbose) message("\nKeeping vcf genotypes metadata: yes")
# detect FORMAT fields available
have <- SeqArray::seqSummary(
gdsfile = data,
varname = "annotation/format",
check = "none", verbose = FALSE)$ID
if (length(have) > 0) {
want <- c("DP", "AD", "GL", "PL", "HQ", "GQ", "GOF", "NR", "NV", "CATG")
if (!is.null(overwrite.metadata)) want <- overwrite.metadata
parse.format.list <- purrr::keep(.x = have, .p = have %in% want)
# work on parallelization of this part
meta <- parse_gds_metadata(x = parse.format.list, gds = data, strip.rad = TRUE, verbose = verbose)
tidy.data %<>% dplyr::left_join(meta, by = intersect(colnames(tidy.data), colnames(meta)))
meta <- NULL
} else {
if (verbose) message(" genotypes metadata: none found")
vcf.metadata <- FALSE
}
# Check stacks AD problem
# Some genotypes with missing AD...
# NOTE TO MYSELF: might be faster to screen stacks here in data.source...
if (rlang::has_name(tidy.data, "C_DEPTH")) {
ref <- extract_markers_metadata(gds = data, markers.meta.select = c("M_SEQ", "REF", "ALT"), whitelist = TRUE)
tidy.data %<>% dplyr::left_join(ref, by = "M_SEQ")
ref <- NULL
tidy.data %<>%
dplyr::mutate(
ALLELE_REF_DEPTH = dplyr::case_when(
REF == "C" ~ C_DEPTH,
REF == "A" ~ A_DEPTH,
REF == "T" ~ T_DEPTH,
REF == "G" ~ G_DEPTH
),
ALLELE_ALT_DEPTH = dplyr::case_when(
ALT == "C" ~ C_DEPTH,
ALT == "A" ~ A_DEPTH,
ALT == "T" ~ T_DEPTH,
ALT == "G" ~ G_DEPTH
)
)
# temp <- tidy.data %>%
# dplyr::group_by(M_SEQ) %>%
# dplyr::summarise(
# A_DEPTH = sum(A_DEPTH, na.rm = TRUE),
# C_DEPTH = sum(C_DEPTH, na.rm = TRUE),
# G_DEPTH = sum(G_DEPTH, na.rm = TRUE),
# T_DEPTH = sum(T_DEPTH, na.rm = TRUE)
# ) %>%
# radiator::rad_long(cols = "M_SEQ", names_to = "ACGT_DEPTH", values_to = "DEPTH", tidy = TRUE) %>%
# dplyr::arrange(M_SEQ, -DEPTH) %>%
# dplyr::group_by(M_SEQ) %>%
# dplyr::slice_head(n = 2) %>%
# dplyr::ungroup() #%>% dplyr::bind_rows()
#
#
# # check that we have 2
# n.row <- nrow(temp)
# check <- length(unique(temp$M_SEQ))
#
# if (check != n.row / 2) {
# rlang::abort("Contact author problem with Allele depth")
# }
#
# temp %<>%
# dplyr::mutate(
# NUCLEOTIDE = stringi::stri_sub(ACGT_DEPTH, from = 1, to = 1),
# ACGT_DEPTH = NULL,
# ALLELE = rep(x = c("REF", "ALT"), times = n.row / 2),
# ALLELE_DEPTH = rep(x = c("ALLELE_REF_DEPTH", "ALLELE_ALT_DEPTH"), times = n.row / 2)
# ) %>%
# dplyr::group_by(M_SEQ) %>%
# radiator::rad_wide(names_from = c("ALLELE", "ALLELE_DEPTH"), values_from = c("NUCLEOTIDE", "DEPTH"), tidy = TRUE) %>%
# dplyr::rename(REF = NUCLEOTIDE_REF_ALLELE_REF_DEPTH, ALT = NUCLEOTIDE_ALT_ALLELE_ALT_DEPTH, ALLELE_REF_DEPTH = DEPTH_REF_ALLELE_REF_DEPTH, ALLELE_ALT_DEPTH = DEPTH_ALT_ALLELE_ALT_DEPTH)
#
#
# tidy.data %<>% dplyr::left_join(temp, by = "M_SEQ")
# n.row <- check <- temp <- NULL
}#catg.depth
if (all(rlang::has_name(tidy.data, c("ALLELE_REF_DEPTH", "ALLELE_ALT_DEPTH", "READ_DEPTH")))) {
stacks.ad <- tidy.data %>%
dplyr::select(READ_DEPTH, ALLELE_REF_DEPTH, ALLELE_ALT_DEPTH) %>%
dplyr::filter(!is.na(READ_DEPTH)) %>%
dplyr::filter(is.na(ALLELE_REF_DEPTH) & is.na(ALLELE_ALT_DEPTH))
stacks.prob <- nrow(stacks.ad)
if (stacks.prob > 0) {
non.missing.gt <- length(tidy.data$GT_BIN[!is.na(tidy.data$GT_BIN)])
stacks.ad.prop <- round(stacks.prob / non.missing.gt, 4)
message("\n\nStacks problem detected")
message(" missing allele depth info")
message(" number of genotypes with problem: ", stacks.prob, " (prop: ", stacks.ad.prop,")")
message(" correcting problem by adding the read depth info into AD fields...\n\n")
stacks.ad <- dplyr::select(tidy.data, GT_BIN) %>%
dplyr::bind_cols(dplyr::select(tidy.data, READ_DEPTH, ALLELE_REF_DEPTH, ALLELE_ALT_DEPTH)) %>%
dplyr::mutate(COL_ID = seq(1, n())) %>%
dplyr::filter(!is.na(READ_DEPTH)) %>%
dplyr::filter(is.na(ALLELE_REF_DEPTH) & is.na(ALLELE_ALT_DEPTH)) %>%
dplyr::mutate(
ALLELE_REF_DEPTH = dplyr::if_else(GT_BIN == 0, READ_DEPTH, ALLELE_REF_DEPTH),
ALLELE_ALT_DEPTH = dplyr::if_else(GT_BIN == 2, READ_DEPTH, ALLELE_ALT_DEPTH)
)
tidy.data$ALLELE_REF_DEPTH[stacks.ad$COL_ID] <- stacks.ad$ALLELE_REF_DEPTH
tidy.data$ALLELE_ALT_DEPTH[stacks.ad$COL_ID] <- stacks.ad$ALLELE_ALT_DEPTH
}
stacks.prob <- stacks.prob <- NULL
}
}
# re-calibration of ref/alt alleles ------------------------------------------
if (calibrate.alleles) {
tidy.data %<>% radiator::calibrate_alleles(data = ., biallelic = biallelic, verbose = verbose) %$% input
}
# Join back the info ---------------------------------------------------------
tidy.data %<>%
join_rad(
x = .,
s = extract_individuals_metadata(gds = data, whitelist = TRUE),
m = extract_markers_metadata(gds = data, whitelist = TRUE),
g = NULL,
env.arg = rlang::current_env()
)
filename.rad <- generate_filename(
path.folder = tidy.folder,
extension = "rad")
write_rad(data = tidy.data, path = filename.rad$filename)
if (verbose) message("Updating GDS with genotypes.meta values")
update_radiator_gds(gds = data, node.name = "genotypes.meta", value = tidy.data)
message("\nTidy data file written: ", filename.rad$filename.short)
if (verbose) message("Closing GDS file connection")
SeqArray::seqClose(data) # close the connection
return(tidy.data)
}#End tidy_vcf
# Internal nested Function -----------------------------------------------------
# write_vcf---------------------------------------------------------------------
# write a vcf file from a tidy data frame
#' @name write_vcf
#' @title Write a vcf file from a tidy data frame
#' @description Write a vcf file (file format version 4.3, see details below)
#' from a tidy data frame.
#' Used internally in \href{https://github.com/thierrygosselin/radiator}{radiator}
#' and might be of interest for users.
#' @param data A tidy data frame object in the global environment or
#' a tidy data frame in wide or long format in the working directory.
#' \emph{How to get a tidy data frame ?}
#' Look into \pkg{radiator} \code{\link{tidy_genomic_data}}.
#' @param source source of vcf
#' @param empty generate an empty vcf
#' @param pop.info (optional, logical) Should the population information be
#' included in the FORMAT field (along the GT info for each samples ?). To make
#' the VCF population-ready use \code{pop.info = TRUE}. The population information
#' must be included in the \code{POP_ID} column of the tidy dataset.
#' Default: \code{pop.info = FALSE}. Experimental.
#' @param filename (optional) The file name prefix for the vcf file
#' written to the working directory. With default: \code{filename = NULL},
#' the date and time is appended to \code{radiator_vcf_file_}.
#' @details \strong{VCF file format version:}
#'
#' If you need a different file format version than the current one, just change
#' the version inside the newly created VCF, that should do the trick.
#' \href{https://vcftools.github.io/specs.html}{For more
#' information on Variant Call Format specifications}.
#' @export
#' @rdname write_vcf
#' @references Danecek P, Auton A, Abecasis G et al. (2011)
#' The variant call format and VCFtools.
#' Bioinformatics, 27, 2156-2158.
#' @author Thierry Gosselin \email{thierrygosselin@@icloud.com}
write_vcf <- function(
data,
pop.info = FALSE,
filename = NULL,
source = NULL,
empty = FALSE
) {
file.date <- format(Sys.time(), "%Y%m%d@%H%M")
if (empty) {
output <- tibble::tibble(
'#CHROM' = character(0),
POS = integer(0),
ID = integer(0),
REF = character(0),
ALT = character(0),
QUAL = character(0),
FILTER = character(0),
INFO = character(0),
FORMAT = character(0)
)
} else {
# Import data ---------------------------------------------------------------
if (is.vector(data)) data <- radiator::tidy_wide(data = data, import.metadata = TRUE)
# REF/ALT Alleles and VCF genotype format ------------------------------------
if (!tibble::has_name(data, "GT_VCF")) {
data %<>% radiator::calibrate_alleles(data = ., gt.vcf = TRUE, gt.vcf.nuc = TRUE) %$% input
}
# Include CHROM, LOCUS, POS --------------------------------------------------
if (!tibble::has_name(data, "CHROM")) {
data %<>%
dplyr::mutate(
CHROM = rep("1", n()),
LOCUS = MARKERS,
POS = MARKERS
)
}
# Order/sort by pop and ind --------------------------------------------------
if (tibble::has_name(data, "POP_ID")) {
data %<>% dplyr::arrange(POP_ID, INDIVIDUALS)
} else {
data %<>% dplyr::arrange(INDIVIDUALS)
}
id.string <- unique(data$INDIVIDUALS)# keep to sort vcf columns
# Remove the POP_ID column ---------------------------------------------------
if (tibble::has_name(data, "POP_ID") && (!pop.info)) data %<>% dplyr::select(-POP_ID)
# Info field -----------------------------------------------------------------
info.field <- suppressWarnings(
dplyr::select(.data = data, MARKERS, GT_VCF) %>%
dplyr::filter(GT_VCF != "./.") %>%
dplyr::count(x = ., MARKERS) %>%
dplyr::mutate(INFO = stringi::stri_join("NS=", n, sep = "")) %>%
dplyr::select(-n)
)
# VCF body ------------------------------------------------------------------
GT_VCF_POP_ID <- NULL
if (pop.info) {
output <- suppressWarnings(
dplyr::left_join(data, info.field, by = "MARKERS") %>%
dplyr::select(MARKERS, CHROM, LOCUS, POS, REF, ALT, INFO, INDIVIDUALS, GT_VCF, POP_ID) %>%
dplyr::mutate(GT_VCF_POP_ID = stringi::stri_join(GT_VCF, POP_ID, sep = ":")) %>%
dplyr::select(-c(GT_VCF, POP_ID)) %>%
radiator::rad_wide(
x = .,
formula = "MARKERS + CHROM + LOCUS + POS + INFO + REF + ALT ~ INDIVIDUALS",
names_from = "GT_VCF_POP_ID") %>%
dplyr::mutate(
QUAL = rep(".", n()),
FILTER = rep("PASS", n()),
FORMAT = rep("GT:POP", n())
)
)
} else {
output <- suppressWarnings(
dplyr::left_join(data, info.field, by = "MARKERS") %>%
dplyr::select(MARKERS, CHROM, LOCUS, POS, REF, ALT, INDIVIDUALS, GT_VCF, INFO) %>%
radiator::rad_wide(x = ., formula = "MARKERS + CHROM + LOCUS + POS + INFO + REF + ALT ~ INDIVIDUALS", values_from = "GT_VCF") %>%
dplyr::mutate(
QUAL = rep(".", n()),
FILTER = rep("PASS", n()),
FORMAT = rep("GT", n())
)
)
}
# Transform the REF/ALT format back to A/C/G/T if 001, 002, etc is found
ref.change <- TRUE %in% unique(c("001", "002", "003", "004") %in% unique(output$REF))
if (ref.change) {
output %<>%
dplyr::mutate(
REF = dplyr::recode(REF, "A" = "001", "C" = "002", "G" = "003", "T" = "004"),
ALT = dplyr::recode(ALT, "A" = "001", "C" = "002", "G" = "003", "T" = "004")
)
}
if (tibble::has_name(output, "COL")) {
output %<>% dplyr::mutate(LOCUS = stringi::stri_join(LOCUS, COL, sep = "_"))
} else {
output %<>% dplyr::mutate(LOCUS = stringi::stri_join(LOCUS, as.numeric(POS) - 1, sep = "_"))
}
# Keep the required columns
output %<>%
dplyr::arrange(CHROM, LOCUS, POS) %>%
dplyr::select(-MARKERS) %>%
dplyr::select('#CHROM' = CHROM, POS, ID = LOCUS, REF, ALT, QUAL, FILTER, INFO, FORMAT, id.string)
}
# Filename ------------------------------------------------------------------
if (is.null(filename)) {
# Get date and time to have unique filenaming
filename <- stringi::stri_join("radiator_vcf_file_", file.date, ".vcf")
} else {
filename <- stringi::stri_join(filename, ".vcf")
}
# File format ----------------------------------------------------------------
readr::write_delim(
x = tibble::tibble("##fileformat=VCFv4.3"),
file = filename,
delim = " ",
append = FALSE,
col_names = FALSE
)
# File date ------------------------------------------------------------------
x <- paste0("##fileDate=", file.date)
readr::write_delim(
x = tibble::tibble(x),
file = filename,
delim = " ",
append = TRUE,
col_names = FALSE
)
# Source ---------------------------------------------------------------------
if (is.null(source)) {
source <- stringi::stri_join("##source=radiator_v.",
as.character(utils::packageVersion("radiator")))
readr::write_delim(
x = tibble::tibble(source),
file = filename,
delim = " ",
append = TRUE,
col_names = FALSE)
} else {
readr::write_delim(
x = tibble::tibble(
stringi::stri_join('##source=', rlang::quo_name(source))
),
file = filename,
delim = " ",
append = TRUE,
col_names = FALSE
)
}
# Info field 1 ---------------------------------------------------------------
info1 <- as.data.frame('##INFO=<ID=NS,Number=1,Type=Integer,Description=\"Number of Samples With Data\">')
utils::write.table(x = info1, file = filename, sep = " ", append = TRUE, col.names = FALSE, row.names = FALSE, quote = FALSE)
# Format field 1 -------------------------------------------------------------
format1 <- '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">'
format1 <- as.data.frame(format1)
utils::write.table(x = format1, file = filename, sep = " ", append = TRUE, col.names = FALSE, row.names = FALSE, quote = FALSE)
# Format field 2 ---------------------------------------------------------------
if (pop.info && !empty) {
format2 <- as.data.frame('##FORMAT=<ID=POP_ID,Number=1,Type=Character,Description="Population identification of Sample">')
utils::write.table(x = format2, file = filename, sep = " ", append = TRUE, col.names = FALSE, row.names = FALSE, quote = FALSE)
}
# Write the prunned vcf to the file ------------------------------------------
suppressWarnings(readr::write_tsv(x = output, file = filename, append = TRUE, col_names = TRUE))
}# end write_vcf
# extract_individuals_vcf-------------------------------------------------------
#' @title Extract individuals from vcf file
#' @description Function that returns the individuals present in a vcf file.
#' Useful to create a strata file or
#' to make sure you have the right individuals in your VCF.
#' @param vcf (character, path) The path to the vcf file.
#' @rdname extract_individuals_vcf
#' @export
#' @return A tibble with a column: \code{INDIVIDUALS}.
#' @seealso \pkg{radiator} \code{\link{read_strata}}
#' @author Thierry Gosselin \email{thierrygosselin@icloud.com}
#'
#' @examples
#' \dontrun{
#' # Built a strata file:
#' strata <- radiator::extract_individuals_vcf("my.vcf") %>%
#' dplyr::mutate(STRATA = "fill this") %>%
#' readr::write_tsv(x = ., file = "my.new.vcf.strata.tsv")
#' }
extract_individuals_vcf <- function(vcf) {
id <- tibble::tibble(
INDIVIDUALS = SeqArray::seqVCF_Header(vcf.fn = vcf, getnum = TRUE)$sample.id
)
return(id)
}#End extract_individuals_vcf
# extract_info_vcf-------------------------------------------------------
#' @title extract_info_vcf
#' @description Extract vcf information
#' @rdname extract_info_vcf
#' @keywords internal
#' @export
extract_info_vcf <- function(vcf) {
vcf.info <- SeqArray::seqVCF_Header(vcf.fn = vcf, getnum = TRUE)
res <- list(
vcf.source = vcf.info$header$value[2],
n.ind = vcf.info$num.sample,
n.markers = vcf.info$num.variant,
sample.id = vcf.info$sample.id
)
return(res)
}#End extract_info_vcf
# check_header_source_vcf-------------------------------------------------------
#' @title Check the vcf header and detect vcf source
#' @description Check the vcf header and detect vcf source
#' @rdname check_header_source_vcf
#' @keywords internal
#' @export
check_header_source_vcf <- function(vcf) {
# samtools problem encountered 20200918: PL now set to .
check.header <- SeqArray::seqVCF_Header(vcf.fn = vcf)
problematic.id <- c("AD", "AO", "QA", "GL", "CATG", "RO", "QR", "MIN_DP", "PL", "DPR")
problematic.id <-
purrr::keep(
.x = problematic.id,
.p = problematic.id %in% check.header$format$ID
)
for (p in problematic.id) {
check.header$format[check.header$format$ID == p, "Number"] <- "."
}
# check.header$format
# DArT vcf problem
probl.dart <- check.header$format[check.header$format$ID == "GT", "Type"] == "Integer"
if (length(probl.dart) == 0) probl.dart <- FALSE
if (probl.dart) check.header$format[check.header$format$ID == "GT", "Type"] <- "String"
check.source <- check.header$header$value[check.header$header$id == "source"]
if (length(check.source) == 0) {
is.stacks <- FALSE
check.source <- "unknown"
} else {
is.stacks <- stringi::stri_detect_fixed(str = check.source, pattern = "Stacks")
}
dirty.freebayes <- FALSE
if (stringi::stri_detect_fixed(str = check.source, pattern = "freeBayes") &&
stringi::stri_detect_fixed(str = check.source, pattern = "dirty")) {
message("\n\nIMPORTANT: VCF from freeBayes Dirty version: only GT and DP fields will be kept...\n\n")
check.header$format <- dplyr::filter(check.header$format, ID %in% c("GT", "DP"))
# Add this for problematic freeBayes...
prob.info <- c("AN", "DP", "DPB", "EPPR", "GTI", "MQMR", "NS", "NUMALT", "ODDS", "PAIREDR", "PQR", "PRO", "QR", "RO", "RPPR", "SRF", "SRP", "SRR")
prob.info <-
purrr::keep(
.x = prob.info,
.p = prob.info %in% check.header$info$ID
)
for (p in prob.info) {
check.header$info[check.header$info$ID == p, "Number"] <- "."
}
}
if (is.stacks) {
stacks.2 <- keep.stacks.gl <- stringi::stri_detect_fixed(
str = check.source,
pattern = "Stacks v2")
keep.stacks.gl <- TRUE
if (!keep.stacks.gl) {
check.header$format <- dplyr::filter(check.header$format, ID != "GL")
}
markers.info <- NULL
overwrite.metadata <- NULL
} else {
stacks.2 <- FALSE
markers.info <- NULL
overwrite.metadata <- NULL
}
return(
res = list(
data.source = stacks.2,
check.header = check.header,
markers.info = markers.info,
overwrite.metadata = overwrite.metadata
)
)
}#End check_header_source_vcf
## Split vcf--------------------------------------------------------------------
#' @title Split a VCF file
#' @description This function allows to split a VCF file in several VCFs,
#' based on individuals or populations.
#' @param strata A file identical to the strata file usually used in radiator,
#' with an additional column named: \code{SPLIT}.
#' This new column contains numerical values
#' (e.g. 1, 1, 1, ..., 2, 2, 2, 2, ..., 3, 3, ...),
#' that indicate for each INDIVIDUALS/STRATA, how to split.
#' The number of VCF to split to is based on the max value found in the column
#' \code{SPLIT}, above this would result in 3 VCF files created).
#' @inheritParams tidy_genomic_data
#' @inheritParams radiator_common_arguments
#' @return The function returns in the global environment a list with
#' the different tidy dataset from the split vcf. In the working directory,
#' the splitted VCF files with \code{"_1", "_2"} in the name.
#' @examples
#' \dontrun{
#' split.data <- radiator::split_vcf(
#' data = "batch_1.vcf",
#' strata = "strata.split.tsv",
#' blacklist.id = "blacklisted.id.txt",
#' whitelist.markers = "whitelist.loci.txt")
#' }
#' @keywords internal
#' @rdname split_vcf
#' @export
#' @author Thierry Gosselin \email{thierrygosselin@@icloud.com}
split_vcf <- function(
data,
strata,
parallel.core = parallel::detectCores() - 1,
...
) {
cat("#######################################################################\n")
cat("########################### radiator::split_vcf #########################\n")
cat("#######################################################################\n")
timing <- proc.time()
# manage missing arguments -----------------------------------------------------
if (missing(data)) rlang::abort("data/vcf file missing")
if (missing(strata)) rlang::abort("strata file missing")
# if (!is.null(pop.levels) & is.null(pop.labels)) {
# pop.levels <- stringi::stri_replace_all_fixed(
# pop.levels, pattern = " ", replacement = "_", vectorize_all = FALSE)
# pop.labels <- pop.levels
# }
# if (!is.null(pop.labels) & is.null(pop.levels)) rlang::abort("pop.levels is required if you use pop.labels")
# if (!is.null(pop.labels)) {
# if (length(pop.labels) != length(pop.levels)) rlang::abort("pop.labels and pop.levels must have the same length (number of groups)")
# pop.labels <- stringi::stri_replace_all_fixed(pop.labels, pattern = " ", replacement = "_", vectorize_all = FALSE)
# }
# Filename -------------------------------------------------------------------
# Get date and time to have unique filenaming
file.date <- format(Sys.time(), "%Y%m%d@%H%M")
filename <- stringi::stri_join("radiator_split_vcf_", file.date)
# import data ----------------------------------------------------------------
# if (!is.null(blacklist.id)) {# With blacklist of ID
# if (is.vector(blacklist.id)) {
# suppressMessages(blacklist <- readr::read_tsv(blacklist.id, col_names = TRUE))
# } else {
# if (!tibble::has_name(blacklist.id, "INDIVIDUALS")) {
# rlang::abort("Blacklist of individuals should have 1 column named: INDIVIDUALS")
# }
# blacklist <- blacklist.id
# }
# blacklist$INDIVIDUALS <- stringi::stri_replace_all_fixed(
# str = blacklist$INDIVIDUALS,
# pattern = c("_", ":"),
# replacement = c("-", "-"),
# vectorize_all = FALSE
# )
#
# # remove potential duplicate id
# blacklist <- dplyr::distinct(.data = blacklist, INDIVIDUALS)
# }
split <- suppressMessages(readr::read_tsv(file = strata))
strata <- dplyr::select(split, -SPLIT)
split <- dplyr::select(split, -STRATA)
# if (!is.null(blacklist.id)) {
# split <- dplyr::anti_join(x = split, y = blacklist, by = "INDIVIDUALS")
# }
split$INDIVIDUALS <- stringi::stri_replace_all_fixed(
str = split$INDIVIDUALS,
pattern = c("_", ":"),
replacement = c("-", "-"),
vectorize_all = FALSE
)
# Function required
split_vcf <- carrier::crate(function(data, filename) {
split.id <- unique(data$SPLIT)
filename <- stringi::stri_join(filename, "_", split.id)
radiator::write_vcf(data = dplyr::select(data, -SPLIT),
pop.info = FALSE, filename = filename)
})#End split_vcf
message("Importing and tidying the vcf...")
input <- suppressMessages(
radiator::tidy_genomic_data(
data = data,
strata = strata,
vcf.metadata = FALSE,
whitelist.markers = whitelist.markers,
filename = NULL,
verbose = FALSE) %>%
dplyr::full_join(split, by = "INDIVIDUALS")
)
split <- strata <- blacklist <- NULL
radiator_future(
.x = input,
.f = split_vcf,
split.vec = FALSE,
split.with = "SPLIT",
flat.future = "walk",
parallel.core = parallel.core,
filename = filename
)
# results --------------------------------------------------------------------
message("Split VCFs were written in the working directory")
timing <- proc.time() - timing
message("\nComputation time: ", round(timing[[3]]), " sec")
cat("############################## completed ##############################\n")
return(input)
}#End split_vcf
# Merge vcf---------------------------------------------------------------------
# @title Merge VCF files
# @description This function allows to merge 2 VCF files.
# @param vcf1 First VCF file.
# @param strata1 strata file for vcf1.
# @param vcf2 Second VCF file.
# @param strata2 strata file for vcf2.
# @param filename Name of the merged VCF file.
# With the default, the function gives a filename based on date and time.
# Default: \code{filename = NULL}.
# @inheritParams tidy_genomic_data
# @return The function returns in the global environment a tidy dataset with
# the merged VCF files and the merged VCF in the working directory.
# @examples
# \dontrun{
# # The simplest way to run the function:
# sum <- radiator::merge_vcf(
# vcf1 = "batch_1.vcf", strata1 = "strata1_brook_charr.tsv",
# vcf1 = "batch_2.vcf", strata2 = "strata2_brook_charr.tsv",
# pop.select = c("QC", "ON", "NE"),
# maf.thresholds = c(0.002, 0.001),
# maf.pop.num.threshold = 1,
# maf.approach = "SNP",maf.operator = "OR",
# filename = "my_new_VCF.vcf"
# }
# @rdname merge_vcf
# @export
# @author Thierry Gosselin \email{thierrygosselin@@icloud.com}
# merge_vcf <- function(
# vcf1, strata1,
# vcf2, strata2,
# whitelist.markers = NULL,
# filename = NULL,
# parallel.core = parallel::detectCores() - 1
# ) {
# cat("#######################################################################\n")
# cat("########################### radiator::merge_vcf #########################\n")
# cat("#######################################################################\n")
# timing <- proc.time()
#
# # manage missing arguments
# if (missing(vcf1)) rlang::abort("vcf1 file missing")
# if (missing(vcf2)) rlang::abort("vcf2 file missing")
# if (missing(strata1)) rlang::abort("strata1 file missing")
# if (missing(strata2)) rlang::abort("strata2 file missing")
#
#
# # if (!is.null(pop.levels) & is.null(pop.labels)) {
# # pop.levels <- stringi::stri_replace_all_fixed(
# # pop.levels, pattern = " ", replacement = "_", vectorize_all = FALSE)
# # pop.labels <- pop.levels
# # }
# # if (!is.null(pop.labels) & is.null(pop.levels)) rlang::abort("pop.levels is required if you use pop.labels")
# # if (!is.null(pop.labels)) {
# # if (length(pop.labels) != length(pop.levels)) rlang::abort("pop.labels and pop.levels must have the same length (number of groups)")
# # pop.labels <- stringi::stri_replace_all_fixed(pop.labels, pattern = " ", replacement = "_", vectorize_all = FALSE)
# # }
#
# # Filename
# # Get date and time to have unique filenaming
# if (is.null(filename)) {
# file.date <- format(Sys.time(), "%Y%m%d@%H%M")
# }
#
# # import data
# message("Importing and tidying the vcf1...")
# input <- suppressMessages(radiator::tidy_genomic_data(
# data = vcf1,
# strata = strata1,
# vcf.metadata = FALSE,
# whitelist.markers = whitelist.markers,
# filename = NULL,
# verbose = FALSE))
#
# message("Importing and tidying the vcf2...")
# # Also Using pop.levels and pop.labels info if present
# input <- suppressWarnings(
# dplyr::bind_rows(
# input,
# suppressMessages(
# radiator::tidy_genomic_data(
# data = vcf2,
# strata = strata2,
# vcf.metadata = FALSE,
# whitelist.markers = whitelist.markers,
# filename = NULL,
# verbose = FALSE))))
#
# message("Adjusting REF/ALT alleles...")
# input <- radiator::calibrate_alleles(
# data = input,
# parallel.core = parallel.core,
# verbose = TRUE)$input
#
# # if (filter.monomorphic) {
# # input <- radiator::filter_monomorphic(data = input, verbose = TRUE)
# # }
# #
# # if (filter.common.markers) {
# # input <- radiator::filter_common_markers(data = input, verbose = TRUE)$input
# # }
# #
# # if (!is.null(filter.ma)) {
# # input <- filter_maf(
# # data = input,
# # interactive.filter = FALSE,
# # filter.ma = filter.ma,
# # parallel.core = parallel.core,
# # verbose = FALSE)$tidy.filtered.mac
# # }
#
# # Write VCF in the working directory
# radiator::write_vcf(data = input, pop.info = FALSE, filename = filename)
#
# # results
# message("Merged VCF in the working directory: ", filename, ".vcf")
# timing <- proc.time() - timing
# message("\nComputation time: ", round(timing[[3]]), " sec")
# cat("############################## completed ##############################\n")
# return(input)
# }#End merge_vcf
# vcf_strata -------------------------------------------------------------------
#' @name vcf_strata
#' @title Join stratification metadata to a VCF (population-aware VCF)
#' @description Include stratification metadata, e.g. population-level information,
#' to the \code{FORMAT} field of a VCF file.
#' @param data A VCF file
#' @param strata (optional) A tab delimited file at least 2 columns
#' with header:
#' \code{INDIVIDUALS} and \code{STRATA}.
#' The \code{STRATA} and any other columns can be any hierarchical grouping.
#' To create a strata file see \code{\link[radiator]{individuals2strata}}.
#' @param filename (optional) The file name for the modifed VCF,
#' written to the working directory. Default: \code{filename = NULL} will make a
#' custom filename with data and time.
#' @export
#' @rdname vcf_strata
#' @return A VCF file in the working directory with new \code{FORMAT} field(s)
#' correponding to the strata column(s).
#' @seealso
#' \href{https://vcftools.github.io}{VCF web page}
#'
#' \href{VCF specification page}{https://vcftools.github.io/specs.html}
#' @references Danecek P, Auton A, Abecasis G et al. (2011)
#' The variant call format and VCFtools.
#' Bioinformatics, 27, 2156-2158.
#' @author Thierry Gosselin \email{thierrygosselin@@icloud.com}
#' @keywords internal
vcf_strata <- function(data, strata, filename = NULL) {
# data <- "batch_1.vcf"
# strata <- "strata.sturgeon.12pop.tsv"
# filename <- NULL
# data <- "example_vcf2dadi_ferchaud_2015.vcf"
# strata <- "strata.stickleback.tsv"
cat("#######################################################################\n")
cat("######################### radiator: vcf_strata ##########################\n")
cat("#######################################################################\n")
# Checking for missing and/or default arguments ******************************
if (missing(data)) rlang::abort("Input file missing")
if (missing(strata)) rlang::abort("Strata file missing")
# import the first 50 lines
quick.scan <- readr::read_lines(file = data, n_max = 75)
# Function to detect where CHROM line starts
detect_vcf_header <- function(x) {
stringi::stri_detect_fixed(str = x, pattern = "CHROM", negate = FALSE)
}
# Detect the index
max.vcf.header <- purrr::detect_index(.x = quick.scan, .p = detect_vcf_header) - 1
# import VCF header and add a row containing the new format field
vcf.header <- readr::read_delim(file = data, n_max = max.vcf.header, col_names = "VCF_HEADER", delim = "\n")
# import the vcf file, no filters etc.
message("Importing the VCF file")
input <- data.table::fread(
input = data,
sep = "\t",
stringsAsFactors = FALSE,
header = TRUE,
skip = "CHROM",
showProgress = TRUE,
verbose = FALSE
) %>%
tibble::as_tibble()
# transform in long format
input %<>%
radiator::rad_long(
x = .,
cols = c("#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT"),
names_to = "INDIVIDUALS",
values_to = "FORMAT_ID",
variable_factor = FALSE
) %>%
dplyr::mutate(
INDIVIDUALS = stringi::stri_replace_all_fixed(
str = INDIVIDUALS,
pattern = c("_", ":"),
replacement = c("-", "-"),
vectorize_all = FALSE)
)
# population levels and strata ---------------------------------------------
message("Importing the strata file")
if (is.vector(strata)) {
# message("strata file: yes")
number.columns.strata <- max(utils::count.fields(strata, sep = "\t"))
col.types <- stringi::stri_join(rep("c", number.columns.strata), collapse = "")
strata.df <- readr::read_tsv(file = strata, col_names = TRUE, col_types = col.types) %>%
dplyr::rename(POP_ID = STRATA)
} else {
# message("strata object: yes")
colnames(strata) <- stringi::stri_replace_all_fixed(
str = colnames(strata),
pattern = "STRATA",
replacement = "POP_ID",
vectorize_all = FALSE
)
strata.df <- strata
}
strata.number <- length(strata.df) - 1
strata.colnames <- purrr::discard(.x = colnames(strata.df), .p = colnames(strata.df) %in% "INDIVIDUALS")
# Replace unwanted whitespace pattern in the strata
strata.df <- strata.df %>%
dplyr::mutate(
INDIVIDUALS = stringi::stri_replace_all_fixed(
str = INDIVIDUALS,
pattern = c("_", ":"),
replacement = c("-", "-"),
vectorize_all = FALSE
),
POP_ID = stringi::stri_replace_all_fixed(
str = POP_ID,
pattern = " ",
replacement = "_",
vectorize_all = FALSE
)
)
# Join strata to input and merge strata column to FORMAT field
message("Joining the strata to the VCF into new field format...")
input <- input %>%
dplyr::left_join(strata.df, by = "INDIVIDUALS") %>%
tidyr::unite_(
data = .,
col = "FORMAT_ID",
from = c("FORMAT_ID", strata.colnames),
sep = ":",
remove = TRUE
) %>%
radiator::rad_wide(
x = .,
formula = "`#CHROM` + POS + ID + REF + ALT + QUAL + FILTER + INFO + FORMAT ~ INDIVIDUALS",
values_from = "FORMAT_ID") %>%
dplyr::mutate(
FORMAT = stringi::stri_join(
FORMAT,
stringi::stri_join(strata.colnames, collapse = ":"),
sep = ":",
collapse = NULL
)
)
# Filename ------------------------------------------------------------------
message("Writing to the working directory...")
if (is.null(filename)) {
# Get date and time to have unique filenaming
file.date <- stringi::stri_replace_all_fixed(Sys.time(), pattern = " EDT", replacement = "", vectorize_all = FALSE)
file.date <- stringi::stri_replace_all_fixed(file.date, pattern = c("-", " ", ":"), replacement = c("", "@", ""), vectorize_all = FALSE)
file.date <- stringi::stri_sub(file.date, from = 1, to = 13)
filename <- stringi::stri_join("radiator_vcf_file_", file.date, ".vcf")
} else {
filename <- stringi::stri_join(filename, ".vcf")
}
# File format ----------------------------------------------------------------
# write_delim(x = data_frame("##fileformat=VCFv4.2"), file = filename, delim = " ", append = FALSE, col_names = FALSE)
vcf.header[1,] <- "##fileformat=VCFv4.3"
# File date ------------------------------------------------------------------
file.date <- stringi::stri_replace_all_fixed(Sys.Date(), pattern = "-", replacement = "")
file.date <- stringi::stri_join("##fileDate=", file.date, sep = "")
# write_delim(x = data_frame(file.date), file = filename, delim = " ", append = TRUE, col_names = FALSE)
vcf.header[2,] <- file.date
# Source ---------------------------------------------------------------------
# write_delim(x = data_frame(stringi::stri_join("##source=radiator_v.", utils::packageVersion("radiator"))), file = filename, delim = " ", append = TRUE, col_names = FALSE)
# vcf.header[3,] <- stringi::stri_replace_all_fixed(str = vcf.header[3,], pattern = '"', replacement = "", vectorize_all = FALSE)
# vcf.header[3,]<- stringi::stri_join(vcf.header[3,], "and radiator v.", utils::packageVersion("radiator"))
# New FORMAT -----------------------------------------------------------------
for (i in strata.colnames) {
vcf.header <- tibble::add_row(
.data = vcf.header,
VCF_HEADER = stringi::stri_join(
"##FORMAT=<ID=", i, ',Number=1,Type=Character,Description="New strata",Source="radiator",Version="', utils::packageVersion("radiator"), '">')
)
}
# VCF HEADER ------------------------------------------------------------------
utils::write.table(x = vcf.header, file = filename, sep = " ", append = FALSE, col.names = FALSE, quote = FALSE, row.names = FALSE)
# Write the data -------------------------------------------------------------
suppressWarnings(readr::write_tsv(x = input, file = filename, append = TRUE, col_names = TRUE))
cat("############################## completed ##############################\n")
}#vcf_strata
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.