R/annotate_deaminations.R

Defines functions annotate_deaminations write_vcf

Documented in annotate_deaminations write_vcf

#' Annotate deaminations in input vcf file
#'
#' \code{write_vcf} is a special case of \code{\link{annotate_deaminations}}. It exports ideafix variant classification results to the vcf file ideafix has been run over.
#'
#' @param classification tibble containing the classification generated by \code{\link{classify_variants}}. It is the object to be written.
#' @param vcf_filename character string naming the path to the input vcf, i.e. the vcf file containing the variants ideafix has been run over.
#' @param outfolder character string naming the folder to write the file to. Defaults to current working directory (\code{getwd}).
#' @param outname character string naming the output filename. Defaults to "ideafix_labels.vcf".
#'
#' @return None
#' @importFrom utils write.table
#' @details The object to be written corresponds to a data frame or tibble with the following columns: CHROM, POS, REF, ALT, DEAM_SCORE, DEAMINATION. CHROM and POS identify the variant position, REF and ALT describe the reference and alternate alleles. DEAM_SCORE equals to the deamination score yielded by the selected classification algorithm (RF or XGBoost). Note that these values should not be interpreted as ordinary probabilities. DEAMINATION contains the label ideafix has assigned to the variant based on an optimized classification threshold.
#'
#' The output vcf file is the result of adding DEAM_SCORE and DEAMINATION values as a new annotation in the INFO column of the vcf file ideafix has been run over. Note that not all variants in the file will have this new annotation, but only those C:G > T:A variants (as these are the only ones susceptible of being cytosine deaminations).
#'
#' @examples
write_vcf <- function(classification, vcf_filename, outfolder = ".", outname = "ideafix_labels") {
  # Write classification to a temporary tab-delimited file
  tmpdir <- "/tmp"
  tmp_filename <- file.path(tmpdir, "ideafix_preds.temp")
  write.table(classification, file = tmp_filename, sep = "\t", row.names = FALSE, quote = FALSE)
  # Write header tags
  header_text <- '##INFO=<ID=DEAM_SCORE,Number=1,Type=Float,Description="Deamination score of the variant.">\n##INFO=<ID=DEAMINATION,Number=1,Type=String,Description="Classification of the variant as deamination or non-deamination.">'
  hdr_filename <- file.path(tmpdir, "ideafix_preds.hdr")
  writeLines(header_text, con = hdr_filename)
  # Compress and index the file
  compress_cmd <- sprintf("tail -n +2 %s | bgzip -c > %s.gz", tmp_filename, tmp_filename)
  system(compress_cmd, intern = FALSE)
  idx_cmd <- sprintf("tabix -p vcf %s.gz", tmp_filename)
  system(idx_cmd, intern = FALSE)
  # Annotate the vcf file
  annotate_cmd <- sprintf("bcftools annotate -a %s.gz -h %s -c CHROM,POS,REF,ALT,DEAM_SCORE,DEAMINATION -o %s.vcf %s", tmp_filename, hdr_filename, file.path(outfolder, outname), vcf_filename)
  system(annotate_cmd, intern = FALSE)
}

#' Annotate deaminations to file
#'
#' \code{annotate_deaminations} exports ideafix variant classification results to a text file. It either writes them to a new tsv file or appends them to the input vcf file provided to ideafix.
#'
#' @param classification tibble containing the classification generated by \code{\link{classify_variants}}. It is the object to be written.
#' @param format character string indicating the output file format. Can be "tsv" or "vcf". Defaults to "tsv".
#' @param outfolder character string naming the folder to write the file to. Defaults to current working directory (\code{getwd}).
#' @param outname character string naming the output filename. Defaults to "ideafix_labels.txt" or "ideafix_labels.vcf", depending on \code{format} value .
#' @param vcf_filename optional character string, only needed if \code{format} equals to "vcf". It indicates the path to the vcf file ideafix has been run over.
#'
#' @return None
#' @export
#' @importFrom utils write.table
#'
#' @details The object to be written corresponds to a data frame or tibble with the following columns: CHROM, POS, REF, ALT, DEAM_SCORE, DEAMINATION. CHROM and POS identify the variant position, REF and ALT describe the reference and alternate alleles. DEAM_SCORE equals to the deamination score yielded by the selected classification algorithm (RF or XGBoost). Note that these values should not be interpreted as ordinary probabilities. DEAMINATION contains the label ideafix has assigned to the variant based on an optimized classification threshold.
#'
#' The output file format \{code} defaults to "tsv" and in such case the \code{classification} object is written as it is to a tsv file. Instead, if \{code} is set to "vcf", DEAM_SCORE and DEAMINATION values are added as a new annotation in the INFO column of the vcf file ideafix has been run over. Note that not all variants in the file will have this new annotation, but only those C:G > T:A variants (as these are the only ones susceptible of being cytosine deaminations).
#'
#' @examples
annotate_deaminations <- function(classification, format = "tsv", outfolder = ".", outname = "ideafix_labels", vcf_filename = NULL) {
  if (format == "tsv") {
    message(sprintf("Writing tsv file to %s.txt", file.path(outfolder, outname)))
    write.table(classification, file = file.path(outfolder, sprintf("%s.txt", outname)), sep = "\t", row.names = FALSE, quote = FALSE)
  }
  else if (format == "vcf") {
    if (is.null(vcf_filename)) {
      stop("Path to the vcf file ideafix has been run over must be provided.")
    } else {
      message(sprintf("Writing vcf file to %s.vcf", file.path(outfolder, outname)))
      write_vcf(vcf_filename = vcf_filename, classification = classification, outfolder = outfolder, outname = outname)
    }
  } else {
    warning("Export file format not recognized. Results will be exported to tsv file.")
    write.table(classification, file = file.path(outfolder, sprintf("%s.txt", outname)), sep = "\t", row.names = FALSE, quote = FALSE)
  }
}
mmaitenat/ideafix documentation built on Sept. 18, 2021, 7:55 a.m.