R/vcf2pooldata.R

#' Convert a VCF file into a pooldata object.
#' @description Convert VCF files into a pooldata object.
#' @param vcf.file The name (or path) of the VCF file (may be in compressed format)
#' @param poolsizes A numeric vector with haploid pool sizes
#' @param poolnames A character vector with the names of the pools
#' @param min.rc Minimal allowed read count per base (option ignored for VarScan VCF files). Bases covered by fewer than min.rc reads are discarded and considered as sequencing errors. For instance, if nucleotides A, C, G and T are covered by respectively 100, 15, 0 and 1 reads over all the pools, setting min.rc to 0 will cause the position to be discarded (the polymorphism being considered tri-allelic), while setting min.rc to 1 (or 2, 3..14) will make the position be considered as a SNP with two alleles A and C (the single read for allele T being disregarded). For VarScan VCF files, markers with more than one alternative allele are discarded because the VarScan AD field only contains one alternate read count.
#' @param min.cov.per.pool Minimal allowed read count (per pool). If at least one pool is not covered by at least min.cov.per.pool reads, the position is discarded
#' @param max.cov.per.pool Maximal allowed read count (per pool). If at least one pool is covered by more than max.cov.per.pool reads, the position is discarded
#' @param min.maf Minimal allowed Minor Allele Frequency (computed from the ratio of the overall read count for the reference allele to the overall read coverage)
#' @param remove.indels Remove indels identified using the number of characters of the alleles in the REF or ALT fields (i.e., if at least one allele is more than 1 character, the position is discarded)
#' @param min.dist.from.indels Remove SNPs within min.dist.from.indels from an indel, i.e., SNPs with position p satisfying (indel.pos-min.dist)<=p<=(indel.pos+min.dist+l.indel-1), where l.indel is the length of the reference indel allele. If min.dist.from.indels>0, indels are also removed (i.e., remove.indels is set to TRUE).
#' @param nlines.per.readblock Number of lines read simultaneously. Should be adapted to the available RAM.
#' @param verbose If TRUE extra information is printed on the terminal
#' @return A pooldata object containing 7 elements:
#' \enumerate{
#' \item "refallele.readcount": a matrix with nsnp rows and npools columns containing read counts for the reference allele (chosen arbitrarily) in each pool
#' \item "readcoverage": a matrix with nsnp rows and npools columns containing read coverage in each pool
#' \item "snp.info": a data frame with nsnp rows and four columns containing respectively the contig (or chromosome) name (1st column) and position (2nd column) of the SNP; the allele taken as reference in the refallele.readcount matrix (3rd column); and the alternative allele (4th column)
#' \item "poolsizes": a vector of length npools containing the haploid pool sizes
#' \item "poolnames": a vector of length npools containing the names of the pools
#' \item "nsnp": a scalar corresponding to the number of SNPs
#' \item "npools": a scalar corresponding to the number of pools
#' }
#' @details The genotype format in the VCF file for each pool is assumed to contain either i) an AD field containing allele counts separated by a comma (as produced by popular software such as GATK or samtools/bcftools) or ii) both an RD field (reference allele count) and an AD field (alternate allele count), as obtained with the VarScan mpileup2snp program (when run with the --output-vcf option). The underlying format is automatically detected by the function. For VarScan-generated VCF files, note that SNPs with more than one alternate allele are discarded (because only a single count is then reported in the AD field), making the min.rc option unavailable. The VarScan --min-reads2 option might replace this functionality to some extent, although SNPs for which the two major alleles in the Pool-Seq data both differ from the reference allele (e.g., expected to be more frequent when using a distantly related reference genome for mapping) will be disregarded.
#' @examples
#'  make.example.files(writing.dir=tempdir())
#'  pooldata=vcf2pooldata(vcf.file=paste0(tempdir(),"/ex.vcf.gz"),poolsizes=rep(50,15))
#' @export
vcf2pooldata<-function(vcf.file="",poolsizes=NA,poolnames=NA,min.cov.per.pool=-1,min.rc=1,max.cov.per.pool=1e6,min.maf=-1,remove.indels=FALSE,min.dist.from.indels=0,nlines.per.readblock=1000000,verbose=TRUE){
  if(nchar(vcf.file)==0){stop("ERROR: Please provide the name of the vcf file as generated by e.g. VarScan")}
  if(sum(is.na(poolsizes))>0){stop("ERROR: Please provide a vector of Pool Sizes (poolsizes argument)")}
  if(min.dist.from.indels>0){remove.indels=TRUE}
  ##### Retrieve info and check validity of arguments
  poolsizes=as.numeric(poolsizes)
  if(verbose){cat("Reading Header lines\n")}
  file.con=file(vcf.file,open="r") 
  continue.reading=TRUE
  nlines.header=0
  while(continue.reading){
    tmp.data=scan(file=file.con,nlines = 1,what="character",quiet=TRUE,quote=NULL)
    nlines.header=nlines.header+1
    if(tmp.data[1]=="#CHROM"){continue.reading=FALSE}
    if(substr(tmp.data[1],1,1)!="#"){
      close(file.con)
      stop("ERROR: The vcf file is not valid. Could not find any header lines (i.e., starting with #CHROM)")}
  }
  npools=length(tmp.data)-9
  if(length(poolsizes)!=npools){
    close(file.con)
    stop("ERROR: The number of pools in the vcf file is different from the length of the vector of pool sizes")}
  if(sum(is.na(poolnames))>0){
    poolnames=paste0("Pool",1:npools)
  }else{
    poolnames=as.character(poolnames)
    if(length(poolnames)!=npools){
      close(file.con)
      stop("ERROR: The number of pools in the vcf file is different from the length of vector of pool names")}
  }
  continue.reading=TRUE
  ######
  ##initialize output
  #######
  res<-new("pooldata")
  res@npools=npools ; res@nsnp=0
  res@poolsizes=poolsizes ;  res@poolnames=poolnames
  res@refallele.readcount=res@readcoverage=matrix(NA,0,npools)
  snpdet=matrix(NA,0,4)
  
  #####
  #retrieve format (and check if vcf file is empty: i.e., no lines)
  #######
  file.con2=file(vcf.file,open="r") 
  tmp.data=scan(file=file.con2,nlines = 1,skip=nlines.header,what="character",quiet=TRUE,quote=NULL)
  close(file.con2)
  if(length(tmp.data)==0){
    continue.reading=FALSE
    cat(paste("Warning: The vcf file",vcf.file,"is empty\n"))
  }else{#Check and retrieve format (VarScan or other, i.e., are there both AD and RD fields as in VarScan, or just AD as in bcftools, GATK...)
    tmp.format=unlist(strsplit(tmp.data[9],split=":"))
    ad.index=which(tmp.format=="AD") ; rd.index=which(tmp.format=="RD")
    if(length(ad.index)==0){
      close(file.con)
      stop("ERROR: No field containing allele depth (AD field) was detected in the vcf file")
    }
    if(length(rd.index)==0){
      VARSCAN=FALSE
      if(verbose){
        cat("Standard format (i.e., as in Bcftools, GATK, etc.) detected for the AD field:\n the read count for all the identified alleles are separated by a comma\n")
      }
    }else{
      VARSCAN=TRUE
      if(verbose){
        cat("VarScan like format detected for allele count data:\n the AD field contains allele depth\nfor the alternate allele and RD field for the reference allele\n(N.B., positions with more than one alternate allele will be ignored)\n")
      }
    }
  }
  ###############
  #start parsing
  ##############
  time1=proc.time()
  nlines.read=0
  if(verbose){cat("Parsing allele counts\n")}
  while(continue.reading){
    tmp.data=matrix(scan(file=file.con,nlines = nlines.per.readblock,what="character",quiet=TRUE,quote=NULL),ncol=npools+9,byrow=TRUE)
    tmp.nlines.read=nrow(tmp.data)
    if(tmp.nlines.read<nlines.per.readblock){continue.reading=FALSE}
    #discard monomorphic positions (drop=FALSE keeps a matrix even if a single row remains)
    tmp.data=tmp.data[tmp.data[,5]!=".",,drop=FALSE]
    #Count the number of alleles and identify if indels or not
    tmp.allele.scan=.scan_allele_info(paste(tmp.data[,4],tmp.data[,5],sep=",")) #first column=number of alleles (including ref) and second column=1 if indel (0 otherwise)
    if(VARSCAN){
      #For VarScan VCF files, markers with more than 2 alleles (i.e., the ALT field contains a comma) need to be eliminated at this stage because counts for every base are no longer available (only one of the alternate bases is considered in the AD field!)
      dum.sel=tmp.allele.scan[,1]==2
      tmp.allele.scan=tmp.allele.scan[dum.sel,,drop=FALSE]
      tmp.data=tmp.data[dum.sel,,drop=FALSE]
    }
    ##indel processing
    if(remove.indels){
      dum.sel=tmp.allele.scan[,2]==0
      if(sum(!dum.sel)>0){
        if(min.dist.from.indels>0){
          tmp.tst=.find_indelneighbor_idx(tmp.data[,1],as.numeric(tmp.data[,2]),
                                          which(!dum.sel)-1,min.dist.from.indels,
                                          nchar(tmp.data[!dum.sel,4]))
          dum.sel[tmp.tst==1]=FALSE
        }
        tmp.allele.scan=tmp.allele.scan[dum.sel,,drop=FALSE]
        tmp.data=tmp.data[dum.sel,,drop=FALSE]
      }
    }
    nalt_all=tmp.allele.scan[,1]   ##Positions could perhaps be discarded if the number of alleles is very high, e.g., >5 (this can happen with indels)
    npos=nrow(tmp.data)
    if(npos>1){
      if(VARSCAN){
        tmp.extract=.extract_vscan_counts(tmp.data[,-1:-9],ad_idx = ad.index,rd_idx = rd.index)
        tmp.Y=tmp.extract[,1:npools]
        tmp.N=tmp.extract[,(npools+1):(2*npools)]
        tmp.maf=0.5-abs(0.5-rowSums(tmp.Y)/rowSums(tmp.N))
        tmp.snpdet=tmp.data[,c(1,2,4,5)]
        rm(tmp.extract)
      }else{
        tmp.extract=.extract_nonvscan_counts(vcf_data = tmp.data[,-1:-9],ad_idx = ad.index,nb_all = nalt_all,min_rc=min.rc)
        dum.sel=tmp.extract[,2*npools+5]==0 #filtering multi-allelic SNPs/indels non passing the min.rc criterion
        tmp.Y=tmp.extract[dum.sel,1:npools] ; tmp.N=tmp.extract[dum.sel,(npools+1):(2*npools)]
        tmp.maf=0.5-abs(0.5-tmp.extract[dum.sel,2*npools+3]/rowSums(tmp.extract[dum.sel,2*npools+3:4]))#0.5-abs(0.5-rowSums(tmp.Y)/rowSums(tmp.N))
        tmp.snpdet=tmp.data[dum.sel,c(1,2,4,5)]
        #retrieving correct allele names for multi-allelic SNPs (NB: for these markers, Ref becomes the allele with the highest count across alleles)
        dum.all.idx=tmp.extract[dum.sel,2*npools+1:2]
        dum.nball=nalt_all[dum.sel]
        dum.sel=which(dum.nball>2) #which(dum.all.idx[,1]!=1 | dum.all.idx[,2]!=2) 
        if(length(dum.sel)>0){
          if(length(dum.sel)==1){
            dum.all.idx=matrix(dum.all.idx[dum.sel,],nrow=1)
          }else{
            dum.all.idx=dum.all.idx[dum.sel,]
          }
          dum.newall=.extract_allele_names(paste(tmp.snpdet[dum.sel,3],tmp.snpdet[dum.sel,4],sep=","),dum.all.idx)
          tmp.snpdet[dum.sel,3:4]=dum.newall
        }
        rm(tmp.extract)
      }
      ###filtering according to coverage and maf criteria
      dum.sel=(rowSums(tmp.N>=min.cov.per.pool)==npools) & (rowSums(tmp.N<=max.cov.per.pool)==npools) & (tmp.maf>min.maf)
      if(sum(dum.sel)>0){
        tmp.Y=tmp.Y[dum.sel,] ; tmp.N=tmp.N[dum.sel,] ; tmp.snpdet=tmp.snpdet[dum.sel,]
        res@refallele.readcount=rbind(res@refallele.readcount,tmp.Y)
        res@readcoverage=rbind(res@readcoverage,tmp.N)
        res@nsnp=nrow(res@refallele.readcount)
        snpdet=rbind(snpdet,tmp.snpdet)
      }
      rm(tmp.Y,tmp.N,tmp.snpdet)
      nlines.read=nlines.read+tmp.nlines.read
      if(verbose){
        time.elapsed=(proc.time()-time1)[3]
        nhours=floor(time.elapsed/3600)
        nminutes=floor((time.elapsed-nhours*3600)/60)
        nseconds=round(time.elapsed-nhours*3600-nminutes*60)
        cat(nlines.read," lines processed in",nhours,"h ",nminutes, "m ",nseconds,"s :",res@nsnp,"SNPs found\n")
      }
    }
  }
  close(file.con)
  res@snp.info=data.frame(Chromosome=as.character(snpdet[,1]),
                          Position=as.numeric(snpdet[,2]),
                          RefAllele=as.character(snpdet[,3]),
                          AltAllele=as.character(snpdet[,4]),
                          stringsAsFactors = FALSE)
  rm(snpdet)
  if(verbose){cat("Data consists of",res@nsnp,"SNPs for",res@npools,"Pools\n")}
  return(res)
}
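
The coverage and MAF filter applied near the end of each read block can be tried in isolation. The following is a minimal sketch in base R, assuming made-up read counts for 4 SNPs in 3 pools; the matrices Y and N, the threshold values and the vector keep are illustrative only and are not part of the package.

```r
# Toy data (invented for illustration): rows are SNPs, columns are pools.
Y <- matrix(c(10, 12,   8,
               0,  1,   0,
              30, 30,  30,
               5,  6,  40), ncol = 3, byrow = TRUE)  # reference allele read counts
N <- matrix(c(20, 24,  16,
              20, 20,  20,
              30, 30,  30,
              10, 12, 400), ncol = 3, byrow = TRUE)  # total read coverage
npools <- ncol(N)

# Hypothetical thresholds, named after the function arguments
min.cov.per.pool <- 10
max.cov.per.pool <- 100
min.maf <- 0.01

# Fold the overall reference allele frequency so that it lies in [0, 0.5]
maf <- 0.5 - abs(0.5 - rowSums(Y) / rowSums(N))

# Keep a SNP only if every pool is within the coverage bounds
# and the folded frequency is strictly above min.maf
keep <- (rowSums(N >= min.cov.per.pool) == npools) &
        (rowSums(N <= max.cov.per.pool) == npools) &
        (maf > min.maf)
print(keep)
```

In this toy input the third SNP is fixed for the reference allele (folded frequency 0) and the fourth has one pool above the maximal coverage, so only the first two SNPs pass. Note that the MAF comparison is strict, so a SNP whose folded frequency equals min.maf is discarded.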

poolfstat documentation built on Sept. 8, 2023, 5:49 p.m.