README.md

CNproScan

CNproScan is an R package developed for CNV detection in bacterial genomes. It applies the Generalized Extreme Studentized Deviate test for outliers to detect CNVs in read-depth data, and uses discordant read detection to annotate the CNV type.

The older MATLAB version, which is no longer updated, is available here: https://github.com/robinjugas/CNproScanMatlab

This is the latest version, v1.0. For previous versions, see Tags/Releases.

Dependencies:

The package was tested on R 4.x with the following dependencies: parallel, foreach, doParallel, seqinr, Rsamtools, GenomicRanges, IRanges, data.table.

Installation

devtools::install_github("robinjugas/CNproScan")

Input files:

Several input files are necessary:

  1. reference sequence FASTA file used in the read alignment
  2. sorted and indexed BAM file from the read aligner (BWA assumed):

    bwa index -a is reference.fasta
    samtools faidx reference.fasta
    bwa mem reference.fasta read1.fq read2.fq > file.sam
    samtools view -b -F 4 file.sam > file1.bam # mapped reads only
    samtools sort -o file.bam file1.bam
    samtools index file.bam

  3. coverage file (including zero values, via the -a switch):

    samtools depth -a file.bam > file.coverage

  4. genome mappability file, obtained with GenMap (https://github.com/cpockrandt/genmap) - only for the mappability normalization:

    genmap index -F reference.fasta -I mapp_index
    genmap map -K 30 -E 2 -I mapp_index -O mapp_genmap -t -w -bg

  5. origin of replication position(s), obtained from DoriC (https://origin.tubic.org/doric/browse/bacteria) - only for the oriC normalization
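
The coverage file is expected to include zero-depth positions, so a quick sanity check before running the package may be useful. The following is a minimal sketch, assuming the three-column, tab-separated output of samtools depth (contig name, 1-based position, depth). The function name check_coverage_complete is hypothetical and is not part of CNproScan or samtools. It only detects gaps in the position sequence within each contig; it does not compare the row count against the contig length.

```shell
# Hypothetical helper, not part of CNproScan or samtools.
# Assumes the 3-column, tab-separated output of `samtools depth`:
# contig name, 1-based position, depth.
# Prints one line per contig: name, number of rows, and "complete" or "GAPS".
check_coverage_complete() {
    awk -F'\t' '
        # New contig: positions are expected to restart at 1.
        $1 != contig { contig = $1; expected = 1 }
        # Position differs from the expected one: record a gap and resynchronise.
        $2 != expected { gaps[contig]++; expected = $2 }
        # Count the row and advance the expected position.
        { expected++; rows[contig]++ }
        END {
            for (c in rows)
                printf "%s\t%d\t%s\n", c, rows[c], ((c in gaps) ? "GAPS" : "complete")
        }
    ' "$1" | sort
}

# Usage (file name taken from the steps above):
# check_coverage_complete file.coverage
```

A contig reported as GAPS suggests the coverage file was produced without the -a switch.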

Usage:

R script:

library("CNproScan")
# Working directory with files
setwd("workdir")
# File paths
fasta_file <- "reference.fasta"
bam_file <- "file.bam"
coverage_file <- "file.coverage"
bedgraph_file <- "mapp_genmap.bedgraph"

# For only GC normalization
DF <- CNproScanCNV(coverage_file, bam_file, fasta_file, 
                   GCnorm=TRUE, MAPnorm=FALSE, ORICnorm=FALSE, cores=4)

# Without any normalization
DF <- CNproScanCNV(coverage_file, bam_file, fasta_file, 
                   GCnorm=FALSE, MAPnorm=FALSE, ORICnorm=FALSE, cores=4)

# Both GC normalization and mappability normalization
DF <- CNproScanCNV(coverage_file, bam_file, fasta_file, 
                   GCnorm=TRUE, MAPnorm=TRUE, ORICnorm=FALSE, bedgraph_file, cores=4)

# GC normalization, mappability normalization and OriC normalization
DF <- CNproScanCNV(coverage_file, bam_file, fasta_file, 
                   GCnorm=TRUE, MAPnorm=TRUE, ORICnorm=TRUE, bedgraph_file, oriCposition=1, cores=4)

# or with multiple oriC positions
DF <- CNproScanCNV(coverage_file, bam_file, fasta_file, 
                   GCnorm=TRUE, MAPnorm=TRUE, ORICnorm=TRUE, bedgraph_file, oriCposition=c(10,5000), cores=4)

Caution: OriC normalization works only in single-chromosome mode!

# Write VCF file (additional function from the package)
writeVCF(DF, "fileName.vcf")

# write TAB-separated file (optional)
write.table(DF, file = "TSVfile.tsv", row.names=FALSE, col.names = TRUE, sep="\t")
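
Since OriC normalization works only in single-chromosome mode, it can help to confirm beforehand that the reference FASTA holds a single sequence. A minimal shell sketch follows; the helper names count_fasta_sequences and is_single_sequence are hypothetical and not part of CNproScan.

```shell
# Hypothetical helpers, not part of CNproScan.
# Counts sequences in a FASTA file by counting header lines (lines starting with ">").
count_fasta_sequences() {
    grep -c '^>' "$1"
}

# Succeeds (exit status 0) only when the FASTA file holds exactly one sequence.
is_single_sequence() {
    [ "$(count_fasta_sequences "$1")" -eq 1 ]
}

# Usage (file name taken from the R script above):
# is_single_sequence reference.fasta && echo "single sequence: ORICnorm can be used"
```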


Inputs description:

Outputs:

In development - will be added:

Recent updates:

Note for multi chromosome/contig support:

BWA ignores the rest of the FASTA header after the first whitespace. CNproScan expects all sequence names to be the same: the FASTA headers, the BAM RNAME values and the coverage file from samtools must contain the same contig/chromosome names. The package uses seqinr::read.fasta with whole.header = FALSE, which crops the header at the first whitespace. If this behaviour is a problem, please report it as a GitHub issue.
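
The naming requirement above can be checked before running the package. The following is a minimal shell sketch, assuming a coverage file in the samtools depth format (contig name in the first tab-separated column). The helper names fasta_names, coverage_names and names_match are hypothetical and not part of CNproScan.

```shell
# Hypothetical helpers, not part of CNproScan.
# Print FASTA sequence names cropped at the first whitespace, sorted and unique.
fasta_names() {
    grep '^>' "$1" | sed 's/^>//' | awk '{ print $1 }' | sort -u
}

# Print the contig names found in column 1 of a `samtools depth` coverage file.
coverage_names() {
    cut -f1 "$1" | sort -u
}

# Succeeds (exit status 0) only when both sets of names are identical.
names_match() {
    [ "$(fasta_names "$1")" = "$(coverage_names "$2")" ]
}

# Usage (file names taken from the input preparation steps):
# names_match reference.fasta file.coverage || echo "contig names differ"
```

The names stored in the BAM file can be listed for comparison from the @SQ lines of its header, for example with samtools view -H file.bam.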

Citation:

Robin Jugas, Karel Sedlar, Martin Vitek, Marketa Nykrynova, Vojtech Barton, Matej Bezdicek, Martina Lengerova, Helena Skutkova, CNproScan: Hybrid CNV detection for bacterial genomes, Genomics, Volume 113, Issue 5, 2021, Pages 3103-3111, ISSN 0888-7543, https://doi.org/10.1016/j.ygeno.2021.06.040. (https://www.sciencedirect.com/science/article/pii/S0888754321002779)



robinjugas/CNproScan documentation built on April 11, 2024, 7:15 p.m.