ReQON: Recalibrating Quality Of Nucleotides

Description Usage Arguments Details Value Note Author(s) Examples

View source: R/ReQON.R

Description

Recalibrate the nucleotide quality scores of either single-end or paired-end next-generation sequencing data that has been aligned.

Usage

1
2
ReQON(in_bam, out_bam, region, max_train = -1, SNP = "", 
   RefSeq = "", nerr = 2, nraf = 0.05, plotname = "", temp_files = 0)

Arguments

in_bam

file name of sorted BAM file of single-end or paired-end aligned sequencing data. The corresponding index file (.bai file) must be located in the same directory.

out_bam

file name for output BAM file with original quality scores replaced with recalibrated quality scores.

region

training region for recalibration, as “chromosome:start-end”. Cannot span more than one chromosome. See note.

Example: "chr1:1-10000".

max_train

maximum number of nucleotides to include in training region. Useful if you want to train on e.g. the first 5 million bases of chromosome 10. Default = -1 (use all nucleotides from training region).

SNP

file of SNP locations to remove from training set before recalibration. Text or Rdata file (with variable name “snp”) with no header and two columns: [1] chromosome, [2] position. See note.

Default: do not remove any nucleotides from training set.

RefSeq

file of reference sequence for training set to identify sequencing errors (i.e, nucleotide is error if it does not match RefSeq). Text or Rdata file (with variable name “ref”) with no header and three columns: [1] chromosome, [2] position, [3] reference nucleotide (A,C,G,T). See note.

Default: errors are nucleotides not matching major allele(s) for coverage > 2, removing all nucleotides at positions with coverage of 2 or less.

nerr

maximum number of errors tolerated at a genomic position. Positions with more than “nerr” errors may likely be true variants, so bases from these positions are removed from the training set. Default = 2.

nraf

maximum non-reference allele frequency at a genomic position that is allowed. Positions with non-reference allele frequency greater than “nraf” are removed from the training set for the same reason as above. Default = 0.05.

plotname

file name for saving recalibration plots in pdf. If not specified, plots will not be produced.

temp_files

option for keeping temporary files.

0: (default) remove all temporary files.

1: keep temporary files in working directory.

Details

ReQON uses logistic regression to recalibrate the nucleotide quality scores of a sorted BAM file. The BAM file contains either single-end or paired-end next-generation sequencing data that has been aligned using any alignment tool. For help with sorting and indexing BAM files in R, see Rsamtools.

ReQON also has the option to output diagnostic plots which show the effectiveness of the recalibration on the training set.

For a detailed description of usage, output and images, see the vignette by: browseVignettes("ReQON").

ReQON utilizes various java tools provided by Picard. For more information on Picard, see http://picard.sourceforge.net

Value

ReQON returns a BAM file, replacing the original quality scores with the recalibrated quality scores in the QUAL field.

ReQON also outputs a data object of diagnostic data from the training set that is plotted in the output diagnostic plots. The object variables are:

$ReadPosErrors

vector of error counts by read position.

$QualFreqBefore

relative frequency of quality scores before recalibration. The first element in the vector corresponds to a quality score of zero.

$QualFreqAfter

relative frequency of quality scores after recalibration. The first element in the vector corresponds to a quality score of zero.

$ErrRatesBefore

vector of empirical error rates before recalibration, reported on the Phred scale. The first element in the vector corresponds to a quality score of zero.

$ErrRatesAfter

vector of empirical error rates after recalibration, reported on the Phred scale. The first element in the vector corresponds to a quality score of zero.

$FWSE

vector of Frequency-Weighted Squared Error (FWSE) values. The first element is FWSE before recalibration and the second element is FWSE after recalibration.

$FlagPos

vector of high-error read positions (above dashed cyan line in top left output plot). Each of these positions receives an indicator variable in the model.

$coeff

vector of regression coefficient obtained from training set used to recalibrate entire BAM file.

Note

Be aware of how the chromosomes are referenced when specifying the training region. For example, one BAM file may require specifying “10:1-2000” while another may need “chr10:1-2000”.

If providing SNP or RefSeq files, computations will speed up if your file only covers the positions in the training region. For example, if you set region = “chr10:1-2000”, then we recommend only having rows corresponding to chr10:1-2000 in the RefSeq/SNP file.

Author(s)

Christopher Cabanski ccabansk@genome.wustl.edu

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
## Read in sample data from seqbias package
library( ReQON )
library( seqbias )
library( Rsamtools )
ref_fn <- system.file( "extra/example.fa", package = "seqbias" )
ref_f <- FaFile( ref_fn )
open.FaFile( ref_f )
reads_fn <- system.file( "extra/example.bam", package = "seqbias" )

## Set up file of reference sequence
seqs <- scanFa( ref_f )
len <- length( seqs[[1]] )
ref <- matrix( nrow = len, ncol = 3 )
ref[,1] <- rep( "seq1", len )
ref[,2] <- c( 1:len )
str <- toString( subseq( seqs[[1]], 1, len ) )
s <- strsplit( str, NULL )
ref[,3] <- s[[1]]
write.table( ref, file = "ref_seq.txt", sep = "\t", quote = FALSE,
   row.names = FALSE, col.names = FALSE )

## Recalibrate File
sorted <- sortBam( reads_fn, tempfile() )
indexBam( sorted )
reg <- paste( "seq1:1-", len, sep = "" )
diagnostics <- ReQON( sorted, "Recalibrated_example.bam", reg, 
   RefSeq = "ref_seq.txt", nerr = 20, nraf = 0.25, 
   plotname = "Recalibrated_example_plots.pdf" )

#Remove temporary file
unlink( "ref_seq.txt" )

ReQON documentation built on Nov. 8, 2020, 6:53 p.m.