summarize_quality_scores: Summarize Quality Scores


summarize_quality_scores    R Documentation

Summarize Quality Scores

Description

For each base pair position, summarizes read length, Phred quality score, and the cumulative probability that all bases were called correctly.

Usage

summarize_quality_scores(
  forward_files,
  reverse_files,
  n.total = 10000,
  n.each = ceiling(n.total/length(forward_files)),
  seed = NULL,
  FUN = mean,
  ...
)

Arguments

forward_files

A character vector of file paths to FASTQ files containing forward DNA sequence reads.

reverse_files

A character vector of file paths to FASTQ files containing reverse DNA sequence reads.

n.total

Numeric. The number of read pairs to randomly sample from the input FASTQ files. Ignored if n.each is specified. The default is 10000.

n.each

Numeric. The number of read pairs to randomly sample from each pair of input FASTQ files. The default is ceiling(n.total/length(forward_files)).

seed

Numeric. The seed for randomly sampling read pairs. If NULL (the default), then a random seed is used.

FUN

A function to compute summary statistics of the quality scores. The default is mean.

...

Additional arguments passed to FUN.
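The default for n.each can be checked by hand. The sketch below only reproduces the arithmetic from the Usage section; the file names are placeholders and no files are read.

```r
# Hypothetical inputs: three pairs of FASTQ files (placeholder paths).
forward_files <- c("S01F.fastq", "S02F.fastq", "S03F.fastq")
n.total <- 10000

# Default per-pair sample size, as written in the Usage section.
n.each <- ceiling(n.total / length(forward_files))
n.each
# 3334

# Because of the rounding up, the total number of sampled read pairs
# can slightly exceed n.total (it is at most this value).
n.each * length(forward_files)
# 10002
```

Since sampling is described as drawing up to n.each read pairs per file pair, files with fewer reads than n.each would contribute fewer read pairs than this upper bound.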

Details

For each combination of base pair position and read direction, calculates summary statistics of read length, Phred quality score, and the cumulative probability that all bases were called correctly. The cumulative probability is calculated from the first base pair up to the current position. Quality scores are assumed to be encoded in Sanger format (Phred+33).

Read pairs are selected by randomly sampling up to n.each read pairs from each pair of input FASTQ files. By default, n.each is derived from n.total; if n.each is provided, n.total is ignored.

By default, mean is used to compute the summary statistics, but another summary function (e.g., median) may be supplied through FUN. Functions that return multiple summary statistics (e.g., summary and quantile) are also supported. Arguments in ... are passed to the summary function.
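As a rough illustration of the quantities described above, the following base R sketch decodes a single Sanger-encoded quality string and computes the cumulative probability of error-free base calls. It is not the package's implementation: the quality string is made up, the per-base probability uses the standard Phred definition 1 - 10^(-Q/10), and base-calling errors are assumed to be independent.

```r
# Hypothetical quality string for a single read (Sanger / Phred+33 encoding).
quality_string <- "IIIIH?5+"

# Decode: the ASCII code of each character minus 33 gives the Phred score.
phred <- utf8ToInt(quality_string) - 33L

# Per-base probability of a correct call under the standard Phred definition.
p_correct <- 1 - 10^(-phred / 10)

# Cumulative probability that all bases from position 1 up to each
# position were called correctly (assumes independent errors).
p_cumulative <- cumprod(p_correct)

data.frame(
  Position = seq_along(phred),
  Score = phred,
  Probability = p_cumulative
)
```

The cumulative probability can never increase with position, which is why it is useful for choosing where to truncate reads.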

Value

Returns a data frame containing summary statistics of read length, Phred quality score, and cumulative probability at each base pair position for each read direction. The returned data frame contains the following fields:

  • Direction: The read direction (i.e., "Forward" or "Reverse").

  • Position: The base pair position.

  • Length: The summary statistic(s) of read lengths. If FUN returns multiple summary statistics, then a matrix of the summary statistics will be stored in this field, which can be accessed with $Length.

  • Score: The summary statistic(s) of Phred quality scores. If FUN returns multiple summary statistics, then a matrix of the summary statistics will be stored in this field, which can be accessed with $Score.

  • Probability: The summary statistic(s) of the cumulative probability that all bases were called correctly. If FUN returns multiple summary statistics, then a matrix of the summary statistics will be stored in this field, which can be accessed with $Probability.
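Matrix-valued fields in a data frame can be unfamiliar. The sketch below uses toy data and base R's aggregate(), which also stores multi-valued summaries as a matrix column, to show how such a field is indexed. The data and the use of aggregate() are illustrative assumptions, not a description of how summarize_quality_scores builds its output.

```r
# Toy data (hypothetical): Phred scores for four reads at two positions.
toy <- data.frame(
  Position = rep(1:2, each = 4),
  Score = c(40, 38, 36, 30, 30, 25, 20, 10)
)

# A summary function returning several values yields a matrix column.
summ <- aggregate(Score ~ Position, data = toy,
                  FUN = quantile, probs = c(0.25, 0.5, 0.75))

is.matrix(summ$Score)   # TRUE
colnames(summ$Score)    # "25%" "50%" "75%"

# Select one statistic by column name: the median score per position.
summ$Score[, "50%"]
```

The same indexing pattern (field first, then a matrix column) applies to any field that holds a matrix of summary statistics.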

See Also

decode_quality_scores for decoding quality scores.

Examples

# Get example forward FASTQ files.
forward_files<-system.file("extdata",
                           paste0("S0",1:3,"F.fastq"),
                           package="LocaTT",
                           mustWork=TRUE)

# Get example reverse FASTQ files.
reverse_files<-system.file("extdata",
                           paste0("S0",1:3,"R.fastq"),
                           package="LocaTT",
                           mustWork=TRUE)

# Summarize quality scores.
summarize_quality_scores(forward_files,reverse_files)
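The optional arguments can be combined with the example files above. The following sketch assumes LocaTT is installed and relies only on the arguments documented on this page; the seed value and the probs vector are arbitrary choices for illustration.

```r
# Run only if the LocaTT package is available.
if (requireNamespace("LocaTT", quietly = TRUE)) {
  library(LocaTT)

  # Example FASTQ files shipped with the package.
  forward_files <- system.file("extdata", paste0("S0", 1:3, "F.fastq"),
                               package = "LocaTT", mustWork = TRUE)
  reverse_files <- system.file("extdata", paste0("S0", 1:3, "R.fastq"),
                               package = "LocaTT", mustWork = TRUE)

  # Median instead of mean, with a fixed seed for reproducible sampling.
  medians <- summarize_quality_scores(forward_files, reverse_files,
                                      seed = 42, FUN = median)

  # Quartiles: probs is passed through ... to quantile().
  quartiles <- summarize_quality_scores(forward_files, reverse_files,
                                        seed = 42, FUN = quantile,
                                        probs = c(0.25, 0.5, 0.75))

  # Each field now holds a matrix with one column per quantile.
  head(quartiles$Score)
}
```

Fixing seed makes repeated calls sample the same read pairs, which keeps summaries comparable when only FUN changes.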

LocaTT documentation built on June 14, 2026, 1:06 a.m.