estGcDistn: Estimate a GC Content Distribution From Sequences
In steveped/ngsReports: Load FastqQC reports and other NGS related files

estGcDistn

R Documentation

Description

Generate a GC content distribution from sequences for a given read length and fragment length.

Usage

estGcDistn(x, n = 1e+06, rl = 100, fl = 200, fragSd = 30, bins = 101, ...)

## S4 method for signature 'ANY'
estGcDistn(x, n = 1e+06, rl = 100, fl = 200, fragSd = 30, bins = 101, ...)

## S4 method for signature 'character'
estGcDistn(x, n = 1e+06, rl = 100, fl = 200, fragSd = 30, bins = 101, ...)

## S4 method for signature 'DNAStringSet'
estGcDistn(x, n = 1e+06, rl = 100, fl = 200, fragSd = 30, bins = 101, ...)

Arguments

`x`	`DNAStringSet` or path to a fasta file
`n`	The number of reads to sample
`rl`	Read Lengths to sample
`fl`	The mean of the fragment lengths sequenced
`fragSd`	The standard deviation of the fragment lengths being sequenced
`bins`	The number of bins to estimate
`...`	Not used

Details

The function takes the supplied object and returns the theoretical GC content distribution. Because a fixed read length yields only a discrete set of possible GC proportions, the `bins` argument defines the number of bins returned. This defaults to 101, covering 0 to 100% inclusive.

The returned values are obtained by interpolating the values obtained during sampling. This avoids gaps and jumps in the returned distribution, which would otherwise appear when the read length (`rl`) is not a multiple of 100.
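
The sampling and interpolation idea can be illustrated with a minimal base R sketch. This is not the package's actual implementation: the random reference sequence, the variable names (`ref`, `starts`, `gcDist`) and the use of `approx()` with renormalisation are assumptions made for illustration, and fragment length sampling (`fl`, `fragSd`) is ignored entirely.

```r
set.seed(1)

# Assumption: a random 10 kb sequence stands in for a real reference
ref <- sample(c("A", "C", "G", "T"), 1e4, replace = TRUE)

rl   <- 75    # read length, deliberately not a multiple of 100
n    <- 2000  # number of reads to sample
bins <- 101   # 0% to 100% inclusive

# Sample read start positions uniformly along the sequence
starts <- sample.int(length(ref) - rl + 1, n, replace = TRUE)

# GC count per read via cumulative sums: cs[i + 1] is the GC count of ref[1:i]
isGC    <- ref %in% c("G", "C")
cs      <- c(0, cumsum(isGC))
gcCount <- cs[starts + rl] - cs[starts]

# Observed distribution: only rl + 1 distinct GC proportions are possible
obs   <- tabulate(gcCount + 1, nbins = rl + 1) / n
gcObs <- (0:rl) / rl

# Linearly interpolate onto an evenly spaced grid of `bins` points,
# then rescale so the frequencies sum to 1
grid <- seq(0, 1, length.out = bins)
freq <- approx(x = gcObs, y = obs, xout = grid)$y
freq <- freq / sum(freq)

gcDist <- data.frame(GC_Content = grid, Freq = freq)
head(gcDist)
```

With `rl = 75` there are only 76 possible GC proportions, so placing the raw counts directly into 101 percentage bins would leave some bins empty. Interpolating onto the grid gives a value for every bin.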

Based heavily on https://github.com/mikelove/fastqcTheoreticalGC

Value

A tibble with two columns: GC_Content and Freq, denoting the proportion of GC and the frequency of occurrence, respectively.

Examples

faDir <- system.file("extdata", package = "ngsReports")
faFile <- list.files(faDir, pattern = "fasta", full.names = TRUE)
df <- estGcDistn(faFile, n = 200)

steveped/ngsReports documentation built on June 13, 2025, 7:15 a.m.