filter_supporting_reads: Filter samples by the minimum supporting reads for alleles.

View source: R/filter_supporting_reads.R

filter_supporting_reads    R Documentation

Filter samples by the minimum supporting reads for alleles.

Description

This function can be thought of as a minor allele frequency or count filter applied at the level of the sample (instead of the whole dataset). It marks a sample as having insufficient supporting reads when the read count for its less-covered allele falls below a given threshold. This might be useful, for example, when using pooled allele frequencies, or when genotyped individuals are sequenced at low-to-moderate coverage.

Usage

filter_supporting_reads(
  dat,
  sampCol = "SAMPLE",
  locusCol = "LOCUS",
  dpCol = "DP",
  aoCol = "AO",
  suppReads = 3
)

Arguments

dat

Data.table: Contains the sample and locus information, the total depth of coverage, and the read count of the alternate allele. The reference allele read count is assumed to be the total read depth minus the alternate allele read count. Must contain the columns:

  1. The sample ID (see param sampCol).

  2. The locus ID (see param locusCol).

  3. The total read depth (see param dpCol).

  4. The alternate allele read counts (see param aoCol).

sampCol

Character: The column with the sample information. Default = 'SAMPLE'.

locusCol

Character: The column with the locus information. Default = 'LOCUS'.

dpCol

Character: The column with the total read depth information. Default = 'DP'.

aoCol

Character: The column with the alternate allele read count information. Default = 'AO'.

suppReads

Integer: The minimum number of supporting reads for the allele that is least well covered by reads within a sample. Default = 3.

Details

Note, this function will only evaluate sites for which there are reads supporting both alleles. It will not evaluate sites that only have reads for the reference allele, or only have reads for the alternate allele.
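
The rule described above can be sketched on a toy table. This is an illustration only, not the package's source code. It assumes that the reference read count is the total depth minus the alternate read count, and that sites which are not evaluated (reads for only one allele) are marked as kept. The table toy, its values, and the helper columns RO and BOTH are made up for the example.

```r
library(data.table)

# Toy input using the default column names; the values are invented.
toy <- data.table(
  SAMPLE = c("s1", "s1", "s2", "s2"),
  LOCUS  = c("loc1", "loc2", "loc1", "loc2"),
  DP     = c(20L, 15L, 12L, 30L),
  AO     = c(2L, 0L, 6L, 30L)
)
suppReads <- 3L

# Assumption: reference reads = total depth - alternate reads.
toy[, RO := DP - AO]

# Only sites with reads supporting both alleles are evaluated.
toy[, BOTH := AO > 0 & RO > 0]

# Assumption: unevaluated sites are kept. Evaluated sites are kept only
# when the less-covered allele has at least suppReads reads.
toy[, KEEP := !BOTH | pmin(AO, RO) >= suppReads]

toy[, .(SAMPLE, LOCUS, KEEP)]
```

In this toy table only the s1 + loc1 observation fails, because its alternate allele is supported by 2 reads while both alleles are present.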

Value

Returns a data.table with the columns $SAMPLE and $LOCUS (the sample and locus information) and $KEEP, a logical column with TRUE or FALSE indicating whether a sample + locus observation should be kept based on uncertainty in the supporting reads. All sample + locus observations are returned, so that they match dat. This facilitates merging of the original data and results.
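
As a rough sketch of that merge, the returned table can be joined back onto the original data by sample and locus. The objects dat and res below are hypothetical stand-ins with invented values: res mimics the shape of the returned table rather than being actual output from filter_supporting_reads.

```r
library(data.table)

# Hypothetical stand-in for the original data; values are invented.
dat <- data.table(
  SAMPLE = c("s1", "s1", "s2", "s2"),
  LOCUS  = c("loc1", "loc2", "loc1", "loc2"),
  DP     = c(20L, 15L, 12L, 30L),
  AO     = c(2L, 0L, 6L, 30L)
)

# Hypothetical stand-in for the returned table: one row per
# sample + locus observation, with the logical KEEP column.
res <- data.table(
  SAMPLE = c("s1", "s1", "s2", "s2"),
  LOCUS  = c("loc1", "loc2", "loc1", "loc2"),
  KEEP   = c(FALSE, TRUE, TRUE, TRUE)
)

# Join on sample and locus, then retain the observations that passed.
merged <- merge(dat, res, by = c("SAMPLE", "LOCUS"))
passed <- merged[KEEP == TRUE]

passed
```

Filtering on the merged table removes individual sample + locus observations, which is a gentler alternative to dropping an entire locus.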

Examples

library(genomalicious)
data(data_Genos)

# Take a look at the read distribution for alternate alleles
hist(data_Genos$AO, xlab='Alt allele read counts', main='')

# Let's find those sample + loci observations where there are not
# at least 5 reads supporting each allele
suppTest <- filter_supporting_reads(data_Genos, suppReads=5)

head(suppTest)

suppTest[KEEP==FALSE]

# You could use this information to filter loci. For example, removing
# a locus if any sample does not meet the supporting read threshold for
# both alleles.
uniq_bad_loci <- unique(suppTest[KEEP==FALSE]$LOCUS)

uniq_bad_loci

data_Genos[!LOCUS %in% uniq_bad_loci]


j-a-thia/genomalicious documentation built on Oct. 19, 2024, 7:51 p.m.