analyze_sample: Analyze sequence table and categorize sequences

analyze_sample    R Documentation

Analyze sequence table and categorize sequences

Description

Converts a full STR sequence data frame into a per-locus version and adds a Category factor column designating which sequences look like alleles, artifacts, and so on. Unlike analyze_seqs, this stage prepares the data for one specific locus. See the Details section below for the factor levels of the new Category column, and the Functions section below for how the specific variants of this function behave.

Usage

analyze_sample(
  seq_data,
  sample_attrs,
  min_allele_abundance = cfg("min_allele_abundance")
)

analyze_sample_guided(
  seq_data,
  sample_attrs,
  min_allele_abundance = cfg("min_allele_abundance")
)

analyze_sample_naive(
  seq_data,
  sample_attrs,
  min_allele_abundance = cfg("min_allele_abundance")
)

Arguments

seq_data

data frame of processed data for the sample, as produced by analyze_seqs.

sample_attrs

list of sample attributes, such as the rows produced by prepare_dataset. Used to select the locus name to filter on.

min_allele_abundance

numeric threshold for the minimum proportion of counts a given entry must have, compared to the total matching all criteria for that locus, to be considered as a potential allele.
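
As a rough illustration of how a proportion threshold like min_allele_abundance works, the sketch below uses made-up counts and a made-up threshold value. It is not the package's implementation; in particular, whether the real comparison is inclusive (>=) or strict (>) is an assumption here.

```r
# Toy read counts for sequences already filtered to one locus
# (hypothetical data, not produced by analyze_seqs).
counts <- c(600, 350, 40, 10)

# Hypothetical threshold; the package default comes from
# cfg("min_allele_abundance") and is not shown here.
min_allele_abundance <- 0.05

# Fraction of the locus-filtered reads held by each sequence.
fractions <- counts / sum(counts)

# TRUE where a sequence is abundant enough to be a candidate allele.
# Assumption: the comparison is inclusive.
passes <- fractions >= min_allele_abundance

print(data.frame(Count = counts, Fraction = fractions, Passes = passes))
```

With these toy values, only the first two sequences clear the threshold; the other two would fall under the Insignificant category described in the Details section.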

Details

Factor levels in the added Category column, in order:

  • Allele: An identified allele sequence. There will be between zero and two of these.

  • Prominent: Any additional sequences beyond the two called alleles that meet all requirements: they match all locus attributes, do not appear artifactual, and exceed the min_allele_abundance fraction of filtered reads.

  • Insignificant: Sequences with counts below the min_allele_abundance threshold.

  • Ambiguous: Sequences passing the min_allele_abundance threshold but with non-ACTG characters such as N, as defined by the Ambiguous column of seq_data.

  • Stutter: Sequences passing the min_allele_abundance threshold but matching stutter sequence criteria as defined by the Stutter column of seq_data.

  • Artifact: Sequences passing the min_allele_abundance threshold but matching non-stutter artifact sequence criteria as defined by the Artifact column of seq_data.
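
The category rules above can be sketched on a toy data frame. This is a minimal illustration under several assumptions, not the package's code: the Seq and Count column names, the logical encoding of the Ambiguous, Stutter and Artifact columns, the precedence among those three flags, and the "top two by count" allele rule are all simplifications chosen for this example. The helper name categorize_toy is hypothetical.

```r
# Factor levels in the order given in the Details section.
category_levels <- c("Allele", "Prominent", "Insignificant",
                     "Ambiguous", "Stutter", "Artifact")

# Hypothetical helper: assign one category per row of a toy data frame.
categorize_toy <- function(d, min_allele_abundance) {
  frac <- d$Count / sum(d$Count)
  category <- rep(NA_character_, nrow(d))

  # Anything below the abundance threshold is Insignificant.
  category[frac < min_allele_abundance] <- "Insignificant"

  # Flagged sequences that pass the threshold (precedence is assumed).
  category[is.na(category) & d$Ambiguous] <- "Ambiguous"
  category[is.na(category) & d$Stutter] <- "Stutter"
  category[is.na(category) & d$Artifact] <- "Artifact"

  # Of what remains, the two most abundant are called alleles and
  # any others are Prominent (a simplified allele-calling rule).
  remaining <- which(is.na(category))
  remaining <- remaining[order(d$Count[remaining], decreasing = TRUE)]
  category[head(remaining, 2)] <- "Allele"
  category[tail(remaining, -2)] <- "Prominent"

  factor(category, levels = category_levels)
}

# Toy input: seven sequences for one locus (made-up values).
d <- data.frame(
  Seq       = c("A", "B", "C", "D", "E", "F", "G"),
  Count     = c(400, 300, 120, 80, 60, 30, 10),
  Ambiguous = c(FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE),
  Stutter   = c(FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE),
  Artifact  = c(FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE)
)

result <- categorize_toy(d, min_allele_abundance = 0.05)
print(data.frame(d["Seq"], Category = result))
```

Note that sequence F is flagged as ambiguous but falls below the threshold, so in this sketch it ends up as Insignificant rather than Ambiguous, matching the "passing the min_allele_abundance threshold" wording in the list above.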

Value

filtered version of seq_data with added Category column.

Functions

  • analyze_sample(): default version of sample analysis. From here use summarize_sample.

  • analyze_sample_guided(): version of sample analysis guided by expected sequence length values. Additional items ExpectedLength1 and, optionally, ExpectedLength2 can be supplied in the sample_attrs list. If these are NA or missing, the behavior matches analyze_sample. If at least one expected length is given, the stutter/artifact filtering is disabled. If two expected lengths are given, the min_allele_abundance argument is also ignored. From here use summarize_sample_guided.

  • analyze_sample_naive(): version of sample analysis without stutter/artifact filtering. From here use summarize_sample as for analyze_sample.
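
The conditions described for analyze_sample_guided() can be summarized in a small stand-alone sketch. The helper guided_behavior and the names of its return values are hypothetical and exist only to restate the documented rules; the function does not call the package, and the expected length values used below are made up.

```r
# Hypothetical helper: report which behaviors the documented rules
# imply for a given sample_attrs list.
guided_behavior <- function(sample_attrs) {
  # An expected length counts as "given" if present and not NA.
  given <- function(name) {
    value <- sample_attrs[[name]]
    !is.null(value) && !is.na(value)
  }
  n_expected <- given("ExpectedLength1") + given("ExpectedLength2")

  list(
    # No expected lengths: behaves like analyze_sample.
    same_as_default = n_expected == 0,
    # At least one expected length: stutter/artifact filtering is off.
    artifact_filtering = n_expected == 0,
    # Two expected lengths: min_allele_abundance is ignored.
    abundance_threshold_used = n_expected < 2
  )
}

# Made-up attribute lists for illustration.
str(guided_behavior(list(ExpectedLength1 = NA)))
str(guided_behavior(list(ExpectedLength1 = 162)))
str(guided_behavior(list(ExpectedLength1 = 162, ExpectedLength2 = 170)))
```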


ressy/microsat documentation built on Aug. 24, 2023, 10:09 a.m.