analyze_seqs: Analyze a set of STR sequences

View source: R/analyze_seqs.R

analyze_seqs    R Documentation

Analyze a set of STR sequences

Description

Dereplicates the given sequences and annotates any STR sequences found, returning the processed data as a data frame with one row per unique sequence, sorted by count. At this stage no information is filtered out, and all loci are treated equally.
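As a rough illustration of what dereplication means here, the following base R sketch collapses a character vector into one row per unique sequence with a count and a length. It is not the package's implementation: the input vector is made up, and ordering by decreasing count is an assumption about what "sorted by count" means.

```r
# Hypothetical input; not part of the package's test data.
seqs <- c("ACGTACGT", "ACGTACGT", "TTGACA", "ACGTACGT", "TTGACA", "GGGCCC")

# Count occurrences of each distinct sequence.
counts <- table(seqs)

# Build one row per unique sequence.
derep <- data.frame(
  Seq = names(counts),
  Count = as.integer(counts),
  stringsAsFactors = FALSE
)
derep$Length <- nchar(derep$Seq)

# Assumption: most frequent sequence first.
derep <- derep[order(derep$Count, decreasing = TRUE), ]
rownames(derep) <- NULL
```

The resulting columns mirror the Seq, Count, and Length columns described under Details; the annotation columns are not reproduced in this sketch.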

Usage

analyze_seqs(
  seqs,
  locus_attrs,
  nrepeats = cfg("min_motif_repeats"),
  max_stutter_ratio = cfg("max_stutter_ratio"),
  artifact.count.ratio_max = cfg("max_artifact_ratio"),
  ...
)

Arguments

seqs

character vector containing sequences.

locus_attrs

data frame of attributes for loci to look for.

nrepeats

number of repeats of each locus' motif to require for a match.

max_stutter_ratio

highest ratio of read counts, for the second most frequent sequence relative to the most frequent, at which the second will still be considered stutter.

artifact.count.ratio_max

as for max_stutter_ratio but for non-stutter artifact sequences.

...

additional arguments for make_read_primer_table
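To make the two ratio arguments concrete, here is a small sketch of the count-ratio comparison alone. The counts and the threshold are hypothetical (the threshold is not the package default), the use of `<=` rather than `<` is an assumption, and any other criteria the package applies when flagging stutter or artifacts are not shown.

```r
# Hypothetical read counts, chosen only for illustration.
count_top <- 1000L     # most frequent sequence
count_second <- 250L   # second most frequent sequence

# Hypothetical threshold; not the value returned by cfg("max_stutter_ratio").
threshold <- 1 / 3

# Ratio of the less frequent sequence to the more frequent one.
ratio <- count_second / count_top

# Assumption: a ratio at or below the threshold passes the ratio test.
is_candidate <- ratio <= threshold
```

With these numbers the ratio is 0.25, so the second sequence would pass the ratio test under the assumed comparison.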

Details

Columns in the returned data frame:

  • Seq: sequence text for each unique sequence

  • Count: integer count of occurrences of this exact sequence

  • Length: integer sequence length

  • MatchingLocus: factor for the name of the locus matching each sequence, determined by checking the primer

  • MotifMatch: logical: are there at least nrepeats perfect adjacent repeats of the STR motif for the matching locus?

  • LengthMatch: logical: is the sequence length within the expected range for the matching locus?

  • Ambiguous: logical: are there unexpected characters in the sequence content?

  • Stutter: integer: for any sequence that looks like potential PCR stutter, the index of the row that may be the source of the stutter band.

  • Artifact: integer: for any sequence that looks like potential PCR artifact (other than stutter), the index of the row that may be the source of the artifact.

  • FractionOfTotal: numeric fraction of the number of sequences represented by each unique sequence compared to the total.

  • FractionOfLocus: numeric fraction of the number of sequences represented by each unique sequence compared to the total for that particular matching locus.
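A few of the columns above can be approximated in base R, which may help when reading the output. This is only a sketch under stated assumptions: the sequences, locus names ("A", "B"), and motifs are invented, and the package may compute these columns differently.

```r
# Hypothetical dereplicated data; loci and motifs are made up.
dat <- data.frame(
  Seq = c("GGTAGATAGATAGACC", "GGTAGATAGACC", "CCACACACACTT"),
  Count = c(60L, 20L, 20L),
  MatchingLocus = c("A", "A", "B"),
  stringsAsFactors = FALSE
)
motifs <- c(A = "TAGA", B = "AC")
nrepeats <- 3

# MotifMatch: does the sequence contain nrepeats back-to-back copies
# of the motif for its matching locus?
dat$MotifMatch <- mapply(
  function(seq, motif) grepl(strrep(motif, nrepeats), seq, fixed = TRUE),
  dat$Seq, motifs[dat$MatchingLocus],
  USE.NAMES = FALSE
)

# FractionOfTotal: share of all reads represented by each unique sequence.
dat$FractionOfTotal <- dat$Count / sum(dat$Count)

# FractionOfLocus: share of reads within the same matching locus.
dat$FractionOfLocus <- dat$Count / ave(dat$Count, dat$MatchingLocus, FUN = sum)
```

In this toy data the second sequence has only two adjacent TAGA repeats, so its MotifMatch is FALSE even though it is assigned to locus A.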

Value

data frame of dereplicated sequences with added annotations.

Examples

# Starting from non-locus-specific sequences,
# a locus attributes table, and requiring
# three side-by-side motif repeats to register
# as a motif match for a locus,
raw_seq_vector <- c(test_data$seqs1$A, test_data$seqs1$B)
locus_attrs <- test_data$locus_attrs
num_adjacent_repeats <- 3
# Convert the character vector of sequences
# into a data frame with one row per
# unique sequence.
seq_data <- analyze_seqs(raw_seq_vector,
                         locus_attrs,
                         num_adjacent_repeats)

ShawHahnLab/microsat documentation built on Aug. 25, 2023, 11:16 p.m.