find_motifs: Find given motifs
In michbur/tidysq: Tidy Processing and Analysis of Biological Sequences

Description

Finds all given motifs in sequences and returns their positions.

Usage

find_motifs(x, ...)

## S3 method for class 'sq'
find_motifs(x, name, motifs, ..., NA_letter = getOption("tidysq_NA_letter"))

## S3 method for class 'data.frame'
find_motifs(
  x,
  motifs,
  ...,
  .sq = "sq",
  .name = "name",
  NA_letter = getOption("tidysq_NA_letter")
)

Arguments

`x`	[`sq`] An object this function is applied to.
`...`	Further arguments to be passed from or to other methods.
`name`	[`character`] Vector of sequence names. Must be the same length as the `sq` object.
`motifs`	[`character`] Motifs to be searched for.
`NA_letter`	[`character(1)`] A string used to interpret and display `NA` values in the context of the `sq` class. Defaults to "`!`".
`.sq`	[`character(1)`] Name of a column that stores sequences.
`.name`	[`character(1)`] Name of a column that stores names (unique identifiers).

Details

This function searches an sq object for one or more given motifs. It returns every match found, together with its start and end positions within the sequence.

Value

A tibble with the following columns:

`name`	name of the sequence in which a motif was found
`sought`	sought motif
`found`	found subsequence; may differ from the sought motif if the motif contains ambiguous letters
`start`	position of the first element of the found motif
`end`	position of the last element of the found motif

Motif capabilities and restrictions

A motif does not have to be a literal string representation of the sought subsequence. When this function is used with any of the standard types (ami, dna or rna), a motif can contain ambiguous letters, and the engine will try to match any of the possible meanings of such a letter. For example, take "B" from the extended DNA alphabet. It means "not A", so it can be matched with "C", "G" and "T", but also with "B" itself, "Y" (either "C" or "T"), "K" (either "G" or "T") and "S" (either "C" or "G").

The full list of ambiguous letters and their meanings can be found on the IUPAC site.
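
The ambiguity rule above can be illustrated without tidysq at all. The following is a minimal base R sketch, not the package's engine: it assumes that a motif letter matches a sequence letter whenever every base the sequence letter stands for is also allowed by the motif letter. The `iupac_dna` table and the `matching_letters()` helper are hypothetical names introduced only for this illustration.

```r
# IUPAC nucleotide codes and the set of bases each one stands for.
iupac_dna <- list(
  A = "A", C = "C", G = "G", T = "T",
  R = c("A", "G"), Y = c("C", "T"),
  S = c("C", "G"), W = c("A", "T"),
  K = c("G", "T"), M = c("A", "C"),
  B = c("C", "G", "T"), D = c("A", "G", "T"),
  H = c("A", "C", "T"), V = c("A", "C", "G"),
  N = c("A", "C", "G", "T")
)

# Hypothetical helper (not part of tidysq): returns the sequence letters
# whose meaning is fully covered by the meaning of `motif_letter`.
matching_letters <- function(motif_letter, codes = iupac_dna) {
  allowed <- codes[[motif_letter]]
  covered <- vapply(codes, function(bases) all(bases %in% allowed), logical(1))
  names(codes)[covered]
}

matching_letters("B")
# "C" "G" "T" "Y" "S" "K" "B"
```

Under this assumption the sketch reproduces the list given for "B" above, and an unambiguous letter such as "A" matches only itself.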

Motifs are also restricted: the alphabet of an sq object on which a search is conducted cannot contain the "^" and "$" symbols. These two have a special meaning in motifs: "^" indicates the beginning of a sequence and "$" indicates its end, so they can be used to limit the position of matched subsequences.
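
The effect of the two anchors is analogous to "^" and "$" in regular expressions. The sketch below uses base R `regexpr()` on plain character strings purely as an analogy. It is not how tidysq searches `sq` objects, it reports only the first hit per sequence, and it does not implement the ambiguous-letter matching described above. The `anchored_hits()` helper and the sequences are hypothetical and exist only for this illustration.

```r
# Plain strings over the basic DNA alphabet (illustrative data).
seqs <- c(sq1 = "ATGCAGGA", sq2 = "GATTACA", sq3 = "TGACGAGCTTAG")

# Hypothetical helper: first regex hit per sequence, laid out like the
# name/found/start/end columns returned by find_motifs().
anchored_hits <- function(seqs, pattern) {
  m <- regexpr(pattern, seqs)
  pos <- as.integer(m)              # start of the match, -1 if none
  len <- attr(m, "match.length")    # length of the match
  hit <- pos > 0L
  start <- pos[hit]
  end <- start + len[hit] - 1L
  data.frame(
    name = names(seqs)[hit],
    found = unname(substring(seqs[hit], start, end)),
    start = start,
    end = end,
    stringsAsFactors = FALSE
  )
}

# "C" at the fourth position: the regex counterpart of the "^NNNC" motif.
at_start <- anchored_hits(seqs, "^[ACGT]{3}C")

# "A" at the second-to-last position: the counterpart of an "AN$" motif.
at_end <- anchored_hits(seqs, "A[ACGT]$")
```

Here `at_start` contains hits for `sq1` and `sq3`, both spanning positions 1 to 4, while `at_end` contains a single hit in `sq3` spanning positions 11 to 12.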

Examples

# Creating objects to work on:
sq_dna <- sq(c("ATGCAGGA", "GACCGNBAACGAN", "TGACGAGCTTAG"),
             alphabet = "dna_bsc")
sq_ami <- sq(c("AGNTYIKFGGAYTI", "MATEGILIAADGYTWIL", "MIPADHICAANGIENAGIK"),
             alphabet = "ami_bsc")
sq_atp <- sq(c("mAmYmY", "nbAnsAmA", ""),
             alphabet = c("mA", "mY", "nbA", "nsA"))
sq_names <- c("sq1", "sq2", "sq3")

# Finding motif of two alanines followed by aspartic acid or asparagine
# ("AAB" motif matches "AAB", "AAD" and "AAN"):
find_motifs(sq_ami, sq_names, "AAB")

# Finding "C" at fourth position:
find_motifs(sq_dna, sq_names, "^NNNC")

# Finding motif "I" at second-to-last position:
find_motifs(sq_ami, sq_names, "IX$")

# Finding multiple motifs:
find_motifs(sq_dna, sq_names, c("^ABN", "ANCBY", "BAN$"))

# Finding multicharacter motifs:
find_motifs(sq_atp, sq_names, c("nsA", "mYmY$"))

# find_motifs() can also be part of a tidyverse pipeline:
library(dplyr)
fasta_file <- system.file(package = "tidysq", "examples/example_aa.fasta")
read_fasta(fasta_file) %>%
  mutate(name = toupper(name)) %>%
  find_motifs("TXG")

michbur/tidysq documentation built on Jan. 2, 2025, 10:41 p.m.