find_motifs: Find given motifs

View source: R/find_motifs.R

find_motifs R Documentation

Find given motifs

Description

Finds all given motifs in sequences and returns their positions.

Usage

find_motifs(x, ...)

## S3 method for class 'sq'
find_motifs(x, name, motifs, ..., NA_letter = getOption("tidysq_NA_letter"))

## S3 method for class 'data.frame'
find_motifs(
  x,
  motifs,
  ...,
  .sq = "sq",
  .name = "name",
  NA_letter = getOption("tidysq_NA_letter")
)

Arguments

x

[sq or data.frame]
An object this function is applied to.

...

Further arguments passed to or from other methods.

name

[character]
Vector of sequence names. Must be of the same length as the sq object.

motifs

[character]
Motifs to be searched for.

NA_letter

[character(1)]
A string used to interpret and display NA values in the context of the sq class. Defaults to "!".

.sq

[character(1)]
Name of a column that stores sequences.

.name

[character(1)]
Name of a column that stores names (unique identifiers).

Details

This function searches the sequences of an sq object for one or more given motifs. It returns every match found, together with its start and end positions within the sequence.

Value

A tibble with the following columns:

name

name of the sequence in which a motif was found

sought

sought motif

found

found subsequence; it may differ from the sought motif if the motif contained ambiguous letters

start

position of the first element of the found motif

end

position of the last element of the found motif

Motif capabilities and restrictions

A motif does not have to be a literal string representation of the sought subsequence. When this function is used with any of the standard types, i.e. ami, dna or rna, a motif may contain ambiguous letters, and the engine then tries to match every possible meaning of each such letter. For example, take "B" from the extended DNA alphabet. It means "not A", so it matches "C", "G" and "T", but also "B" itself, "Y" (either "C" or "T"), "K" (either "G" or "T") and "S" (either "C" or "G").

The full list of ambiguous letters with their meanings can be found on the IUPAC site.
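
The "B" example can be reproduced with a few lines of base R. This is only a minimal sketch of one rule that is consistent with the list given for "B", not the package's actual implementation: the lookup table iupac_dna and the helper letter_matches() are hypothetical names introduced here, and the table covers just a handful of letters from the extended DNA alphabet.

```r
# Hypothetical lookup table (not part of tidysq): each letter maps to the
# set of basic bases it can stand for. Only a few letters are included.
iupac_dna <- list(
  A = "A", C = "C", G = "G", T = "T",
  Y = c("C", "T"),
  K = c("G", "T"),
  S = c("C", "G"),
  B = c("C", "G", "T")
)

# Assumed rule: a motif letter matches a sequence letter when every
# meaning of the sequence letter is also a meaning of the motif letter.
letter_matches <- function(motif_letter, seq_letter) {
  all(iupac_dna[[seq_letter]] %in% iupac_dna[[motif_letter]])
}

# Which letters of the table can the motif letter "B" be matched with?
is_match <- vapply(names(iupac_dna),
                   function(l) letter_matches("B", l),
                   logical(1))
matched <- names(iupac_dna)[is_match]
print(matched)
```

Under this rule "A" is the only letter of the table that "B" does not match, which agrees with the "not A" reading.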

Motifs are also restricted: the alphabets of the sq objects being searched cannot contain the "^" and "$" symbols. These two have a special meaning in motifs. They mark the beginning and the end of a sequence respectively and can be used to restrict where a subsequence may match.
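
The two anchors behave much like their regular-expression counterparts. As a rough analogy in base R (this is not tidysq code: match_positions() is a hypothetical helper, the input is a plain character string rather than an sq object, and "." is the regular-expression wildcard standing in for the motif letter "N"), start and end positions could be computed like this:

```r
# Hypothetical helper (not part of tidysq): start and end positions of all
# matches of a regular expression in a single character string.
# Note: gregexpr() reports non-overlapping matches only.
match_positions <- function(sequence, pattern) {
  m <- gregexpr(pattern, sequence, perl = TRUE)[[1]]
  if (m[1] == -1) {
    # No match: return an empty table with the same columns.
    return(data.frame(start = integer(0), end = integer(0)))
  }
  starts <- as.integer(m)
  data.frame(start = starts,
             end = starts + attr(m, "match.length") - 1L)
}

s <- "TGACGAGCTTAG"

unanchored <- match_positions(s, "AG")     # every occurrence of "AG"
at_end     <- match_positions(s, "AG$")    # only "AG" ending the sequence
fourth_c   <- match_positions(s, "^...C")  # "C" at the fourth position

print(unanchored)
print(at_end)
print(fourth_c)
```

Without an anchor both occurrences of "AG" are reported; with "$" only the one that ends the sequence remains, and "^" pins the match to the first position.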

See Also

Functions interpreting sq in biological context: %has%(), complement(), translate()

Examples

# Creating objects to work on:
# (the extended DNA alphabet is used because the sequences contain "N" and "B")
sq_dna <- sq(c("ATGCAGGA", "GACCGNBAACGAN", "TGACGAGCTTAG"),
             alphabet = "dna_ext")
sq_ami <- sq(c("AGNTYIKFGGAYTI", "MATEGILIAADGYTWIL", "MIPADHICAANGIENAGIK"),
             alphabet = "ami_bsc")
sq_atp <- sq(c("mAmYmY", "nbAnsAmA", ""),
             alphabet = c("mA", "mY", "nbA", "nsA"))
sq_names <- c("sq1", "sq2", "sq3")

# Finding motif of two alanines followed by aspartic acid or asparagine
# ("AAB" motif matches "AAB", "AAD" and "AAN"):
find_motifs(sq_ami, sq_names, "AAB")

# Finding "C" at fourth position:
find_motifs(sq_dna, sq_names, "^NNNC")

# Finding motif "I" at second-to-last position:
find_motifs(sq_ami, sq_names, "IX$")

# Finding multiple motifs:
find_motifs(sq_dna, sq_names, c("^ABN", "ANCBY", "BAN$"))

# Finding multicharacter motifs:
find_motifs(sq_atp, sq_names, c("nsA", "mYmY$"))

# It can be used as part of a tidyverse pipeline:
library(dplyr)
fasta_file <- system.file(package = "tidysq", "examples/example_aa.fasta")
read_fasta(fasta_file) %>%
  mutate(name = toupper(name)) %>%
  find_motifs("TXG")


michbur/tidysq documentation built on April 1, 2022, 5:18 p.m.