scan_ag: Find AG glycomodules in protein sequences
In missuse/ragp: Mining for Hydroxyproline rich glycoprotein sequences

AG glycomodules are amino acid dipeptides: OA, OS, OT, AO, SO and TO (and probably OG, OV, GO and VO) which are in close proximity to each other (Tan et al., 2003), where O is hydroxyproline, A is alanine, S is serine, T is threonine, G is glycine and V is valine. This function attempts to find these dipeptides according to user-specified rules. Since the positions of hydroxyprolines are usually unknown, all prolines are considered instead. If any of the supplied sequences contains "O", the function will consider only true AG glycomodules.
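
The search described above can be illustrated with a minimal base R sketch for a single sequence. This is not the implementation used by scan_ag: `find_ag_stretches` and its `partners` argument are hypothetical names introduced here, and the way overlapping dipeptides and gaps are counted is an assumption made for illustration.

```r
# Hypothetical helper (not part of ragp): a simplified illustration of the
# rules described above, for a single sequence.
find_ag_stretches <- function(seq, dim = 3L, div = 10L,
                              partners = c("A", "S", "T")) {
  empty <- data.frame(start = integer(0), end = integer(0),
                      n_dipeptides = integer(0))
  chars <- strsplit(toupper(seq), "")[[1]]
  n <- length(chars)
  if (n < 2L) return(empty)

  # Use hydroxyproline if it is annotated, otherwise fall back to proline.
  anchor <- if ("O" %in% chars) "O" else "P"

  # A dipeptide starts at i if residues i and i + 1 pair the anchor
  # with one of the partner amino acids, in either order.
  first <- chars[-n]
  second <- chars[-1]
  starts <- which((first == anchor & second %in% partners) |
                  (first %in% partners & second == anchor))

  # Assumption: dipeptides may not overlap, so scan left to right and
  # skip any candidate that shares a residue with the previous one.
  keep <- logical(length(starts))
  last_end <- 0L
  for (i in seq_along(starts)) {
    if (starts[i] > last_end) {
      keep[i] <- TRUE
      last_end <- starts[i] + 1L
    }
  }
  starts <- starts[keep]
  if (length(starts) == 0L) return(empty)

  # Number of residues between consecutive dipeptides; a gap larger
  # than div starts a new stretch.
  gaps <- diff(starts) - 2L
  group <- cumsum(c(TRUE, gaps > div))

  # Keep only stretches with at least dim dipeptides.
  n_dp <- tapply(starts, group, length)
  ok <- n_dp >= dim
  data.frame(
    start = as.integer(tapply(starts, group, min)[ok]),
    end = as.integer(tapply(starts, group, max)[ok]) + 1L,
    n_dipeptides = as.integer(n_dp[ok])
  )
}

# Made-up sequence: three close dipeptides, then one isolated dipeptide.
example_seq <- paste0("MKSPAPTP", strrep("L", 12), "APK")
find_ag_stretches(example_seq, dim = 3, div = 10)
```

In this sketch, raising div to 12 would merge the isolated AP dipeptide into the first stretch, which shows how dim and div interact.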

scan_ag(data, ...)

## S3 method for class 'character'
scan_ag(data, ...)

## S3 method for class 'data.frame'
scan_ag(data, sequence, id, ...)

## S3 method for class 'list'
scan_ag(data, ...)

## Default S3 method:
scan_ag(
  data = NULL,
  sequence,
  id,
  dim = 3L,
  div = 10L,
  type = c("conservative", "extended"),
  exclude_ext = c("no", "yes", "all"),
  simplify = TRUE,
  tidy = FALSE,
  ...
)

## S3 method for class 'AAStringSet'
scan_ag(data, ...)

`data`	A data frame with protein amino acid sequences as strings in one column and corresponding ids in another. Alternatively, a path to a .fasta file with protein sequences, a list with elements of class `SeqFastaAA` resulting from a `read.fasta` call, or an `AAStringSet` object. Should be left blank if vectors are provided to the sequence and id arguments.
`...`	Currently no additional arguments are accepted apart from the ones documented below.
`sequence`	A vector of strings representing protein amino acid sequences, or the appropriate column name if a data.frame is supplied to the data argument. If a .fasta file path or a list with elements of class "SeqFastaAA" is provided to data, this should be left blank.
`id`	A vector of strings representing protein identifiers, or the appropriate column name if a data.frame is supplied to the data argument. If a .fasta file path or a list with elements of class "SeqFastaAA" is provided to data, this should be left blank.
`dim`	An integer defining the minimum number of close dipeptides to be considered, by default set to 3.
`div`	An integer defining the maximum number of amino acids that can separate the dipeptides for them to be considered, by default set to 10.
`type`	One of c("conservative", "extended"). If "conservative", only A, S and T are considered as possible partners of P or O in dipeptides; if "extended", dipeptides involving P or O with A, S, T, G and V are considered. By default set to "extended".
`exclude_ext`	One of c("no", "yes", "all"), should extensin (SPPP+) regions be excluded from the search: "no" - do not exclude SPPP+; "yes" - exclude all SPPP+; "all" - exclude all PPP+
`simplify`	Boolean, should the function return a data frame (TRUE) or a list with additional values (FALSE).
`tidy`	Boolean, should the function return a tidy data frame instead of a list if simplify = FALSE.
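
The SPPP+ and PPP+ notation used for exclude_ext can be read as regular expressions. The base R sketch below shows which regions such patterns match in a made-up sequence; it only illustrates the patterns and makes no claim about how scan_ag excludes these regions internally. All object names are introduced here for illustration.

```r
# Made-up example sequence, for illustration only.
x <- "ASPPPPKTPPPAS"

# "SPPP+": a serine followed by three or more prolines.
sppp <- gregexpr("SPPP+", x)[[1]]
# "PPP+": three or more consecutive prolines, with or without a leading serine.
ppp <- gregexpr("PPP+", x)[[1]]

# Start and end positions of the matched regions.
sppp_regions <- data.frame(
  start = as.integer(sppp),
  end = as.integer(sppp) + attr(sppp, "match.length") - 1L
)
ppp_regions <- data.frame(
  start = as.integer(ppp),
  end = as.integer(ppp) + attr(ppp, "match.length") - 1L
)

sppp_regions
ppp_regions
```

The second proline run in this sequence is preceded by T rather than S, so it is matched by PPP+ only.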

If simplify = TRUE, a data frame with one row per sequence, containing columns:

id: Character, as supplied in the function call.
sequence: Character, input sequence with amino acids in dipeptides (which satisfy the user set conditions) in uppercase.
AG_aa: Integer, number of matched amino acids in dipeptides.
total_length: Integer, total length of the found stretches of dipeptides including the amino acids between dipeptides in a match.
longest: Integer, maximum length of the found stretches of dipeptides including the amino acids between dipeptides in a match.

If simplify = FALSE and tidy = TRUE, a data frame with one row per match, with columns:

id: Character, as supplied in the function call.
sequence: Character, input sequence with amino acids in dipeptides that satisfy the user set conditions in uppercase
location.start: Integer, start of a match.
location.end: Integer, end of a match.
P_pos: List column, each element is an integer vector with AG-proline positions in each match.
AG_aa: Integer, number of amino acids in dipeptides in each match

If simplify = FALSE and tidy = FALSE, a list with elements:

id: Character vector, as supplied in the function call.
sequence: Character vector, each element corresponding to one input sequence, with matched letters (amino acids in dipeptides that satisfy the user set conditions) in uppercase
AG_aa: Integer vector, each element corresponding to the number of matched letters (amino acids in dipeptides that satisfy the user set conditions) in each input sequence
AG_locations: Named (by id) list of Integer vectors, each element corresponding to the locations of found dipeptides
total_length: Integer vector, with elements corresponding to the total length of the found stretches of dipeptides (including the amino acids between dipeptides in a match) in each sequence
longest: Integer vector, with elements corresponding to the maximum length of the found stretches of dipeptides (including the amino acids between dipeptides in a match) in each sequence
locations: Named (by id) list of numeric matrices, each element describing the start and end locations of the found stretches of dipeptides (including the amino acids between dipeptides in a match)
dim: Integer, as from input, default dim = 3
div: Integer, as from input, default div = 10
type: Character, as from input, one of c("conservative", "extended")
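
Based on the descriptions above, total_length and longest can be related to a locations matrix as in the following base R sketch. The positions are made up, and the column names (start, end) are assumptions made for illustration; the layout returned by scan_ag may differ.

```r
# Made-up start and end positions of two matched stretches in one sequence.
locations <- matrix(c(25L, 40L,
                      61L, 70L),
                    ncol = 2, byrow = TRUE,
                    dimnames = list(NULL, c("start", "end")))

# Length of each stretch, counting both the first and the last residue.
stretch_length <- locations[, "end"] - locations[, "start"] + 1L

total_length <- sum(stretch_length)  # summed over all stretches
longest <- max(stretch_length)       # the longest single stretch
```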

The function can be supplied with the sequences resulting from predict_hyp in which case only AG glycomodules containing O instead of P will be considered.

Tan L, Leykam JF, Kieliszewski MJ. (2003) Glycosylation motifs that direct arabinogalactan addition to arabinogalactan proteins. Plant Physiol 132: 1362-136

maab predict_hyp

data(at_nsp)

# find all stretches of AP, SP, TP, PA, PS and PT dipeptides where there are at least
# 3 dipeptides separated by a maximum of 10 amino acids between each two dipeptides
at_nsp_ag <- scan_ag(sequence = at_nsp$sequence[1:20],
                     id = at_nsp$Transcript.id[1:20],
                     dim = 3,
                     div = 10,
                     type = "conservative")

# find all stretches of AP, SP, TP, GP, VP, PA, PS, PT, PG and PV dipeptides where there
# are at least 2 dipeptides separated by a maximum of 4 amino acids between them
at_nsp_ag <- scan_ag(sequence = at_nsp$sequence[1:20],
                     id = at_nsp$Transcript.id[1:20],
                     dim = 2,
                     div = 4,
                     type = "extended")

# check how much the results differ when extensin regions are excluded
at_sp_ag <- scan_ag(sequence = at_nsp$sequence,
                    id = at_nsp$Transcript.id,
                    dim = 3,
                    div = 6,
                    type = "extended")

at_sp_ag_ext <- scan_ag(sequence = at_nsp$sequence,
                        id = at_nsp$Transcript.id,
                        dim = 3,
                        div = 6,
                        type = "extended",
                        exclude_ext = "yes")

at_sp_ag_ext$sequence[at_sp_ag_ext$sequence != at_sp_ag$sequence]