scan_ag: Find AG glycomodules in protein sequences

Description

AG glycomodules are amino acid dipeptides: OA, OS, OT, AO, SO and TO (and probably OG, OV, GO and VO) which are in close proximity to each other (Tan et al., 2003), where O is hydroxyproline, A is alanine, S is serine, T is threonine, G is glycine and V is valine. This function attempts to find these dipeptides according to user-specified rules. Since the positions of hydroxyprolines are usually unknown, all prolines are considered instead. If any of the supplied sequences contains "O", the function will consider only true AG glycomodules.
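The dipeptide rule above can be illustrated with a few lines of base R. This is only a minimal sketch under stated assumptions, not the package's implementation: the helper find_dipeptide_starts() and the toy sequence are hypothetical, and the proximity rules controlled by dim and div are ignored here.

```r
# Hypothetical toy sequence, not taken from any dataset.
toy_seq <- "MKSPAPTPSLLGPV"

# Partner residues for each search type, as listed in the description.
partners <- list(conservative = c("A", "S", "T"),
                 extended     = c("A", "S", "T", "G", "V"))

# Hypothetical helper: start positions of every dipeptide that pairs
# `hyp` ("P", or "O" for known hydroxyprolines) with a partner residue.
find_dipeptide_starts <- function(seq, type = "conservative", hyp = "P") {
  aa <- strsplit(seq, "")[[1]]
  n <- length(aa)
  first <- aa[-n]   # residue i
  second <- aa[-1]  # residue i + 1
  p <- partners[[type]]
  which((first == hyp & second %in% p) | (first %in% p & second == hyp))
}

# Mirror the rule from the description: switch to "O" only if any
# supplied sequence contains it, otherwise fall back to "P".
hyp <- if (any(grepl("O", toy_seq, fixed = TRUE))) "O" else "P"

find_dipeptide_starts(toy_seq, "conservative", hyp)  # positions 3 to 8
find_dipeptide_starts(toy_seq, "extended", hyp)      # additionally 12 and 13
```

Note that overlapping dipeptides (for example SP and PA in SPA) are reported separately in this sketch.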

Usage

scan_ag(data, ...)

## S3 method for class 'character'
scan_ag(data, ...)

## S3 method for class 'data.frame'
scan_ag(data, sequence, id, ...)

## S3 method for class 'list'
scan_ag(data, ...)

## Default S3 method:
scan_ag(
  data = NULL,
  sequence,
  id,
  dim = 3L,
  div = 10L,
  type = c("conservative", "extended"),
  exclude_ext = c("no", "yes", "all"),
  simplify = TRUE,
  tidy = FALSE,
  ...
)

## S3 method for class 'AAStringSet'
scan_ag(data, ...)

Arguments

data

A data frame with protein amino acid sequences as strings in one column and corresponding id's in another. Alternatively a path to a .fasta file with protein sequences. Alternatively a list with elements of class SeqFastaAA resulting from read.fasta call. Alternatively an AAStringSet object. Should be left blank if vectors are provided to sequence and id arguments.

...

Currently no additional arguments are accepted apart from the ones documented below.

sequence

A vector of strings representing protein amino acid sequences, or the appropriate column name if a data.frame is supplied to the data argument. If a .fasta file path or a list with elements of class "SeqFastaAA" is provided to data, this should be left blank.

id

A vector of strings representing protein identifiers, or the appropriate column name if a data.frame is supplied to the data argument. If a .fasta file path or a list with elements of class "SeqFastaAA" is provided to data, this should be left blank.

dim

An integer defining the minimum number of close dipeptides to be considered; by default set to 3.

div

An integer defining the maximum number of amino acids that can separate the dipeptides for them to be considered; by default set to 10.

type

One of c("conservative", "extended"). If "conservative", only A, S and T are considered as possible P|O partners in dipeptides; if "extended", dipeptides involving P|O with A, S, T, G and V are considered. By default set to "extended".

exclude_ext

One of c("no", "yes", "all"). Should extensin (SPPP+) regions be excluded from the search: "no" - do not exclude SPPP+; "yes" - exclude all SPPP+; "all" - exclude all PPP+.

simplify

Boolean, should the function return a data frame or a list with additional values.

tidy

Boolean, should the function return a tidy data frame instead of a list if simplify = FALSE.
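One way to read the dim and div arguments is sketched below in base R. It is a simplified interpretation for illustration, not the function's actual matching code: group_dipeptides() and the start positions are hypothetical, and the dipeptides are assumed not to overlap.

```r
# Hypothetical start positions of non-overlapping dipeptides in one sequence.
starts <- c(5L, 9L, 20L, 40L, 43L, 47L, 52L)

# Hypothetical helper: chain dipeptides separated by at most `div`
# residues and keep chains with at least `dim` dipeptides.
group_dipeptides <- function(starts, dim = 3L, div = 10L) {
  if (length(starts) == 0L) return(list())
  # A dipeptide occupies positions start and start + 1, so the number of
  # residues between two neighbouring dipeptides is diff(starts) - 2.
  gaps <- diff(starts) - 2L
  # Open a new group whenever the gap exceeds div.
  group <- cumsum(c(TRUE, gaps > div))
  runs <- split(starts, group)
  # Drop groups with fewer than dim dipeptides.
  unname(Filter(function(x) length(x) >= dim, runs))
}

group_dipeptides(starts, dim = 3L, div = 10L)  # two groups are kept
group_dipeptides(starts, dim = 3L, div = 4L)   # only the last group is kept
```

Lowering div splits the chains into smaller groups, and raising dim discards the short ones, which is why the two arguments are usually tuned together.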

Value

If simplify = TRUE, a data frame with one row per sequence, containing columns:

id

Character, as supplied in the function call.

sequence

Character, input sequence with amino acids in dipeptides (which satisfy the user set conditions) in uppercase.

AG_aa

Integer, number of matched amino acids in dipeptides.

total_length

Integer, total length of the found stretches of dipeptides including the amino acids between dipeptides in a match.

longest

Integer, maximum length of the found stretches of dipeptides including the amino acids between dipeptides in a match.

If simplify = FALSE and tidy = TRUE, a data frame with one row per match, with columns:

id

Character, as supplied in the function call.

sequence

Character, input sequence with amino acids in dipeptides that satisfy the user set conditions in uppercase

location.start

Integer, start of a match.

location.end

Integer, end of a match.

P_pos

List column, each element is an integer vector with AG-proline positions in each match.

AG_aa

Integer, number of amino acids in dipeptides in each match

If simplify = FALSE and tidy = FALSE, a list with elements:

id

Character vector, as supplied in the function call.

sequence

Character vector, each element corresponding to one input sequence, with matched letters (amino acids in dipeptides that satisfy the user set conditions) in uppercase

AG_aa

Integer vector, each element corresponding to the number of matched letters (amino acids in dipeptides that satisfy the user set conditions) in each input sequence

AG_locations

Named (by id) list of Integer vectors, each element corresponding to the locations of found dipeptides

total_length

Integer vector, with elements corresponding to the total length of the found stretches of dipeptides (including the amino acids between dipeptides in a match) in each sequence

longest

Integer vector, with elements corresponding to the maximum length of the found stretches of dipeptides (including the amino acids between dipeptides in a match) in each sequence

locations

Named (by id) list of numeric matrices, each element describing the start and end locations of the found stretches of dipeptides (including the amino acids between dipeptides in a match)

dim

Integer, as from input, default dim = 3

div

Integer, as from input, default div = 10

type

Character, as from input, one of c("conservative", "extended")
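The relation between the returned sequence and AG_aa can be checked by counting uppercase letters. The base R sketch below assumes that unmatched residues are returned in lowercase, and the marked-up string is a hypothetical example, not real output.

```r
# Hypothetical sequence in the style of the returned `sequence` value:
# residues in matched dipeptides are uppercase, the rest lowercase (assumed).
marked <- "mkSPAPTPSllgpv"

chars <- strsplit(marked, "")[[1]]

# Number of residues in matched dipeptides, comparable to AG_aa.
n_matched <- sum(chars %in% LETTERS)

# Positions of the matched residues within the sequence.
matched_pos <- which(chars %in% LETTERS)

n_matched    # 7
matched_pos  # 3 to 9
```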

Note

The function can be supplied with the sequences resulting from predict_hyp in which case only AG glycomodules containing O instead of P will be considered.

References

Tan L, Leykam JF, Kieliszewski MJ. (2003) Glycosylation motifs that direct arabinogalactan addition to arabinogalactan proteins. Plant Physiol 132: 1362-136

See Also

maab, predict_hyp

Examples

data(at_nsp)

# find all stretches of AP, SP, TP, PA, PS and PT dipeptides where there are at least
# 3 dipeptides separated by a maximum of 10 amino acids between each two dipeptides
at_nsp_ag <- scan_ag(sequence = at_nsp$sequence[1:20],
                     id = at_nsp$Transcript.id[1:20],
                     dim = 3,
                     div = 10,
                     type = "conservative")

# find all stretches of AP, SP, TP, GP, VP, PA, PS, PT, PG and PV dipeptides where there
# are at least 2 dipeptides separated by a maximum of 4 amino acids between them
at_nsp_ag <- scan_ag(sequence = at_nsp$sequence[1:20],
                     id = at_nsp$Transcript.id[1:20],
                     dim = 2,
                     div = 4,
                     type = "extended")

# check how much the results differ when extensin regions are excluded
at_sp_ag <- scan_ag(sequence = at_nsp$sequence,
                    id = at_nsp$Transcript.id,
                    dim = 3,
                    div = 6,
                    type = "extended")

at_sp_ag_ext <- scan_ag(sequence = at_nsp$sequence,
                        id = at_nsp$Transcript.id,
                        dim = 3,
                        div = 6,
                        type = "extended",
                        exclude_ext = "yes")

at_sp_ag_ext$sequence[at_sp_ag_ext$sequence != at_sp_ag$sequence]

missuse/ragp documentation built on Jan. 4, 2022, 10:49 a.m.