crossSetSim: Parallellized Protein Sequence Similarity Calculation Between...

View source: R/par-01-parSeqSim.R

crossSetSimR Documentation

Parallellized Protein Sequence Similarity Calculation Between Two Sets Based on Sequence Alignment (In-Memory Version)

Description

Parallellized calculation of protein sequence similarity based on sequence alignment between two sets of protein sequences.

Usage

crossSetSim(
  protlist1,
  protlist2,
  type = "local",
  cores = 2,
  submat = "BLOSUM62",
  gap.opening = 10,
  gap.extension = 4
)

Arguments

protlist1

A length n list containing n protein sequences, each component of the list is a character string, storing one protein sequence. Unknown sequences should be represented as "".

protlist2

A length n list containing m protein sequences, each component of the list is a character string, storing one protein sequence. Unknown sequences should be represented as "".

type

Type of alignment, default is "local", can be "global" or "local", where "global" represents Needleman-Wunsch global alignment; "local" represents Smith-Waterman local alignment.

cores

Integer. The number of CPU cores to use for parallel execution, default is 2. Users can use the detectCores() function in the parallel package to see how many cores they could use.

submat

Substitution matrix, default is "BLOSUM62", can be one of "BLOSUM45", "BLOSUM50", "BLOSUM62", "BLOSUM80", "BLOSUM100", "PAM30", "PAM40", "PAM70", "PAM120", or "PAM250".

gap.opening

The cost required to open a gap of any length in the alignment. Defaults to 10.

gap.extension

The cost to extend the length of an existing gap by 1. Defaults to 4.

Value

A n x m similarity matrix.

Author(s)

Sebastian Mueller <https://alva-genomics.com>

Examples

## Not run: 

# Be careful when testing this since it involves parallelisation
# and might produce unpredictable results in some environments

library("Biostrings")
library("foreach")
library("doParallel")

s1 <- readFASTA(system.file("protseq/P00750.fasta", package = "protr"))[[1]]
s2 <- readFASTA(system.file("protseq/P08218.fasta", package = "protr"))[[1]]
s3 <- readFASTA(system.file("protseq/P10323.fasta", package = "protr"))[[1]]
s4 <- readFASTA(system.file("protseq/P20160.fasta", package = "protr"))[[1]]
s5 <- readFASTA(system.file("protseq/Q9NZP8.fasta", package = "protr"))[[1]]

plist1 <- list(s1 = s1, s2 = s2, s4 = s4)
plist2 <- list(s3 = s3, s4_again = s4, s5 = s5, s1_again = s1)
psimmat <- crossSetSim(plist1, plist2)
colnames(psimmat) <- names(plist1)
rownames(psimmat) <- names(plist2)
print(psimmat)
#                 s1         s2         s4
# s3       0.10236985 0.18858241 0.05819984
# s4_again 0.04921696 0.12124217 1.00000000
# s5       0.03943488 0.06391103 0.05714638
# s1_again 1.00000000 0.11825938 0.04921696

## End(Not run)

protr documentation built on Nov. 2, 2023, 6:04 p.m.