parSeqSim: Parallel Protein Sequence Similarity Calculation Based on...
In protr: Generating Various Numerical Representation Schemes for Protein Sequences

parSeqSim

R Documentation

Parallel Protein Sequence Similarity Calculation Based on Sequence Alignment (In-Memory Version)

Description

Parallel calculation of protein sequence similarity based on sequence alignment.

Usage

parSeqSim(
  protlist,
  cores = 2,
  batches = 1,
  verbose = FALSE,
  type = "local",
  submat = "BLOSUM62",
  gap.opening = 10,
  gap.extension = 4
)

Arguments

`protlist`	A list of length `n` containing `n` protein sequences. Each component of the list is a character string storing one protein sequence. Unknown sequences should be represented as `""`.
`cores`	Integer. The number of CPU cores to use for parallel execution. Defaults to `2`. Use the `availableCores()` function in the parallelly package to check how many cores are available.
`batches`	Integer. The number of batches to split the pairwise similarity computations into. This is useful when you have a large number of protein sequences and enough CPU cores, but not enough RAM to compute and hold all the pairwise similarities in a single batch. Defaults to `1`.
`verbose`	Logical. Whether to print the computation progress. Useful when `batches > 1`. Defaults to `FALSE`.
`type`	Type of alignment, either `"global"` (Needleman-Wunsch global alignment) or `"local"` (Smith-Waterman local alignment). Defaults to `"local"`.
`submat`	Substitution matrix, default is `"BLOSUM62"`, can be one of `"BLOSUM45"`, `"BLOSUM50"`, `"BLOSUM62"`, `"BLOSUM80"`, `"BLOSUM100"`, `"PAM30"`, `"PAM40"`, `"PAM70"`, `"PAM120"`, or `"PAM250"`.
`gap.opening`	The cost required to open a gap of any length in the alignment. Defaults to 10.
`gap.extension`	The cost to extend the length of an existing gap by 1. Defaults to 4.
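
To get a feel for how `batches` relates to the amount of work, the following minimal sketch counts the unique sequence pairs for `n` sequences and divides their indices into near-equal batches. It uses only base R. The helper name `split_into_batches` and the splitting scheme are illustrative assumptions, and are not necessarily how `parSeqSim` partitions the computation internally.

```r
# Illustrative sketch only: base R, no protr required.
# `split_into_batches` is a hypothetical helper, not part of protr.

n <- 5                     # number of sequences in `protlist`
pairs <- combn(n, 2)       # 2 x choose(n, 2) matrix of index pairs (i < j)
n_pairs <- ncol(pairs)     # n * (n - 1) / 2 unique pairs

split_into_batches <- function(k, batches) {
  # Assign items 1..k to `batches` contiguous groups of near-equal size
  split(seq_len(k), ceiling(seq_len(k) * batches / k))
}

batch_idx <- split_into_batches(n_pairs, batches = 3)

# Number of pairwise alignments handled in each batch
lengths(batch_idx)
```

Because the number of pairs grows quadratically with `n`, a larger `batches` value means fewer pairwise results would need to be held at once under this scheme.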

Value

An `n` x `n` similarity matrix.
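
As a rough illustration of working with such a matrix, the sketch below labels a small similarity matrix and turns it into a dissimilarity object for hierarchical clustering. The numeric values are made-up placeholders, not output from `parSeqSim`. The sketch also assumes similarities scaled to the range 0 to 1 with ones on the diagonal, which this page does not state.

```r
# Illustrative sketch only: the values below are made-up placeholders,
# not results produced by parSeqSim.
ids <- c("P00750", "P08218", "P10323")
simmat <- matrix(
  c(1.00, 0.12, 0.08,
    0.12, 1.00, 0.25,
    0.08, 0.25, 1.00),
  nrow = 3, byrow = TRUE,
  dimnames = list(ids, ids)
)

# A pairwise similarity matrix is expected to be symmetric
stopifnot(isSymmetric(simmat))

# Assumption: similarities lie in [0, 1], so 1 - similarity is a dissimilarity
d <- as.dist(1 - simmat)

# Average-linkage hierarchical clustering on the dissimilarities
hc <- hclust(d, method = "average")
hc$labels
```

Naming the rows and columns up front keeps sequence identifiers attached to downstream results such as the clustering labels.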

Author(s)

Nan Xiao <https://nanx.me>

Examples

## Not run: 

# Be careful when testing this since it involves parallelization
# and might produce unpredictable results in some environments

library("protr") # provides readFASTA() and parSeqSim()
library("Biostrings")
library("foreach")
library("doParallel")

s1 <- readFASTA(system.file("protseq/P00750.fasta", package = "protr"))[[1]]
s2 <- readFASTA(system.file("protseq/P08218.fasta", package = "protr"))[[1]]
s3 <- readFASTA(system.file("protseq/P10323.fasta", package = "protr"))[[1]]
s4 <- readFASTA(system.file("protseq/P20160.fasta", package = "protr"))[[1]]
s5 <- readFASTA(system.file("protseq/Q9NZP8.fasta", package = "protr"))[[1]]
plist <- list(s1, s2, s3, s4, s5)
(psimmat <- parSeqSim(plist, cores = 2, type = "local", submat = "BLOSUM62"))

## End(Not run)

protr documentation built on Sept. 12, 2024, 6:44 a.m.