count_kmers: Counting k-mers in the dataset.


Description

Counts the number of times each k-mer appears in the dataset and returns a dataframe of these counts. Each k-mer is counted at most once per sequence: if a k-mer occurs several times in one sequence, only one of those occurrences is counted.
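The "at most once per sequence" rule can be illustrated with a minimal base-R sketch. It is not the package's implementation; `distinct_kmers` is a hypothetical helper introduced only for this illustration, and the two toy sequences are made up.

```r
# Hypothetical helper (not part of naddaR): the distinct k-mers of one sequence.
distinct_kmers <- function(s, klen) {
  n <- nchar(s)
  if (n < klen) return(character(0))
  starts <- seq_len(n - klen + 1)
  # unique() is what makes a repeated k-mer count only once per sequence
  unique(substring(s, starts, starts + klen - 1))
}

seqs <- c(seq1 = "AAAAC", seq2 = "CAAA")
per_seq <- lapply(seqs, distinct_kmers, klen = 3)

# Tabulate over all sequences: each count is the number of sequences
# that contain the k-mer, not the total number of occurrences.
counts <- table(unlist(per_seq, use.names = FALSE))
counts[["AAA"]]  # 2, although "AAA" occurs twice within seq1 alone
```

Under this rule a count can never exceed the number of sequences in the dataset.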

Usage

count_kmers(obj, klen = 6, parallel = TRUE, nproc = ifelse(parallel,
  pbdMPI::comm.size(), 1), distributed = FALSE)

Arguments

obj

A file path to a FASTA file containing protein sequences, or an AAStringSet object containing the sequences.

klen

Length of the k-mers to be counted.

parallel

A boolean, indicating whether the operation should be performed in parallel.

nproc

Currently not supported. The function will use all processors available to the job on the cluster.

distributed

A boolean, indicating whether the data is spread among multiple processors.

Details

If parallel is set to TRUE and distributed is set to FALSE, the function distributes the data among the processors and sets distributed to TRUE. If instead parallel is set to FALSE and distributed is set to TRUE, each processor computes the k-mer frequencies for its own sequences and the processors then exchange these counts. In the end every processor holds the same set of k-mer frequencies, which it uses to generate frequency profiles for its chunk of sequences. To run the operation serially, set both parallel and distributed to FALSE.
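The "count locally, then combine" idea behind the distributed mode can be sketched in plain R without any MPI calls. This is only an illustration under stated assumptions: the two "processors" are simulated by splitting a vector into two chunks, `count_chunk` is a hypothetical helper, and the summation step merely stands in for the inter-processor communication the package performs.

```r
# Hypothetical helper: count each distinct k-mer once per sequence in a chunk.
count_chunk <- function(chunk, klen) {
  kmers <- unlist(lapply(chunk, function(s) {
    n <- nchar(s)
    if (n < klen) return(character(0))
    starts <- seq_len(n - klen + 1)
    unique(substring(s, starts, starts + klen - 1))
  }), use.names = FALSE)
  table(kmers)
}

seqs <- c("MLVVD", "PVVRA", "LVVR")
# Pretend there are two processors, each holding one chunk of the sequences.
chunks <- split(seqs, c(1, 1, 2))
local_counts <- lapply(chunks, count_chunk, klen = 3)

# Stand-in for the communication step: sum the local tables by k-mer name,
# so that every "processor" could end up with the same global counts.
all_kmers <- sort(unique(unlist(lapply(local_counts, names))))
global_counts <- vapply(all_kmers, function(k) {
  sum(vapply(local_counts, function(tab) {
    if (k %in% names(tab)) as.integer(tab[[k]]) else 0L
  }, integer(1)))
}, integer(1))
global_counts
```

Because each sequence lives in exactly one chunk, summing the per-chunk counts gives the same totals as counting over the whole dataset at once.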

Value

Returns a dataframe with two columns. Each row holds one k-mer and an integer giving the number of times that k-mer appears in the input dataset. Each k-mer is counted at most once per sequence: if a k-mer occurs several times in one sequence, only one of those occurrences is counted.

Author(s)

Armen Abnousi

Examples

library(Biostrings)
library(data.table)
## Generate a set of three example protein sequences
seqs <- AAStringSet(c("seq1"="MLVVD",
                      "seq2"="PVVRA",
                      "seq3"="LVVR"))
## Inspect the sequence set
str(seqs)
length(seqs)
## Count the k-mers serially and generate a dataframe of the frequencies
freqs <- count_kmers(seqs, klen = 3, parallel = FALSE)
head(freqs)
##   kmer count
##1:  LVV     2
##2:  MLV     1
##3:  PVV     1
##4:  VRA     1
##5:  VVD     1
##6:  VVR     2

armenabnousi/naddaR documentation built on May 24, 2019, 8:47 p.m.