In jakubkala/QuiPTsim:

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
  )

library(QuiPTsim)

Introduction

QuiPTsim is a framework generating matrices of n-gram occurences of given sequences, which will be further used to assess feature selection algorithm QuiPT, which allows very fast feature filtering.

QuiPT details

# source("./inst/QuiPTdiagramme.R")
# my_graphviz

Simulation pipeline

Function create_simulation_data() is a high-level wrapper of function implemented in this package. The purpose of this package is to automate data generation which will be further used in QuiPT assessment.

Results of each simulation are placed in previously stated directory. The csv file contains details of every Rds file computed during simulation. There is also a path column which allows us to link Rds files with specific csv rows.

Simulation highlights:

Each simulation has constant number of sequences ($6000$), since this is an upper bound for most of datasets and these can be sampled in case we would need smaller datasets
Various alphabet sizes are used sequences have been sampled idependently:
Full 20-element alphabet with various probability vectors (based on drake repo) AmpGram
Simplified alphabet (from literature and AmyloGram) with probabilites from AmyloGram analysis
Each simulation has been repeated $100$ in order to ensure stability of results
Half of the sequences are positive (have motif injected) -- we can manipulate fraction parameter by sampling full dataset
Parameters $n=4$ and $d=6$ are fixed in all considerations (this gives maximum sequence length 10)

Sequence generation

generate_single_motif : function generates motif from an alphabet : n - number of elements sampled from an alphabet : d - total sum of gaps : motifProbs - (optional) probabilites of alphabet elements

generate_single_motif(1:4)
generate_single_motif(c("a", "b", "c"))
generate_single_motif(1:4, n = 6, d = 2, motifProbs = c(0.7, 0.1, 0.1, 0.1))

In our simulations, we have decided to use $n=4$ and $d=6$, which gives maximum motif length being 10.

generate_motifs : function returns list of motifs (results of generate_single_motif()) : function asserts if such motifs can be injected to a sequence (sequence length = 10)

print(generate_motifs(alphabet = letters[1:6], n_motif = 3))

simulate_single_sequence : generates sequence of elements from alphabet with replacement : (optional) alphabet elements' probabilites can be specified

print(simulate_single_sequence(len = 10,
                         alphabet = letters[1:6]))

# we can also generate sequence based on sequence elements probabilites
probs <- runif(6)
probs <- probs / sum(probs)
print(probs)
print(simulate_single_sequence(len = 10,
                         alphabet = letters[1:6],
                         seqProbs = probs))

add_motifs : function injects motifs to a sequence

each sequence is returned with attributes: - list of motifs - list of according masks

Important thing about motif injection is the fact, that multiple motifs can overlap each other.

alph <- letters[1:6]

example_sequence <- simulate_single_sequence(len = 10, alph)
motifs_list <- generate_motifs(alph, 2)
print(example_sequence)
print(motifs_list)
injected_sequence <- add_motifs(sequence = example_sequence,
                                motifs = motifs_list)
print(injected_sequence)

simulate_sequences : function generates sequences (both positive & negative) : matrix of sequences is returned : motifs are injected to a fraction of sequences : following attributes are added: * list of motifs * list of masks * target variable (flag if motifs have been injected to a sequence)

numberOfSequences <- 10
sequencesLength <- 10
alphabet <- letters[1:4]
motifs <- generate_motifs(alph, 3)

# Although motifs list contain 3 independent motifs
# only one will be injected
# Since fraction = 0.2, motifs will be injected only into 2 sequences
simulate_sequences(numberOfSequences,
                   sequencesLength, 
                   alphabet,
                   motifs,
                   n_motifs = 1,
                   fraction = 0.2)

generate_sequences : wrapper for a simulate_sequences() and place where n-grams are counted : n-grams are counted by count_multimers() function from seqR package, which allows us to simulate data at terrific speed

Utilization of seqR::count_multimers() is tricky, as we are supposed to deliver vector of all k-mers' lengths along with vector of gaps for each corresponding k-mer length.

Simulation data

create_simulation_data : function is a wrapper of functions implemented in sequence generation block.

# alphabet elements
alph <- letters[1:6]

# number of replication for each set of parameters (sequence length, number of motifs, etc.)
reps <- 10

# Number of sequences to be generated
n_seq <- c(10)

# Sequences' length
l_seq <- c(10)

# Number of motifs that will be injected to a sequence
n_motifs <- c(1, 2)

# results will be saved in this location
path <- "./"

# triggers if matrices will be saved (debugging purposes)
save_files <- FALSE

# title of simulation
title <- "SEQ"

results <- create_simulation_data(reps, n_seq, l_seq, n_motifs, alph, path, title, save_files)
results

On top of above example, create_simulation_data allows us to specify probability vectors for both motifs and sequences. Furthermore, MD5 sums are being collected right after object save.

RDS files saved during simulations contain sparse matrix with n-gram counts along with target (column with 1/0 flag showing if given sequence has been injected) and two lists: motifs and their masks.

Cosine similarity

The purpose of this part comes from the need for creating vector of probabilites that have exact cosine similarity. Amyloidogenic and non-amyloidogenic sequences have different aa distribution. We would like to mimic this property by generating vector of aa probabilities that have previously observed cosine similarity.

euclidean_norm : returns euclidean norm of a vector

x <- runif(5)
print(x)
print(euclidean_norm(x))

cosine_similarity : return cosine similarity of two vectors

$$ cosine~similarity(u, v) = \frac{}{|u||v|} $$

x <- runif(5)
y <- runif(5)
print(x)
print(y)
print(cosine_similarity(x, y))

generate_probs : return two vectors of probabilites with given cosine similarity : note: as cosine similarity parameter approaches 0.5, some computational issues may arrive

size <- 6
cs <- 0.8

probabilites <- generate_probs(size, cs)
print(probabilites)

print(cosine_similarity(probabilites[1, ], probabilites[2, ]))

N-gram probabilites

sequenceTransitionMatrix : function takes two arguments: matrix of sequences (sequences in rows), alphabet(Markov chain states) : returns markovchain class instance

alphabet <- letters[1:4]
sequences <- matrix(sample(alphabet, size = 50, replace=TRUE), nrow = 5, ncol = 10)
sequenceTransitionMatrix(sequences, alphabet)

calculate_seq_prob : function calculates probabiliity of given trajectory : takes two arguments: markovchain class instance and a sequence

Note: Inital state is not taken into considerations

alphabet <- letters[1:4]
sequences <- matrix(sample(alphabet, size = 50, replace=TRUE), nrow = 5, ncol = 10)
mc <- sequenceTransitionMatrix(sequences, alphabet)
print(mc)
example_seq <- sample(alphabet, size = 5, replace = TRUE)
print(example_seq)
calculate_seq_prob(mc, example_seq)

calculate_ngram_prob : function is an extension for calculate_seq_prob; we can compute probability of trajectories with gaps

alphabet <- letters[1:4]
sequences <- matrix(sample(alphabet, size = 50, replace=TRUE), nrow = 5, ncol = 10)
mc <- sequenceTransitionMatrix(sequences, alphabet)
example_ngram <- sample(c(alphabet, "_"), size = 5, replace = TRUE)
print(example_ngram)
calculate_ngram_prob(mc, example_ngram)