knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(QuiPTsim)
QuiPTsim is a framework generating matrices of n-gram occurences of given sequences, which will be further used to assess feature selection algorithm QuiPT, which allows very fast feature filtering.
# source("./inst/QuiPTdiagramme.R") # my_graphviz
Function create_simulation_data()
is a high-level wrapper of function implemented in this package. The purpose of this package is to automate data generation which will be further used in QuiPT assessment.
Results of each simulation are placed in previously stated directory. The csv file contains details of every Rds file computed during simulation. There is also a path
column which allows us to link Rds files with specific csv rows.
Simulation highlights:
generate_single_motif
: function generates motif from an alphabet
: n
- number of elements sampled from an alphabet
: d
- total sum of gaps
: motifProbs
- (optional) probabilites of alphabet elements
generate_single_motif(1:4) generate_single_motif(c("a", "b", "c")) generate_single_motif(1:4, n = 6, d = 2, motifProbs = c(0.7, 0.1, 0.1, 0.1))
In our simulations, we have decided to use $n=4$ and $d=6$, which gives maximum motif length being 10.
generate_motifs
: function returns list of motifs (results of generate_single_motif()
)
: function asserts if such motifs can be injected to a sequence (sequence length = 10)
print(generate_motifs(alphabet = letters[1:6], n_motif = 3))
simulate_single_sequence : generates sequence of elements from alphabet with replacement : (optional) alphabet elements' probabilites can be specified
print(simulate_single_sequence(len = 10, alphabet = letters[1:6])) # we can also generate sequence based on sequence elements probabilites probs <- runif(6) probs <- probs / sum(probs) print(probs) print(simulate_single_sequence(len = 10, alphabet = letters[1:6], seqProbs = probs))
add_motifs : function injects motifs to a sequence
each sequence is returned with attributes: - list of motifs - list of according masks
Important thing about motif injection is the fact, that multiple motifs can overlap each other.
alph <- letters[1:6] example_sequence <- simulate_single_sequence(len = 10, alph) motifs_list <- generate_motifs(alph, 2) print(example_sequence) print(motifs_list) injected_sequence <- add_motifs(sequence = example_sequence, motifs = motifs_list) print(injected_sequence)
simulate_sequences : function generates sequences (both positive & negative) : matrix of sequences is returned : motifs are injected to a fraction of sequences : following attributes are added: * list of motifs * list of masks * target variable (flag if motifs have been injected to a sequence)
numberOfSequences <- 10 sequencesLength <- 10 alphabet <- letters[1:4] motifs <- generate_motifs(alph, 3) # Although motifs list contain 3 independent motifs # only one will be injected # Since fraction = 0.2, motifs will be injected only into 2 sequences simulate_sequences(numberOfSequences, sequencesLength, alphabet, motifs, n_motifs = 1, fraction = 0.2)
generate_sequences
: wrapper for a simulate_sequences()
and place where n-grams are counted
: n-grams are counted by count_multimers()
function from seqR
package, which allows us to simulate data at terrific speed
Utilization of seqR::count_multimers()
is tricky, as we are supposed to deliver vector of all k-mers' lengths along with vector of gaps for each corresponding k-mer length.
create_simulation_data : function is a wrapper of functions implemented in sequence generation block.
# alphabet elements alph <- letters[1:6] # number of replication for each set of parameters (sequence length, number of motifs, etc.) reps <- 10 # Number of sequences to be generated n_seq <- c(10) # Sequences' length l_seq <- c(10) # Number of motifs that will be injected to a sequence n_motifs <- c(1, 2) # results will be saved in this location path <- "./" # triggers if matrices will be saved (debugging purposes) save_files <- FALSE # title of simulation title <- "SEQ" results <- create_simulation_data(reps, n_seq, l_seq, n_motifs, alph, path, title, save_files) results
On top of above example, create_simulation_data
allows us to specify probability vectors for both motifs and sequences. Furthermore, MD5 sums are being collected right after object save.
RDS files saved during simulations contain sparse matrix with n-gram counts along with target (column with 1/0 flag showing if given sequence has been injected) and two lists: motifs and their masks.
The purpose of this part comes from the need for creating vector of probabilites that have exact cosine similarity. Amyloidogenic and non-amyloidogenic sequences have different aa distribution. We would like to mimic this property by generating vector of aa probabilities that have previously observed cosine similarity.
euclidean_norm : returns euclidean norm of a vector
x <- runif(5) print(x) print(euclidean_norm(x))
cosine_similarity : return cosine similarity of two vectors
$$ cosine~similarity(u, v) = \frac{}{|u||v|} $$
x <- runif(5) y <- runif(5) print(x) print(y) print(cosine_similarity(x, y))
generate_probs : return two vectors of probabilites with given cosine similarity : note: as cosine similarity parameter approaches 0.5, some computational issues may arrive
size <- 6 cs <- 0.8 probabilites <- generate_probs(size, cs) print(probabilites) print(cosine_similarity(probabilites[1, ], probabilites[2, ]))
sequenceTransitionMatrix : function takes two arguments: matrix of sequences (sequences in rows), alphabet(Markov chain states) : returns markovchain class instance
alphabet <- letters[1:4] sequences <- matrix(sample(alphabet, size = 50, replace=TRUE), nrow = 5, ncol = 10) sequenceTransitionMatrix(sequences, alphabet)
calculate_seq_prob : function calculates probabiliity of given trajectory : takes two arguments: markovchain class instance and a sequence
Note: Inital state is not taken into considerations
alphabet <- letters[1:4] sequences <- matrix(sample(alphabet, size = 50, replace=TRUE), nrow = 5, ncol = 10) mc <- sequenceTransitionMatrix(sequences, alphabet) print(mc) example_seq <- sample(alphabet, size = 5, replace = TRUE) print(example_seq) calculate_seq_prob(mc, example_seq)
calculate_ngram_prob
: function is an extension for calculate_seq_prob
; we can compute probability of trajectories with gaps
alphabet <- letters[1:4] sequences <- matrix(sample(alphabet, size = 50, replace=TRUE), nrow = 5, ncol = 10) mc <- sequenceTransitionMatrix(sequences, alphabet) example_ngram <- sample(c(alphabet, "_"), size = 5, replace = TRUE) print(example_ngram) calculate_ngram_prob(mc, example_ngram)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.