
knitr::opts_chunk$set(echo = TRUE)
library(QuiPTsim)
setwd("~/amylogram/QuiPTsim/vignettes/")

QuiPTsim contains the source code used in an extensive study of the properties of the Quick Permutation Test (QuiPT). A further comparison of QuiPT with other widely used feature selection algorithms is planned.

Simulation data

The following steps have been carried out to create the datasets:

Various alphabets have been defined. The generated sequences use both the full alphabet and its simplifications (taken from the literature and from the domain knowledge of Mr Burdukiewicz).
Sequences have been generated using both a uniform distribution of elements and previously computed element probabilities (amino acid (AA) occurrence counts in the AmpGram data). Sequence simplification was likewise based on the precomputed AA probabilities.
Motifs have been injected into these sequences. Again, the AA occurrence probabilities were precomputed on the AmpGram data. It is worth noting that motif injection has been implemented so that motifs can overlap each other.
Each n-gram occurrence matrix has been equipped with a list of motifs and their masks.
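
The generation and injection steps above can be sketched in base R. This is only an illustration under stated assumptions: the helpers `simulate_sequence()` and `inject_motif()`, the toy alphabet, the probability vector and the `"_"` gap symbol are all hypothetical and are not part of the QuiPTsim API.

```r
# Hypothetical base-R sketch; none of these helpers come from QuiPTsim.
set.seed(1)

alphabet <- c("A", "C", "D", "E")   # toy alphabet (assumption)
probs    <- c(0.4, 0.3, 0.2, 0.1)   # made-up element probabilities

# Draw a sequence of the given length; prob = NULL means a uniform draw.
simulate_sequence <- function(len, alphabet, prob = NULL) {
  sample(alphabet, size = len, replace = TRUE, prob = prob)
}

# Write a motif into a sequence starting at position `start`.
# "_" marks a gap: the sequence element under a gap is left unchanged.
inject_motif <- function(sequence, motif, start) {
  idx  <- start + seq_along(motif) - 1
  keep <- motif != "_"
  sequence[idx[keep]] <- motif[keep]
  sequence
}

uniform_seq  <- simulate_sequence(20, alphabet)
weighted_seq <- simulate_sequence(20, alphabet, prob = probs)

motif_a <- c("A", "_", "_", "C")    # two elements separated by two gaps
motif_b <- c("D", "E")              # contiguous motif

# Overlapping injection: motif_b is placed inside the gaps of motif_a.
with_motifs <- inject_motif(uniform_seq, motif_a, start = 5)
with_motifs <- inject_motif(with_motifs, motif_b, start = 6)

paste(with_motifs[5:8], collapse = "")   # "ADEC"
```

In this sketch the two motifs share positions 6 and 7 without destroying each other, because gaps never overwrite the underlying sequence. How QuiPTsim itself resolves overlaps is not specified here.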

Simulation pipeline details

The function create_simulation_data() is a high-level wrapper around functions implemented in the QuiPTsim package; it automates the data generation process.

The results of a single simulation iteration are saved in the previously specified directory. N-gram occurrence matrices (in RDS format) are saved along with a master data frame in CSV format that contains all the details of the prepared dataset. Each row of the master data frame represents a single matrix, and the location of that matrix is given in the path column.
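
A minimal sketch of how such output could be read back is shown below. To stay self-contained it first writes a tiny stand-in for the simulation output to a temporary directory. Only the `path` column name comes from the description above; the file names, the `replication` column and the n-gram column names are assumptions made for illustration.

```r
# Build a tiny stand-in for the simulation output (all names hypothetical
# except the `path` column), then read it back.
out_dir <- file.path(tempdir(), "quiptsim_sketch")
dir.create(out_dir, showWarnings = FALSE)

# Toy n-gram occurrence matrix: rows are sequences, columns are n-grams.
toy_matrix <- matrix(c(1, 0, 0, 1, 1, 0), nrow = 2,
                     dimnames = list(NULL, c("A", "C", "A_C")))
matrix_path <- file.path(out_dir, "matrix_1.Rds")
saveRDS(toy_matrix, matrix_path)

# Master data frame: one row per matrix, its location in `path`.
master <- data.frame(replication = 1, path = matrix_path,
                     stringsAsFactors = FALSE)
master_path <- file.path(out_dir, "master.csv")
write.csv(master, master_path, row.names = FALSE)

# Reading the results: load every matrix listed in the master data frame.
master_read <- read.csv(master_path, stringsAsFactors = FALSE)
matrices    <- lapply(master_read$path, readRDS)
```

Keeping the matrices in separate RDS files and only their paths in the CSV means the master data frame stays small and individual matrices can be loaded on demand.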

Simulation highlights:

Each simulation has a constant number of sequences ($6000$), since this is an upper bound for most of the available datasets. Smaller datasets can be obtained by subsampling the sequences.
Both full and simplified alphabets have been considered:
A 20-element alphabet with various probability vectors (based on the AmpGram drake repository).
Simplified alphabets, both from the literature and from the Amylogram analysis, with probabilities from the AmpGram analysis (to be verified).
Each simulation has been repeated $100$ times.
Positive sequences make up $50\%$ of the full dataset.
The number of alphabet elements in a motif and the total number of gaps in a motif are fixed throughout ($n=4$ and $d=6$, giving a maximum motif length of $n+d=10$).
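
Two of the points above, subsampling and the motif length bound, can be illustrated with a short base-R sketch. The helper `subsample_balanced()` is hypothetical, and the choice to keep the $50\%$ class balance while subsampling is an assumption of this sketch rather than something stated about QuiPTsim. The `"_"` gap symbol is likewise only illustrative.

```r
set.seed(42)

n_seq  <- 6000
labels <- rep(c(TRUE, FALSE), each = n_seq / 2)   # 50% positive sequences

# Hypothetical helper: draw `size` row indices, half from each class,
# so the subsample keeps the same class balance as the full dataset.
subsample_balanced <- function(labels, size) {
  half <- size / 2
  c(sample(which(labels), half), sample(which(!labels), half))
}

idx <- subsample_balanced(labels, size = 600)

# Motif length bound: n elements plus d gaps in total.
n <- 4
d <- 6
max_motif_length <- n + d                          # 10

# One motif that reaches the bound: 4 elements and 6 gaps ("_").
example_motif <- c("A", "_", "_", "C", "_", "_", "D", "_", "_", "E")
length(example_motif) == max_motif_length          # TRUE
```

The indices in `idx` could then be used to select rows of an n-gram occurrence matrix and the matching entries of the label vector.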