Nothing
#' @title Functional Motif Discovery
#'
#' @description
#' The `discoverMotifs` function facilitates the discovery of recurring patterns, or motifs, within functional data by employing two sophisticated algorithms:
#' \code{ProbKMA} (Probabilistic K-means with Local Alignment) and \code{funBIalign}.
#' These algorithms are designed to identify and cluster functional motifs across multiple curves, leveraging advanced clustering and alignment techniques to handle complex data structures.
#'
#' \code{ProbKMA} integrates probabilistic clustering with local alignment strategies, enabling the detection of motifs that exhibit variability in both shape and position across different curves.
#' This method is particularly adept at handling noisy data and motifs that may appear at varying scales or locations within the curves.
#'
#' On the other hand, \code{funBIalign} utilizes hierarchical clustering based on mean squared residue scores to uncover motifs.
#' This approach effectively captures the additive nature of functional motifs, considering both portion-specific adjustments and time-varying components to accurately identify recurring patterns.
#'
#' By providing a flexible interface that accommodates different clustering paradigms, `discoverMotifs` empowers users to perform robust motif discovery tailored to their specific data characteristics and analytical requirements.
#' Whether opting for the probabilistic and alignment-focused \code{ProbKMA} or the hierarchical and residue-based \code{funBIalign}, users can leverage these methods to extract meaningful and interpretable motifs from their functional datasets.
#'
#' @details
#' The `discoverMotifs` function dynamically switches between two advanced motif discovery algorithms based on the user's specification.
#' Each algorithm employs distinct strategies to identify and cluster motifs within functional data, offering flexibility and adaptability to various analytical scenarios.
#'
#' @section Theoretical Background for ProbKMA:
#'
#' \code{ProbKMA} is inspired by methodologies prevalent in bioinformatics, particularly those involving local alignment techniques extended from high-similarity seeds.
#' This algorithm combines fuzzy clustering approaches with local alignment strategies to effectively minimize a generalized least squares functional.
#' The minimization process can incorporate both the levels and derivatives of the curves through a Sobolev-based distance metric, enhancing the algorithm's sensitivity to both shape and rate changes in the data.
#'
#' Throughout its iterative process, \code{ProbKMA} refines motif centers, membership probabilities, and alignment shifts, making it highly effective for capturing complex motif structures and motifs distributed across multiple curves.
#' This ensures that the discovered motifs are both representative and robust against variations and noise within the functional data.
#'
#' @section Theoretical Background for funBIalign:
#'
#' \code{funBIalign} models functional motifs as an additive combination of motif means, portion-specific adjustments, and time-varying components.
#' The algorithm constructs a hierarchical dendrogram utilizing the generalized mean squared residue score (fMSR) to identify candidate motifs across curves.
#'
#' A critical aspect of \code{funBIalign} is its post-processing step, which filters out redundant motifs and refines the final selection to ensure that only the most significant and representative motifs are retained.
#' This hierarchical approach allows for a nuanced identification of motifs, capturing both broad and subtle patterns within the data.
#'
#' @section Common Parameters:
#' The following parameters are common to both \code{ProbKMA} and \code{funBIalign} algorithms:
#' \describe{
#' \item{\code{Y0}}{A list containing \code{N} vectors (for univariate curves) or \code{N} matrices (for multivariate curves) representing the functional data. Each curve is evaluated on a uniform grid, ensuring consistency across the dataset.}
#' \item{\code{method}}{A character string specifying the motif discovery algorithm to use. Acceptable values are \code{"ProbKMA"} for Probabilistic K-means with Local Alignment and \code{"funBIalign"} for Functional Bi-directional Alignment.}
#' \item{\code{stopCriterion}}{A character string indicating the convergence criterion for the selected algorithm. For \code{ProbKMA}, options include \code{"max"}, \code{"mean"}, or \code{"quantile"} based on the Bhattacharyya distance between memberships in successive iterations. For \code{funBIalign}, options are \code{"fMRS"} (functional Mean Squared Residue) or \code{"Variance"} to guide the ranking of motifs.}
#' \item{\code{name}}{A character string specifying the name of the output directory where results will be saved. This facilitates organized storage and easy retrieval of analysis results.}
#' \item{\code{plot}}{A logical value indicating whether to generate and save plots of the discovered motifs and clustering results. When set to \code{TRUE}, visualizations are produced to aid in the qualitative assessment of the motif discovery process.}
#' \item{\code{worker_number}}{An integer specifying the number of CPU cores to utilize for parallel computations. By default, the function uses the total number of available cores minus one, optimizing computational efficiency without overloading the system.}
#' }
#'
#' @section ProbKMA Options:
#' The following parameters are specific to the \code{ProbKMA} algorithm:
#' \describe{
#' \item{\code{K}}{An integer or vector specifying the number of motifs to be discovered. It can be a single integer for uniform motif discovery or a vector for specifying different numbers of motifs.}
#' \item{\code{c}}{An integer or vector indicating the minimum motif lengths. This ensures that each discovered motif meets a specified minimum length requirement, maintaining the integrity of motif structures.}
#' \item{\code{c_max}}{An integer or vector specifying the maximum motif lengths, allowing control over the upper bounds of motif sizes to prevent excessively long motifs.}
#' \item{\code{diss}}{A character string defining the dissimilarity measure to use. Possible values include \code{"d0_L2"}, \code{"d1_L2"}, and \code{"d0_d1_L2"}, which determine how the algorithm quantifies differences between motifs based on level and derivative information.}
#' \item{\code{alpha}}{A numeric value between 0 and 1 that serves as a weight parameter between \code{d0_L2} and \code{d1_L2} when using \code{d0_d1_L2}. An \code{alpha} of 0 emphasizes \code{d0_L2}, while an \code{alpha} of 1 emphasizes \code{d1_L2}, allowing for balanced consideration of both metrics.}
#' \item{\code{w}}{A numeric vector specifying the weight for the dissimilarity index across different dimensions. All values must be positive, enabling the algorithm to prioritize certain dimensions over others based on their relative importance.}
#' \item{\code{m}}{A numeric value greater than 1 that acts as the weighting exponent in the least-squares functional method. This parameter influences the sensitivity of the algorithm to differences in motif alignment and membership probabilities.}
#' \item{\code{iter_max}}{An integer specifying the maximum number of iterations allowed for the algorithm to converge. This prevents excessive computation time by limiting the number of optimization steps.}
#' \item{\code{quantile}}{A numeric value representing the quantile probability used when \code{stopCriterion} is set to \code{"quantile"}. This determines the threshold for convergence based on the distribution of Bhattacharyya distances.}
#' \item{\code{tol}}{A numeric value specifying the tolerance level for convergence. The algorithm stops iterating if the change in the stop criterion falls below this threshold, ensuring precise and stable convergence.}
#' \item{\code{iter4elong}}{An integer indicating the number of iterations after which motif elongation is performed. If set to a value greater than \code{iter_max}, no elongation is performed. Motif elongation allows the algorithm to extend motifs to better fit the data.}
#' \item{\code{tol4elong}}{A numeric value defining the tolerance on the Bhattacharyya distance for motif elongation. This parameter controls how much the objective function can increase during elongation, ensuring that motif extensions do not degrade the overall fit.}
#' \item{\code{max_elong}}{A numeric value representing the maximum elongation allowed in a single iteration, expressed as a percentage of the motif length. This prevents excessive extension of motifs in any single step.}
#' \item{\code{trials_elong}}{An integer specifying the number of elongation trials (equispaced) on each side of the motif in a single iteration. Multiple trials enhance the robustness of motif elongation by exploring various extension possibilities.}
#' \item{\code{deltaJK_elong}}{A numeric value indicating the maximum relative increase in the objective function permitted during motif elongation. This ensures that elongation steps contribute positively to the motif fitting process.}
#' \item{\code{max_gap}}{A numeric value defining the maximum gap allowed in each alignment as a percentage of the motif length. This parameter controls the allowable discontinuity between aligned motifs, maintaining coherence in motif placement.}
#' \item{\code{iter4clean}}{An integer specifying the number of iterations after which motif cleaning is performed. If set to a value greater than \code{iter_max}, no cleaning is performed. Motif cleaning removes redundant or poorly fitting motifs to refine the final motif set.}
#' \item{\code{tol4clean}}{A numeric value representing the tolerance on the Bhattacharyya distance for motif cleaning. This parameter determines the threshold for identifying and removing redundant motifs during the cleaning process.}
#' \item{\code{quantile4clean}}{A numeric value specifying the dissimilarity quantile used for motif cleaning. This quantile determines which motifs are considered sufficiently dissimilar to be retained in the final set.}
#' \item{\code{return_options}}{A logical value indicating whether to return the options passed to the \code{ProbKMA} method. When set to \code{TRUE}, users receive detailed information about the algorithm's configuration, facilitating transparency and reproducibility.}
#' \item{\code{Y1}}{A list of derivative curves used if the dissimilarity measure \code{"d0_d1_L2"} is selected. These derivatives enhance the algorithm's ability to capture both shape and rate changes in the functional data.}
#' \item{\code{P0}}{An initial membership matrix (N x K), where N is the number of curves and K is the number of clusters. If set to \code{NULL}, a random matrix is generated, initiating the probabilistic clustering process.}
#' \item{\code{S0}}{An initial shift warping matrix (N x K). If set to \code{NULL}, a random matrix is generated to initialize the alignment process, allowing motifs to adapt to variations in the data.}
#' \item{\code{n_subcurves}}{An integer specifying the number of splitting subcurves used when the number of curves is equal to one. This parameter allows the algorithm to handle single-curve datasets by dividing them into manageable segments for motif discovery.}
#' \item{\code{sil_threshold}}{A numeric value representing the threshold to filter candidate motifs based on their silhouette scores. This ensures that only motifs with sufficient clustering quality are retained in the final results.}
#' \item{\code{set_seed}}{A logical value indicating whether to set a random seed for reproducibility. When set to \code{TRUE}, the function initializes the random number generator to ensure consistent results across multiple runs.}
#' \item{\code{seed}}{An integer specifying the random seed used for initialization when \code{set_seed} is \code{TRUE}. This parameter guarantees reproducibility of the clustering and alignment processes.}
#' \item{\code{exe_print}}{A logical value determining whether to print execution details for each iteration. When set to \code{TRUE}, users receive real-time feedback on the algorithm's progress, aiding in monitoring and debugging.}
#' \item{\code{V_init}}{A list of motif sets provided as specific initializations for clustering rather than using random initializations. The `V_init` parameter allows users to provide a set of motifs as starting points for the algorithm, instead of relying on random initialization. If `n_init` is specified as greater than the number of motifs given in `V_init`, the remaining initializations will be randomly generated. For example, if `n_init = 10` but only 5 motif sets are given in `V_init`, the algorithm will use these 5 initializations and generate an additional 5 randomly.}
#' \item{\code{transformed}}{A logical value indicating whether to normalize the curve segments to the interval [0,1] before applying the dissimilarity measure. Setting `transformed = TRUE` scales each curve segment between 0 and 1, which allows for the identification of motifs with consistent shapes but different amplitudes. This normalization is useful for cases where motif occurrences may vary in amplitude but have similar shapes, enabling better pattern recognition across diverse data scales.}
#' \item{\code{n_init_motif}}{The number of initial motif sets from `V_init` to be used directly as starting points in clustering. If `n_init_motif` is set to a value larger than the number of motifs provided in `V_init`, additional initializations will be generated randomly to meet the specified number. For example, if `n_init = 10` and `n_init_motif = 5` with only 3 motif sets in `V_init`, the algorithm will use these 3 sets and generate 7 additional random initializations.}
#' }
#'
#' @section funBIalign Options:
#' The following parameters are specific to the \code{funBIalign} algorithm:
#' \describe{
#' \item{\code{portion_len}}{An integer specifying the length of curve portions to align. This parameter controls the granularity of alignment, allowing the algorithm to focus on specific segments of the curves for motif discovery.}
#' \item{\code{min_card}}{An integer representing the minimum cardinality of motifs, i.e., the minimum number of motif occurrences required for a motif to be considered valid. This ensures that only motifs with sufficient representation across the dataset are retained.}
#' \item{\code{cut_off}}{A double that specifies the number of top-ranked motifs to keep based on the ranking criteria, facilitating focused visualization of the most significant motifs.
#' In particular, all motifs that rank below the cut_off are retained.}
#' }
#'
#' @param Y0 A list containing N vectors (for univariate curves) or N matrices (for multivariate curves) representing the functional data.
#' @param method A character string specifying the motif discovery algorithm to use. Acceptable values are "ProbKMA" for Probabilistic K-means with Local Alignment and "funBIalign" for Functional Bi-directional Alignment.
#' @param stopCriterion A character string indicating the convergence criterion for the selected algorithm.
#' @param name A character string specifying the name of the output directory where results will be saved.
#' @param plot A logical value indicating whether to generate and save plots of the discovered motifs and clustering results.
#' @param probKMA_options A list of options specific to the ProbKMA algorithm.
#' @param funBIalign_options A list of options specific to the funBIalign algorithm.
#' @param worker_number An integer specifying the number of CPU cores to utilize for parallel computations.
#'
#' @return
#' A list containing the discovered motifs and their corresponding statistics, tailored to the selected method:
#' \describe{
#' \item{\code{motifs}}{A list of identified motifs, each containing the motif's representative curve, membership probabilities, and alignment information.}
#' \item{\code{statistics}}{Detailed statistics for each motif, including measures such as silhouette scores, variance explained, and other relevant metrics that quantify the quality and significance of the discovered motifs.}
#' \item{\code{parameters}}{The final parameters and configurations used during the motif discovery process, providing transparency and facilitating reproducibility of the results.}
#' \item{\code{plots}}{If \code{plot = TRUE}, this component contains the generated plots visualizing the motifs and their distribution across the functional data.}
#' }
#'
#' @examples
#' \donttest{
#' # Example 1: Discover motifs using ProbKMA
#'
#' # Define dissimilarity measure and weight parameter
#' diss <- 'd0_d1_L2'
#' alpha <- 0.5
#'
#' # Define number of motifs and their minimum lengths
#' K <- c(2, 3)
#' c <- c(61, 51)
#' n_init <- 10
#'
#' # Load simulated data
#' data("simulated200")
#'
#' # Perform motif discovery using ProbKMA
#' results <- funMoDisco::discoverMotifs(
#' Y0 = simulated200$Y0,
#' method = "ProbKMA",
#' stopCriterion = "max",
#' name = tempdir(),
#' plot = TRUE,
#' probKMA_options = list(
#' Y1 = simulated200$Y1,
#' K = K,
#' c = c,
#' n_init = n_init,
#' diss = diss,
#' alpha = alpha
#' ),
#' worker_number = NULL
#' )
#'
#' # Modify silhouette threshold and re-run post-processing
#' results <- funMoDisco::discoverMotifs(
#' Y0 = simulated200$Y0,
#' method = "ProbKMA",
#' stopCriterion = "max",
#' name = tempdir(),
#' plot = TRUE,
#' probKMA_options = list(
#' Y1 = simulated200$Y1,
#' K = K,
#' c = c,
#' n_init = n_init,
#' diss = diss,
#' alpha = alpha,
#' sil_threshold = 0.5
#' ),
#' worker_number = NULL
#' )
#'
#' # Example 2: Discover motifs using funBIalign
#' results_funbialign <- funMoDisco::discoverMotifs(
#' Y0 = simulated200$Y0,
#' method = "funBIalign",
#' stopCriterion = 'Variance',
#' name = tempdir(),
#' plot = TRUE,
#' funBIalign_options = list(
#' portion_len = 60,
#' min_card = 3,
#' cut_off = 1.0
#' )
#' )
#' }
#'
#' @seealso
#' \strong{ProbKMA}:
#' \href{https://arxiv.org/pdf/1808.04773}{Probabilistic K-means with Local Alignment} \cr
#' \strong{funBIalign}:
#' \href{https://arxiv.org/pdf/2306.04254}{Hierarchical Clustering with Mean Squared Residue Scores}.
#'
#' @export
discoverMotifs <- function(Y0,method,stopCriterion,name,plot,
probKMA_options = list(),
funBIalign_options = list(portion_len = NULL,min_card = NULL,cut_off=NULL),
worker_number = NULL){
oldpar <- par(no.readonly = TRUE)
on.exit(par(oldpar))
### set parallel jobs #############################################################################
core_number <- parallel::detectCores()
# check worker number
if(!is.null(worker_number)){
if(!is.numeric(worker_number)){
warning('Worker number not valid. Selecting default number.')
worker_number=NULL
}else{
if((worker_number%%1!=0)|(worker_number<1)|(worker_number>core_number)){
warning('Worker number not valid. Selecting default number.')
worker_number=NULL
}
}
} else {
worker_number <- core_number-1
}
rm(core_number)
if(method == 'ProbKMA')
{
pb <- progress_bar$new(
format = "Progress [:bar] :percent | Elapsed: :elapsed | ETA: :eta",
total = 5,
width = 50
)
default_probKMA_options <- list(K=NULL,c=NULL,diss="d0_L2",alpha=0,
Y1=NULL,P0=matrix(),S0=matrix(),
n_init=10,return_options=TRUE,names_var='x(t)',
standardize=FALSE,c_max=Inf,
iter_max=1e3,iter4elong=100,
trials_elong=201,max_gap=0.2,
quantile=0.25,tol=1e-8,
tol4elong=1e-3,max_elong=0.5,
deltaJK_elong=0.05,iter4clean=50,
tol4clean=1e-4,quantile4clean=0.5,
m=2,w=1,transformed = FALSE,silhouette_align=FALSE,
n_subcurves = 10,sil_threshold=0.9,
seed = 1,exe_print = FALSE,set_seed = FALSE,
V_init=NULL, n_init_motif = 0)
probKMA_options <- modifyList(default_probKMA_options, probKMA_options)
Y1 <- probKMA_options$Y1
P0 <- probKMA_options$P0
S0 <- probKMA_options$S0
K <- probKMA_options$K
c <- probKMA_options$c
n_init <- probKMA_options$n_init
diss <- probKMA_options$diss
alpha <- probKMA_options$alpha
return_options <- probKMA_options$return_options
names_var <- probKMA_options$names_var
seed <- probKMA_options$seed
exe_print <- probKMA_options$exe_print
set_seed <- probKMA_options$set_seed
V_init <- probKMA_options$V_init
n_init_motif <- probKMA_options$n_init_motif
### check input #############################################################################################
# check name
if(!is.character(name))
stop('name should be a string.')
if(grepl(' ',name)){
warning('name contains spaces. Changing spacing in underscores.')
name=gsub(" ","_",name,fixed=TRUE)
}
if (substr(name, nchar(name), nchar(name)) != "/") {
name <- paste0(name, "/")
}
files = list.files(name)
if('find_candidate_motifs_results.RData' %in% files) {
# candidate motifs already present, load them
load(paste0(name,'find_candidate_motifs_results.RData'))
} else {
### check on n_init_motif (to be checked) ############
if(n_init_motif > n_init)
{
warning('n_init_motif greater than n_init: setting n_init_motif = n_init')
n_init_motif = n_init
}
if(n_init_motif%%1!=0)
stop('Number of initializations n_init_motif should be an integer.')
if(n_init_motif<0)
stop('Number of initializations n_init_motif should be positive.')
######################################################
# check required input
if(missing(K))
stop('K must be specified')
if(missing(c))
stop('c must be specified')
# check K
if(!is.vector(K))
stop('K should be a vector.')
# check c
if(!is.vector(c))
stop('c should be a vector.')
# check n_init
if(n_init%%1!=0)
stop('Number of initializations n_init should be an integer.')
if(n_init<1)
stop('Number of initializations n_init should be at least 1.')
#### new check V_init #########################################
d = ncol(Y0[[1]])
if(n_init_motif > 0)
{
for(i in 1:length(K))
{
for(j in 1:length(c))
{
if(length(V_init[[i]][[j]]) < n_init_motif)
{
stop('More initial motifs needed')
}
else
{
if(length(V_init[[i]][[j]]) > n_init_motif)
{
warning('Initial motifs passed more than necessary. Resize V_init to the correct size!')
V_init[[i]][[j]] <- V_init[[i]][[j]][1:n_init_motif]
}
for(e in 1:(n_init-n_init_motif))
{
V_init[[i]][[j]] <- append(V_init[[i]][[j]], list(NULL))
}
for(k in 1:n_init_motif)
{
if(length(V_init[[i]][[j]][[k]])!= K[i])
{
stop('Number of initial motifs differs from K')
}
else
{
for(h in 1:K[i])
{
if(nrow(V_init[[i]][[j]][[k]][[h]]$v0)!= c[j] || ncol(V_init[[i]][[j]][[k]][[h]]$v0)!= d)
{
stop('Uncorrect dimensions of initial motifs provided')
}
else
{
if(probKMA_options$diss == "d0_d1_L2" || probKMA_options$diss == "d1_L2")
{
if(nrow(V_init[[i]][[j]][[k]][[h]]$v1)!= c[j] || ncol(V_init[[i]][[j]][[k]][[h]]$v0)!= d)
{
stop('Uncorrect dimensions of initial motifs provided')
}
}
V_init_bool = TRUE
}
}
}
}
}
}
}
}
##################################
# check Y0
if(missing(Y0))
stop('Y0 must be specified.')
if(!is.null(Y0))
if(!is.list(Y0))
stop('Y0 should be a list of vectors or matrices.')
if((FALSE %in% lapply(Y0,is.matrix))&&(FALSE %in% lapply(Y0,is.vector)))
stop('Y0 should be a list of vectors or matrices.')
N=length(Y0) # number of curves
if(N!=1 && N<5)
stop('More curves y_i(x) needed.')
Y0=lapply(Y0,as.matrix)
# set return_options=TRUE
if(!is.null(probKMA_options$return_options)){
if(probKMA_options$return_options==FALSE){
warning('Setting return_option=TRUE')
probKMA_options$return_options=TRUE
}
}
arguments = list(Y0 = Y0, Y1 = Y1, P0 = P0, S0 = S0,
standardize = probKMA_options$standardize, c_max = probKMA_options$c_max,
iter_max = probKMA_options$iter_max, iter4elong = probKMA_options$iter4elong,
trials_elong = probKMA_options$trials_elong, return_options = return_options,
alpha = alpha, max_gap = probKMA_options$max_gap,
quantile = probKMA_options$quantile, stopCriterion = stopCriterion, tol = probKMA_options$tol,
tol4elong =probKMA_options$tol4elong,
max_elong = probKMA_options$max_elong, deltaJK_elong = probKMA_options$deltaJK_elong,
iter4clean = probKMA_options$iter4clean,
tol4clean = probKMA_options$tol4clean, m = probKMA_options$m, w = probKMA_options$w,
seed = seed, K = NULL, c = NULL,
quantile4clean = probKMA_options$quantile4clean,
exe_print = exe_print, set_seed = set_seed,
diss = diss, transformed = probKMA_options$transformed, V_init = NULL,
align = probKMA_options$silhouette_align,
n_threads = worker_number)
if(worker_number>1){
cl_find=parallel::makeCluster(worker_number,timeout=60*60*24*30)
parallel::clusterExport(cl_find,c('name','names_var',
'probKMA_plot','probKMA_silhouette_plot',
'.mapply_custom','.diss_d0_d1_L2','.domain',
'.select_domain','.find_min_diss',
'probKMA_wrap','arguments'),envir=environment())
parallel::clusterCall(cl_find, function() {
combinat::combn
})
on.exit(parallel::stopCluster(cl_find))
}else{
cl_find=NULL
}
pb$update(0.1)
### run probKMA ##########################################################################################
i_c_K = expand.grid(seq_len(n_init),c,K)
vector_seed = seq(1,length(i_c_K$Var1))
if(n_init_motif > 0 && V_init_bool)
{
V_init_unlist=unlist(unlist(V_init,recursive=FALSE),recursive=FALSE)
results=tryCatch({.mapply_custom(cl_find,function(K,c,i,v_init,small_seed){
dir.create(paste0(name,"K",K,"_c",c),showWarnings=FALSE,recursive = TRUE)
files=list.files(paste0(name,"K",K,"_c",c))
if(paste0('random',i,'.RData') %in% files){
load(paste0(name,"K",K,"_c",c,'/random',i,'.RData'))
message("K",K,"_c",c,'_random',i,' loaded')
return(list(probKMA_results=probKMA_results,
time=time,silhouette=silhouette))
}else{
iter=iter_max=1
message("K",K,"_c",c,'_random',i,' running')
while(iter==iter_max){
start=proc.time()
if(arguments$set_seed)
{
arguments$seed = small_seed
set.seed(small_seed)
}
small_seed = small_seed + 1
arguments$K = K
arguments$c = c
arguments$quantile4clean = 1/K
arguments$V_init = v_init
results = do.call(probKMA_wrap,arguments)
probKMA_results = results[[1]]
silhouette_results = results[[2]]
end=proc.time()
time=end-start
iter=probKMA_results$iter
iter_max=arguments$iter_max
if(iter==iter_max)
warning('Maximum number of iteration reached. Re-starting.')
}
if(plot) {
transformed=probKMA_results$transformed
pdf(paste0(name,"K",K,"_c",c,'/random',i,'.pdf'),width=20,height=10)
probKMA_plot(probKMA_results,ylab=names_var,plot = plot,cleaned=FALSE, transformed = arguments$transformed)
dev.off()
pdf(paste0(name,"K",K,"_c",c,'/random',i,'clean.pdf'),width=20,height=10)
probKMA_plot(probKMA_results,ylab=names_var,sil_avg = silhouette_results[[4]],plot = plot,cleaned=TRUE, transformed = arguments$transformed)
dev.off()
pdf(paste0(name,"K",K,"_c",c,'/random',i,'silhouette.pdf'),width=7,height=10)
silhouette = probKMA_silhouette_plot(silhouette_results,K,plot = plot)
dev.off()
} else {
silhouette = probKMA_silhouette_plot(silhouette_results,K,plot = FALSE)
}
save(probKMA_results,time,silhouette,
file=paste0(name,"K",K,"_c",c,'/random',i,'.RData'))
return(list(probKMA_results=probKMA_results,
time=time,silhouette=silhouette))
}
},i_c_K[,3],i_c_K[,2],i_c_K[,1],V_init_unlist,vector_seed,SIMPLIFY=FALSE)},error = function(e){
stop(e$message)
})
}
if(n_init_motif == 0 || !V_init_bool)
{
if(N == 1) {
.split_curve_randomly <- function(curve, expected_motif_length, n_subcurves, randomness_factor = 0.3) {
length_curve <- nrow(curve) # Number of observations in the curve
# Step 1: Randomly generate candidate split points
candidate_splits <- sort(sample(seq(expected_motif_length, length_curve - expected_motif_length, 1), n_subcurves * 2))
# Step 2: Adjust split points with randomness while respecting the minimum length requirement
split_points <- c(1) # Start point
last_split <- 1
for (i in candidate_splits) {
# Check if we can add a split while respecting minimum length
if (i - last_split >= expected_motif_length) {
segment <- curve[last_split:i, ]
valid_indices <- which(!is.na(as.matrix(segment)[, 1])) # Indices of valid values
if (length(valid_indices) > 0) {
valid_segments <- split(valid_indices, cumsum(c(1, diff(valid_indices) != 1))) # Split into contiguous valid segments
valid_segment <- FALSE
for (seg in valid_segments) {
if (length(seg) >= expected_motif_length) {
valid_segment <- TRUE
break
}
}
# Introduce randomness in adding split points
if (valid_segment && runif(1) < randomness_factor) {
split_points <- c(split_points, i)
last_split <- i
}
}
}
}
# Add the last segment if it meets the minimum length requirement
if (length_curve - last_split >= expected_motif_length) {
split_points <- c(split_points, length_curve)
} else {
split_points[length(split_points)] <- length_curve
}
split_points <- unique(split_points)
# Step 3: Adjust the number of split points to match n_subcurves
if (length(split_points) - 1 > n_subcurves) {
# Too many splits, reduce by merging closest split points
while (length(split_points) - 1 > n_subcurves) {
gaps <- diff(split_points)
min_gap_index <- which.min(gaps)
split_points <- split_points[-(min_gap_index + 1)] # Remove the smaller split
}
} else if (length(split_points) - 1 < n_subcurves) {
# Too few splits, add new split points in the largest gaps between existing points
additional_splits_needed <- n_subcurves - (length(split_points) - 1)
while (additional_splits_needed > 0) {
gaps <- diff(split_points)
max_gap_index <- which.max(gaps)
new_split <- round(mean(split_points[max_gap_index:(max_gap_index + 1)]))
if (new_split - split_points[max_gap_index] >= expected_motif_length &&
split_points[max_gap_index + 1] - new_split >= expected_motif_length) {
split_points <- sort(c(split_points, new_split))
additional_splits_needed <- additional_splits_needed - 1
} else {
stop("Failed to find valid split points. Ensure each subcurve meets the minimum length requirement.")
}
}
}
# Step 4: Create subcurves based on valid split points
subcurves <- list()
for (i in 1:(length(split_points) - 1)) {
subcurve <- curve[split_points[i]:split_points[i + 1], ]
if (nrow(as.matrix(subcurve)) > 0) {
subcurves[[paste0("c", i)]] <- subcurve
}
}
return(list(subcurves = subcurves, split_points = split_points))
}
n_subcurves <- probKMA_options$n_subcurves
if(arguments$diss == 'd0_d1_L2') {
split_results <- lapply(seq_len(n_init),function(null){.split_curve_randomly(arguments$Y0[[1]], max(c),n_subcurves, randomness_factor = 0.3)})
Y0_subcurves <- lapply(split_results,function(splitted_curve){splitted_curve[[1]]})
Y1_subcurves <- lapply(seq_len(n_init),function(j){
temp <- list()
for (i in seq_along(split_results$split_points)[-length(split_results$split_points)])
temp[[i]] <- split_results[[j]][[1]][pos[i]:pos[i + 1]]
return(temp)
})
}
else if(arguments$diss == 'd0_L2') {
split_results <- lapply(seq_len(n_init),function(null){.split_curve_randomly(arguments$Y0[[1]], max(c),n_subcurves, randomness_factor = 0.3)})
Y0_subcurves <- lapply(split_results,function(splitted_curve){splitted_curve[[1]]})
Y1_subcurves <- vector("list",n_init)
}
else if(arguments$diss == 'd1_L2') {
Y0_subcurves <- vector("list",n_init)
split_results <- lapply(seq_len(n_init),function(null){.split_curve_randomly(arguments$Y1[[1]], max(c),n_subcurves, randomness_factor = 0.3)})
Y1_subcurves <- lapply(split_results,function(splitted_curve){splitted_curve[[1]]})
}
results=tryCatch({.mapply_custom(cl_find, function(K,c,i,Y0_curve,Y1_curve,small_seed){
dir.create(paste0(name,"K",K,"_c",c),showWarnings=TRUE,recursive = TRUE)
files=list.files(paste0(name,"K",K,"_c",c))
if(paste0('random',i,'.RData') %in% files){
load(paste0(name,"K",K,"_c",c,'/random',i,'.RData'))
message(paste("\033[34mK", K, "_c", c, "_random", i, "loaded\033[0m"))
return(list(probKMA_results=probKMA_results,
time=time,silhouette=silhouette))
}else{
iter=iter_max=1
message(paste("\033[34mK", K, "_c", c, "_random", i, "running\033[0m"))
if(arguments$set_seed)
{
arguments$seed = small_seed
set.seed(small_seed)
}
while(iter==iter_max){
start=proc.time()
small_seed = small_seed + 1
arguments$Y0 <- Y0_curve
arguments$Y1 <- Y1_curve
arguments$K = K
arguments$c = c
arguments$quantile4clean = 1/K
results = do.call(probKMA_wrap,arguments)
probKMA_results = results[[1]]
silhouette_results = results[[2]]
end=proc.time()
time=end-start
iter=probKMA_results$iter
iter_max=arguments$iter_max
if(iter==iter_max)
warning('Maximum number of iteration reached. Re-starting.')
}
if(plot) {
pdf(paste0(name,"K",K,"_c",c,'/random',i,'.pdf'),width=20,height=10)
probKMA_plot(probKMA_results,
plot = TRUE,
ylab=names_var,
sil_avg = silhouette_results[[4]],
cleaned=FALSE,
transformed = arguments$transformed)
dev.off()
pdf(paste0(name,"K",K,"_c",c,'/random',i,'clean.pdf'),width=20,height=10)
probKMA_plot(probKMA_results,
plot = TRUE,
ylab=names_var,
sil_avg = silhouette_results[[4]],
cleaned=TRUE,
transformed = arguments$transformed)
dev.off()
pdf(paste0(name,"K",K,"_c",c,'/random',i,'silhouette.pdf'),width=7,height=10)
silhouette = probKMA_silhouette_plot(silhouette_results,K,plot = plot)
dev.off()
} else {
silhouette = probKMA_silhouette_plot(silhouette_results,K,plot = FALSE)
}
probKMA_results["V_init"] <- NULL # delete V_init components
save(probKMA_results,time,silhouette,
file=paste0(name,"K",K,"_c",c,'/random',i,'.RData'))
return(list(probKMA_results=probKMA_results,
time=time,silhouette=silhouette))
}
},i_c_K[,3],i_c_K[,2],i_c_K[,1],Y0_subcurves,Y1_subcurves,vector_seed,SIMPLIFY=FALSE)},error = function(e){
stop(e$message)
})
} else {
results=tryCatch({.mapply_custom(cl_find, function(K,c,i,small_seed){
dir.create(paste0(name,"K",K,"_c",c),showWarnings=TRUE,recursive = TRUE)
files=list.files(paste0(name,"K",K,"_c",c))
if(paste0('random',i,'.RData') %in% files){
load(paste0(name,"K",K,"_c",c,'/random',i,'.RData'))
message(paste("\033[34mK", K, "_c", c, "_random", i, "loaded\033[0m"))
return(list(probKMA_results=probKMA_results,
time=time,silhouette=silhouette))
}else{
iter=iter_max=1
message(paste("\033[34mK", K, "_c", c, "_random", i, "running\033[0m"))
if(arguments$set_seed)
{
arguments$seed = small_seed
set.seed(small_seed)
}
while(iter==iter_max){
start=proc.time()
small_seed = small_seed + 1
arguments$K = K
arguments$c = c
arguments$quantile4clean = 1/K
results = do.call(probKMA_wrap,arguments)
probKMA_results = results[[1]]
silhouette_results = results[[2]]
end=proc.time()
time=end-start
iter=probKMA_results$iter
iter_max=arguments$iter_max
if(iter==iter_max)
warning('Maximum number of iteration reached. Re-starting.')
}
if(plot) {
pdf(paste0(name,"K",K,"_c",c,'/random',i,'.pdf'),width=20,height=10)
probKMA_plot(probKMA_results,
plot = TRUE,
ylab=names_var,
sil_avg = silhouette_results[[4]],
cleaned=FALSE,
transformed = arguments$transformed)
dev.off()
pdf(paste0(name,"K",K,"_c",c,'/random',i,'clean.pdf'),width=20,height=10)
probKMA_plot(probKMA_results,
plot = TRUE,
ylab=names_var,
sil_avg = silhouette_results[[4]],
cleaned=TRUE,
transformed = arguments$transformed)
dev.off()
pdf(paste0(name,"K",K,"_c",c,'/random',i,'silhouette.pdf'),width=7,height=10)
silhouette = probKMA_silhouette_plot(silhouette_results,K,plot = plot)
dev.off()
} else {
silhouette = probKMA_silhouette_plot(silhouette_results,K,plot = FALSE)
}
probKMA_results["V_init"] <- NULL # delete V_init components
save(probKMA_results,time,silhouette,
file=paste0(name,"K",K,"_c",c,'/random',i,'.RData'))
return(list(probKMA_results=probKMA_results,
time=time,silhouette=silhouette))
}
},i_c_K[,3],i_c_K[,2],i_c_K[,1],vector_seed,SIMPLIFY=FALSE)},error = function(e){
stop(e$message)
})
}
}
results=split(results,list(factor(i_c_K[,2],c),factor(i_c_K[,3],K)))
results=split(results,rep(K,each=length(c)))
### plot silhouette average #################################################################################
silhouette_average_sd=lapply(results,
function(results){
silhouette_average=lapply(results,
function(results){
silhouette_average=numeric(n_init)
silhouette_sd=numeric(n_init)
for(i in seq_len(n_init)){
silhouette_average[i]=mean(results[[i]]$silhouette$silhouette_average)
silhouette_sd[i]=sd(results[[i]]$silhouette$silhouette_average)
}
return(cbind(silhouette_average,silhouette_sd))
})
})
if(plot){
pdf(paste0(name,'silhouette.pdf'),width=7,height=5)
for(i in seq_along(K)){
silhouette_average_plot=matrix(unlist(lapply(silhouette_average_sd[[i]],function(average_sd) average_sd[order(average_sd[,1],decreasing=TRUE),1])),ncol=length(c))
silhouette_sd_plot=matrix(unlist(lapply(silhouette_average_sd[[i]],function(average_sd) average_sd[order(average_sd[,1],decreasing=TRUE),2])),ncol=length(c))
silhouette_order=matrix(unlist(lapply(silhouette_average_sd[[i]],function(average_sd) order(average_sd[,1],decreasing=TRUE))),ncol=length(c))
matplot(matrix(rep(1:n_init,length(c))+rep(seq(-0.1,0.1,length.out=length(c)),each=n_init),ncol=length(c)),
silhouette_average_plot,type='b',pch=16,lty=1,col=1+seq_along(c),ylim=c(0,1),xaxt='n',
xlab='',ylab='Silhouette average',main=paste0('K=',K[i]))
shift=seq(-0.1,0.1,length.out=length(c))
for(ii in seq_along(c)){
segments(1:n_init+shift[ii],silhouette_average_plot[,ii]-silhouette_sd_plot[,ii],
1:n_init+shift[ii],silhouette_average_plot[,ii]+silhouette_sd_plot[,ii],col=ii+1)
segments(1:n_init+shift[ii]-0.1,silhouette_average_plot[,ii]+silhouette_sd_plot[,ii],
1:n_init+shift[ii]+0.1,silhouette_average_plot[,ii]+silhouette_sd_plot[,ii],col=ii+1)
segments(1:n_init+shift[ii]-0.1,silhouette_average_plot[,ii]-silhouette_sd_plot[,ii],
1:n_init+shift[ii]+0.1,silhouette_average_plot[,ii]-silhouette_sd_plot[,ii],col=ii+1)
text(1:n_init+shift[ii],silhouette_average_plot[,ii]-silhouette_sd_plot[,ii],
silhouette_order[,ii],col=ii+1,pos=1,cex=0.7)
}
legend('bottomleft',legend=paste0('c=',c),col=1+seq_along(c),lty=1)
axis(1,1:n_init,labels=rep('',n_init))
mtext("Init",side=1,line=1)
}
dev.off()
}
### plot processing time ####################################################################################
times=lapply(results,
function(results){
times=lapply(results,
function(results)
times=unlist(lapply(results,function(results) results$time[3])))
})
if(plot){
pdf(paste0(name,'times.pdf'),width=7,height=5)
y_max=max(unlist(times))*1.2
times_plot=vector('list',length(K))
for(i in seq_along(K)){
times_plot[[i]]=Reduce(cbind,
lapply(seq_along(c),
function(j)
as.matrix(times[[i]][[j]][order(silhouette_average_sd[[i]][[j]][,1],decreasing=TRUE)]))
)
silhouette_order=matrix(unlist(lapply(silhouette_average_sd[[i]],function(average_sd) order(average_sd[,1],decreasing=TRUE))),ncol=length(c))
matplot(matrix(rep(1:n_init,length(c))+rep(seq(-0.1,0.1,length.out=length(c)),each=n_init),ncol=length(c)),
times_plot[[i]],type='b',pch=16,lty=1,col=1+seq_along(c),ylim=c(0,y_max),xaxt='n',
xlab='',ylab='Time',main=paste0('K=',K[i]))
shift=seq(-0.1,0.1,length.out=length(c))
for(ii in seq_along(c)){
text(1:n_init+shift[ii],times_plot[[i]][,ii],silhouette_order[,ii],col=ii+1,pos=1,cex=0.7)
}
legend('topleft',legend=paste0('c=',c),col=1+seq_along(c),lty=1)
axis(1,1:n_init,labels=rep('',n_init))
mtext("Init",side=1,line=1)
}
for(i in seq_along(K)){
boxplot(times_plot[[i]],col=1+seq_along(c),names=paste0('c=',c),ylim=c(0,y_max),
xlab='',ylab='Times',main=paste0('K=',K[i]))
}
dev.off()
}
### plot dissimilarities ####################################################################################
if(plot){
D=lapply(results,
function(results){
D=lapply(results,
function(results){
D=lapply(results,
function(results){
D=as.vector(results$probKMA_results$D)
})
})
})
pdf(paste0(name,'dissimilarities.pdf'),width=7,height=5)
y_max=max(unlist(D))
for(i in seq_along(K)){
D_plot=matrix(unlist(D[[i]]),ncol=length(c))
boxplot(D_plot,col=1+seq_along(c),names=paste0('c=',c),ylim=c(0,y_max),
xlab='',ylab='Dissimilarity',main=paste0('K=',K[i]))
}
for(i in seq_along(K)){
for(j in seq_along(c)){
silhouette_order=order(silhouette_average_sd[[i]][[j]][,1],decreasing=TRUE)
D_plot=D[[i]][[j]][silhouette_order]
boxplot(D_plot,col=j+1,ylim=c(0,y_max),names=silhouette_order,
xaxt='n',ylab='Dissimilarity',main=paste0('K=',K[i],' c=',c[j]))
axis(1,1:n_init,labels=rep('',n_init))
mtext("Init",side=1,line=1)
}
}
dev.off()
D_clean=lapply(results,
function(results){
D_clean=lapply(results,
function(results){
D_clean=lapply(results,
function(results){
D_clean=as.vector(results$probKMA_results$D_clean)
})
})
})
pdf(paste0(name,'dissimilarities_clean.pdf'),width=7,height=5)
y_max=max(unlist(D_clean))
for(i in seq_along(K)){
D_plot=matrix(unlist(D_clean[[i]]),ncol=length(c))
boxplot(D_plot,col=1+seq_along(c),names=paste0('c=',c),ylim=c(0,y_max),
xlab='',ylab='Dissimilarity',main=paste0('K=',K[i]))
}
for(i in seq_along(K)){
for(j in seq_along(c)){
silhouette_order=order(silhouette_average_sd[[i]][[j]][,1],decreasing=TRUE)
D_plot=D_clean[[i]][[j]][order(silhouette_average_sd[[i]][[j]][,1],decreasing=TRUE)]
boxplot(D_plot,col=j+1,ylim=c(0,y_max),names=silhouette_order,
xaxt='n',ylab='Dissimilarity',main=paste0('K=',K[i],' c=',c[j]))
axis(1,1:n_init,labels=rep('',n_init))
mtext("Init",side=1,line=1)
}
}
dev.off()
}
### plot motif lengths ######################################################################################
if(plot){
motif_length=mapply(function(results){
motif_length=mapply(function(results){
motif_length=as.matrix(Reduce(cbind,lapply(results,
function(results){
unlist(lapply(results$probKMA_results$V0,nrow))
})))
return(as.matrix(motif_length))
},results,SIMPLIFY=FALSE)
},results,SIMPLIFY=FALSE)
pdf(paste0(name,'lengths.pdf'),width=7,height=5)
motif_length_plot=lapply(motif_length,
function(motif_length){
lapply(motif_length,
function(motif_length){
motif_length=motif_length+matrix(rnorm(nrow(motif_length)*n_init,mean=0,sd=0.05),ncol=n_init)
})
})
ymax=max(unlist(motif_length_plot))
for(i in seq_along(K)){
plot(1:n_init,type="n",xaxt='n',xlab='',ylim=c(-1,ymax),ylab='Motif lengths',main=paste0('K=',K[i]))
abline(h=c,col=1+seq_along(c))
axis(1,1:n_init,labels=rep('',n_init))
mtext("Init",side=1,line=1)
shift=seq(-0.1,0.1,length.out=length(c))
silhouette_order=matrix(unlist(lapply(silhouette_average_sd[[i]],function(average_sd) order(average_sd[,1],decreasing=TRUE))),ncol=length(c))
for(ii in seq_along(c)){
points(rep(1:n_init+shift[ii],each=K[i]),
motif_length_plot[[i]][[ii]][,order(silhouette_average_sd[[i]][[ii]][,1],decreasing=TRUE)],col=ii+1,pch=8)
text(rep(1:n_init+shift[ii],each=K[i]),motif_length[[i]][[ii]][,order(silhouette_average_sd[[i]][[ii]][,1],decreasing=TRUE)],
rep(silhouette_order[,ii],each=K[i]),col=ii+1,pos=1,cex=0.7)
}
legend('bottomleft',legend=paste0('c=',c),col=1+seq_along(c),pch=8)
}
dev.off()
motif_clean_length=mapply(function(results){
motif_length=mapply(function(results){
motif_length=as.matrix(Reduce(cbind,lapply(results,
function(results){
unlist(lapply(results$probKMA_results$V0_clean,nrow))
})))
return(as.matrix(motif_length))
},results,SIMPLIFY=FALSE)
},results,SIMPLIFY=FALSE)
pdf(paste0(name,'lengths_clean.pdf'),width=7,height=5)
motif_length_plot=lapply(motif_clean_length,
function(motif_length){
lapply(motif_length,
function(motif_length){
motif_length=motif_length+matrix(rnorm(nrow(motif_length)*n_init,mean=0,sd=0.05),ncol=n_init)
})
})
ymax=max(unlist(motif_length_plot))
for(i in seq_along(K)){
plot(1:n_init,type="n",xaxt='n',xlab='',ylim=c(-1,ymax),ylab='Motif lengths',main=paste0('K=',K[i]))
abline(h=c,col=1+seq_along(c))
axis(1,1:n_init,labels=rep('',n_init))
mtext("Init",side=1,line=1)
shift=seq(-0.1,0.1,length.out=length(c))
silhouette_order=matrix(unlist(lapply(silhouette_average_sd[[i]],function(average_sd) order(average_sd[,1],decreasing=TRUE))),ncol=length(c))
for(ii in seq_along(c)){
points(rep(1:n_init+shift[ii],each=K[i]),
motif_length_plot[[i]][[ii]][,order(silhouette_average_sd[[i]][[ii]][,1],decreasing=TRUE)],col=ii+1,pch=8)
text(rep(1:n_init+shift[ii],each=K[i]),motif_clean_length[[i]][[ii]][,order(silhouette_average_sd[[i]][[ii]][,1],decreasing=TRUE)],
rep(silhouette_order[,ii],each=K[i]),col=ii+1,pos=1,cex=0.7)
}
legend('bottomleft',legend=paste0('c=',c),col=1+seq_along(c),pch=8)
}
dev.off()
pdf(paste0(name,'lengths_perc.pdf'),width=7,height=5)
motif_length_perc=mapply(function(results){
motif_length=mapply(function(results){
motif_length=as.matrix(Reduce(cbind,lapply(results,
function(results){
unlist(lapply(results$probKMA_results$V0,nrow))/results$probKMA_results$c*100
})))
return(as.matrix(motif_length))
},results,SIMPLIFY=FALSE)
},results,SIMPLIFY=FALSE)
motif_length_plot=lapply(motif_length_perc,
function(motif_length){
lapply(motif_length,
function(motif_length){
motif_length=motif_length+matrix(rnorm(nrow(motif_length)*n_init,mean=0,sd=0.05),ncol=n_init)
})
})
ymax=max(unlist(motif_length_plot))
for(i in seq_along(K)){
plot(1:n_init,type="n",xaxt='n',xlab='',ylim=c(-1,ymax),ylab='Motif lengths (% of minimum length)',main=paste0('K=',K[i]))
abline(h=100)
axis(1,1:n_init,labels=rep('',n_init))
mtext("Init",side=1,line=1)
shift=seq(-0.1,0.1,length.out=length(c))
silhouette_order=matrix(unlist(lapply(silhouette_average_sd[[i]],function(average_sd) order(average_sd[,1],decreasing=TRUE))),ncol=length(c))
for(ii in seq_along(c)){
points(rep(1:n_init+shift[ii],each=K[i]),
motif_length_plot[[i]][[ii]][,order(silhouette_average_sd[[i]][[ii]][,1],decreasing=TRUE)],col=ii+1,pch=8)
text(rep(1:n_init+shift[ii],each=K[i]),motif_length_perc[[i]][[ii]][,order(silhouette_average_sd[[i]][[ii]][,1],decreasing=TRUE)],
rep(silhouette_order[,ii],each=K[i]),col=ii+1,pos=1,cex=0.7)
}
legend('bottomleft',legend=paste0('c=',c),col=1+seq_along(c),pch=8)
}
dev.off()
pdf(paste0(name,'lengths_clean_perc.pdf'),width=7,height=5)
motif_clean_length_perc=mapply(function(results){
motif_length=mapply(function(results){
motif_length=as.matrix(Reduce(cbind,lapply(results,
function(results){
unlist(lapply(results$probKMA_results$V0_clean,nrow))/results$probKMA_results$c*100
})))
return(as.matrix(motif_length))
},results,SIMPLIFY=FALSE)
},results,SIMPLIFY=FALSE)
motif_length_plot=lapply(motif_clean_length_perc,
function(motif_length){
lapply(motif_length,
function(motif_length){
motif_length=motif_length+matrix(rnorm(nrow(motif_length)*n_init,mean=0,sd=10^-1),ncol=n_init)
})
})
ymax=max(unlist(motif_length_plot))
for(i in seq_along(K)){
plot(1:n_init,type="n",xaxt='n',xlab='',ylim=c(-1,ymax),ylab='Motif lengths (% of minimum length)',main=paste0('K=',K[i]))
abline(h=100)
axis(1,1:n_init,labels=rep('',n_init))
mtext("Init",side=1,line=1)
shift=seq(-0.1,0.1,length.out=length(c))
silhouette_order=matrix(unlist(lapply(silhouette_average_sd[[i]],function(average_sd) order(average_sd[,1],decreasing=TRUE))),ncol=length(c))
for(ii in seq_along(c)){
points(rep(1:n_init+shift[ii],each=K[i]),
motif_length_plot[[i]][[ii]][,order(silhouette_average_sd[[i]][[ii]][,1],decreasing=TRUE)],col=ii+1,pch=8)
text(rep(1:n_init+shift[ii],each=K[i]),motif_clean_length_perc[[i]][[ii]][,order(silhouette_average_sd[[i]][[ii]][,1],decreasing=TRUE)],
rep(silhouette_order[,ii],each=K[i]),col=ii+1,pos=1,cex=0.7)
}
legend('bottomleft',legend=paste0('c=',c),col=1+seq_along(c),pch=8)
}
dev.off()
}
### output ##################################################################################################
find_candidate_motifs_results <- list(name=name,K=K,c=c,n_init=n_init,silhouette_average_sd=silhouette_average_sd,times=times)
save(find_candidate_motifs_results, file = paste0(name,'find_candidate_motifs_results.RData'))
}
### filter candidate motifs based on silhouette average and size
silhouette_average = Reduce(rbind, Reduce(rbind, find_candidate_motifs_results$silhouette_average_sd))[ , 1] # retrieve silhouette average for all candidate motifs
pb$update(0.5)
filter_candidate_motifs_results = filter_candidate_motifs(find_candidate_motifs_results,
sil_threshold = quantile(silhouette_average, probKMA_options$sil_threshold),
size_threshold = 2)
### cluster candidate motifs based on their distance and select radii
pb$update(0.75)
cluster_candidate_motifs_results = cluster_candidate_motifs(filter_candidate_motifs_results,
motif_overlap = 0.6,
worker_number = worker_number)
cluster_candidate_motifs_results$transformed = probKMA_options$transformed
### plot cluster candidate motifs results
pdf(paste0(name,'FMD_clustering_candidate_motifs.pdf'), height = 12, width = 9)
cluster_candidate_motifs_plot(cluster_candidate_motifs_results, ask = FALSE)
dev.off()
### search selected motifs
cluster_candidate_motifs_results$Y0 <- Y0
cluster_candidate_motifs_results$Y1 <- Y1
pb$update(0.9)
motifs_search_results = motifs_search(cluster_candidate_motifs_results,
use_real_occurrences = FALSE, length_diff = +Inf,
worker_number = worker_number)
### plot FMD results (NB: no threshold of frequencies of motif found!)
pdf(paste0(name,'FMD_results.pdf'), height = 7, width = 17)
motifs_search_plot(motifs_search_results, ylab = 'x(t)', freq_threshold = 1,
transformed = probKMA_options$transformed)
dev.off()
save(find_candidate_motifs_results, silhouette_average, filter_candidate_motifs_results,
cluster_candidate_motifs_results, motifs_search_results,
file=paste0(name,'FMD_results.RData'))
pb$update(1.0)
return(list(find_candidate_motifs_results = find_candidate_motifs_results,
silhouette_average = silhouette_average,
filter_candidate_motifs_results = filter_candidate_motifs_results,
cluster_candidate_motifs_results = cluster_candidate_motifs_results,
motifs_search_results = motifs_search_results))
}else if(method=="FunBIalign")
{
if(length(funBIalign_options) != 3) {
stop('\'funBIalign_options\' must contain: \'portion_len\',\'min_card\',\'cut_off\'.')
}
portion_len <- funBIalign_options$portion_len
min_card <- funBIalign_options$min_card
cut_off <- funBIalign_options$cut_off
# check required input
if(missing(portion_len) || is.null(portion_len))
stop('portion_len attribute must be specified')
if(missing(min_card) || is.null(min_card))
stop('min_card attribute must be specified')
pb <- progress_bar$new(
format = "Progress [:bar] :percent | Elapsed: :elapsed | ETA: :eta",
total = 5,
width = 50
)
pb$update(0.1)
#compute maximum length
maxLen <- max(sapply(Y0, nrow))
full_data <- sapply(Y0, padding, maxLen,simplify = FALSE)
# transform list into a matrix where each curve is on a row
full_data <- t(sapply(full_data,cbind))
window_data <- NULL
list_of_recommendations_ordered <- NULL
vec_of_scores_ordered <- NULL
if(!file.exists(paste0(name,'/resFunBi.rds')))
{
dir.create(paste0(name),showWarnings=TRUE)
cppFunction('
Rcpp::List createWindow(Rcpp::NumericMatrix& data,
unsigned int portion_len,
unsigned int worker_number){
const unsigned int totdim = data.ncol();
const unsigned int totobs = data.nrow();
const unsigned int totrows = (totdim - portion_len + 1) * totobs;
const unsigned int totportion = totdim - portion_len + 1;
// set the size for data structure
const arma::mat dataRef(data.begin(), totobs, totdim,false,true);
Rcpp::NumericMatrix windowData(totrows,portion_len);
arma::mat windowDataRef(windowData.begin(),totrows,portion_len,false,true);
std::vector<std::string> window_rownames(totrows);
if(totobs <= 1) // we have a single curve
{
// fill in my data
#ifdef _OPENMP
#pragma omp parallel for num_threads(worker_number)
#endif
for(arma::uword i = 0; i < totrows; ++i)
{
windowDataRef.row(i) = dataRef.cols(i,i + portion_len - 1); //Each peace of curve is stored in a row
window_rownames[i] = "1_" + std::to_string(i + 1) + "_" + std::to_string(i + portion_len);
}
}
else // we have multiple curves
{
// fill in my data
#ifdef _OPENMP
#pragma omp parallel for collapse(2) num_threads(worker_number) schedule(static)
#endif
for(arma::uword k = 0; k < totobs; ++k)
{
for (arma::uword i = 0; i < totportion; ++i)
{
windowDataRef.row(totportion * k + i) = dataRef(k,arma::span(i,i + portion_len - 1));
window_rownames[i + k * totportion] = std::to_string(k+1) + "_" + std::to_string(i + 1) + "_" + std::to_string(i + portion_len);
}
}
}
return Rcpp::List::create(windowData,window_rownames);
}',depends = "RcppArmadillo")
# step 1
window_data_list <- createWindow(full_data,portion_len,worker_number)
window_data <- window_data_list[[1]]
rownames(window_data) <- window_data_list[[2]]
pb$update(0.3)
# set it as a matrix and omit rows with NAs
window_data <- na.omit(window_data)
# compute numerosity
numerosity <- rep(dim(full_data)[2] - portion_len + 1,times = dim(full_data)[1])
removed_rows <- setdiff(window_data_list[[2]],rownames(window_data))
curve_indices <- as.numeric(gsub("^(\\d+)_.*", "\\1", removed_rows))
tab <- table(curve_indices)
numerosity[as.numeric(names(tab))] <- numerosity[as.numeric(names(tab))] - tab
rm(window_data_list)
rm(removed_rows)
rm(tab)
cppFunction('
Rcpp::NumericMatrix createDistance(Rcpp::NumericMatrix& windowData,
const arma::vec& numerosity,
unsigned int worker_number){
int outrows = windowData.nrow();
int outcols = windowData.ncol();
bool isSingle = (numerosity.size() == 1);
double outcols_inv = 1.0 / static_cast<double>(outcols);
const arma::mat windowDataRef(windowData.begin(),outrows,outcols,false,true);
const arma::colvec& vsum = arma::sum(windowDataRef,1) * outcols_inv;
Rcpp::NumericMatrix result(outrows,outrows);
arma::mat scoreData(result.begin(),outrows,outrows,false,true);
#ifdef _OPENMP
#pragma omp parallel for num_threads(worker_number) schedule(static)
#endif
for (arma::uword i = 1; i < outrows; ++i) { // for any functional observation
const arma::rowvec& x_i = windowDataRef.row(i); // curve i
for (arma::uword j = 0; j < i; ++j) {
// Precompute common terms
const arma::rowvec& crossSum = x_i + windowDataRef.row(j);;
const double commonTerm = arma::accu(crossSum) * (outcols_inv * 0.5);
// fill the lower part
scoreData(i,j) = arma::accu(arma::square(
x_i - vsum[i] - crossSum/2
+ commonTerm))*outcols_inv;
}
}
// Modify accolites
double M = static_cast<double>(*std::max_element(scoreData.begin(),scoreData.end())) + 1000.0; // very large distance for accolites
int overlap = static_cast<int>(std::floor(outcols / 2.0)); // number of right/left accolites
// Single curve
if (isSingle) {
for (int i = 0; i < outrows-1; ++i) {
// check for right accolites
arma::uword start = i+1;
arma::uword end = std::min<int>(i+overlap,outrows-1);
scoreData(arma::span(start,end),i) += M;
}
// Multiple curves
} else {
int until_here = 0;
int numerosity_size = numerosity.size();
for(int j = 0; j <numerosity_size;++j)
{
int num = numerosity[j] - 1;
int loop_end = until_here + num;
#ifdef _OPENMP
#pragma omp parallel for num_threads(worker_number) schedule(static)
#endif
for(int i = until_here; i < loop_end ;++i)
{
arma::uword start = i+1;
arma::uword end = std::min<int>(std::min<int>(i+overlap ,i + (num--)),outrows-1);
scoreData(arma::span(start,end),i) += M;
}
until_here += numerosity[j];
}
}
return Rcpp::wrap(result);
}',depends = "RcppArmadillo")
# step 2: compute fMRS-based dissimilarity matrix
D_fmsr <- createDistance(window_data,numerosity,worker_number)
pb$update(0.5)
rownames(D_fmsr) <- colnames(D_fmsr) <- rownames(window_data)
## STEP 3 -----
# step 3: get the sub-trees (tree_s)
minidend <- get_minidend(as.dist(D_fmsr))
# step 3: identify seeds and corresponding families
all_paths <- get_path_complete(minidend, window_data, min_card = min_card,worker_number = worker_number)
all_paths <- all_paths[!(lapply(all_paths, is.null) %>% unlist())]
pb$update(0.65)
# step 3: get recommended nodes and their info (cardinality and score)
# collect all recommended nodes (as an array of portion ids)
all_recommended_labels <- lapply(all_paths, function(x){x$recommended_node_labels})
list_of_recommendations <- lapply(rapply(
all_recommended_labels, enquote, how='unlist'),
eval)
# get recommended node cardinality
vec_of_card <- lapply(list_of_recommendations, length) %>% unlist()
# get vector of adjusted fMSR
vec_of_scores <- lapply(all_paths, function(x){x$recommended_node_scores}) %>% unlist()
resFunBiMeta = list(window_data = window_data,
list_of_recommendations = list_of_recommendations,
vec_of_scores = vec_of_scores)
# save results
saveRDS(resFunBiMeta,file = paste0(name,'/resFunBi.rds'))
} else {
resFunBiMeta <- readRDS(paste0(name,'/resFunBi.rds'))
window_data <- resFunBiMeta$window_data
list_of_recommendations <- resFunBiMeta$list_of_recommendations
vec_of_scores <- resFunBiMeta$vec_of_scores
}
## STEP 4 -----
# STEP 4: post-processing and rearranging results using different criteria
pb$update(0.8)
if (stopCriterion == "Variance") {
motif_var <- lapply(list_of_recommendations,
function(x){
temp <- window_data[x,]
temp_mean <- temp %>% colMeans()
((temp - temp_mean)^2 %>% sum())/(nrow(temp))
}) %>% unlist()
var_order <- motif_var %>% order(decreasing = TRUE)
vec_of_scores_ordered <- vec_of_scores[var_order]
list_of_recommendations_ordered <- list_of_recommendations[var_order]
} else {
### CRITERION: adjusted fMSR ----
best_order <- vec_of_scores %>% order()
vec_of_scores_ordered <- vec_of_scores[best_order]
# ordered list_of_recommendations
list_of_recommendations_ordered <- list_of_recommendations[best_order]
}
#for every recommended motif, compute all its accolites
all_accolites <- lapply(list_of_recommendations_ordered,
function(x){
lapply(x, get_accolites, window_data, portion_len, FALSE) %>% unlist()
})
# Starting from the top, we compare each motif to those with higher rank. If all portions of
# are acolytes to portions of an higher ranking motif, we filter it out;
# otherwise we retain it.
# we identify the ones to delete
cppFunction('Rcpp::IntegerVector deleteV(const Rcpp::List& list_of_recommendations_ordered,
const Rcpp::List& all_accolites,
const Rcpp::NumericVector& vec_of_scores_ordered){
int n = list_of_recommendations_ordered.size();
Rcpp::IntegerVector del;
for (int i = 1; i < n; ++i) { // compare every motif (from the second one)
const Rcpp::CharacterVector& node_1 = list_of_recommendations_ordered[i];
for (int j = 0; j < i; ++j) { // to the other higher ranked nodes
const Rcpp::CharacterVector& node_2 = list_of_recommendations_ordered[j];
if (node_1.size() <= node_2.size()) {
const Rcpp::CharacterVector& accolites_2 = all_accolites[j];
if (vec_of_scores_ordered(i) > vec_of_scores_ordered(j) &&
Rcpp::is_true(Rcpp::all(Rcpp::in(node_1, accolites_2)))){
del.push_back(i+1);
break;
}
}
}
}
return del;
}',depends="RcppArmadillo")
delete <- deleteV(list_of_recommendations_ordered,
all_accolites,
vec_of_scores_ordered)
# we delete the recommended nodes and we order the remaining ones
list_of_recommendations_ordered <- list_of_recommendations_ordered[-delete]
vec_of_scores_ordered <- vec_of_scores_ordered[-delete]
if(!is.null(cut_off)) {
valid_pos <- which(vec_of_scores_ordered < cut_off)
vec_of_scores_ordered <- vec_of_scores_ordered[valid_pos]
list_of_recommendations_ordered <- list_of_recommendations_ordered[valid_pos]
if(is.null(vec_of_scores_ordered))
stop("the value of \'cut_off\' is to high. No motifs discovered. Provide a lower value")
}
pb$update(0.9)
# Creating some plot ----
## Plot the data and highlight the motif occurrences in red -----
if(plot)
{
pdf(paste0(name,"/plot_",stopCriterion,".pdf"),width=10,height=5)
layout(matrix(1:2,ncol=2,byrow=TRUE),widths=c(8.5,1))
for(q in 1:length(list_of_recommendations_ordered)){
temp_motif <- list_of_recommendations_ordered[[q]]
lots_in_motif <- lapply(temp_motif,
function(x){
strsplit(x, '_') %>%
unlist() %>%
as.numeric()}) %>%
unlist() %>%
matrix(ncol=3, byrow=T)
current_curves <- unique(as.numeric(gsub("[^0-9]", "", substr(temp_motif, 1, 2))))
title <- paste0("Number of instances: ", length(temp_motif),
" - adj fMSR: ", vec_of_scores_ordered[q] %>% round(3))
par(mar = c(5, 4, 2, 2) + 0.1)
matplot(t(full_data), ylab='', xlab='',
lwd=1.5,lty = 1, type = 'l', main = title,col=alpha('gray30',0.15))
matplot(matrix(full_data[current_curves,],ncol=length(current_curves),byrow = TRUE), ylab='', xlab='',
lwd=1.5,lty = 5, type = 'l', main = title,col=rainbow(length(current_curves)),add=TRUE)
rect(lots_in_motif[,2], min(full_data,na.rm = TRUE)-10, lots_in_motif[,3], max(full_data,na.rm = TRUE) +10,
border = alpha("firebrick3", 0.05), col = alpha("firebrick3", 0.05))
lapply(1:nrow(lots_in_motif), function(k) {
matplot(lots_in_motif[k, 2]:lots_in_motif[k, 3],
full_data[lots_in_motif[k, 1], lots_in_motif[k, 2]:lots_in_motif[k, 3]],
type = 'l', add = TRUE, col = 'red', lwd = 2.5)
#box(col = "grey40", lwd = 2)
#axis(1, at = seq(1, length(Y0), by = 12), col = "grey40", col.ticks = "grey40", col.axis = "grey60", cex.axis = 1.5)
#axis(2, col = "grey40", col.ticks = "grey40", col.axis = "grey60", cex.axis = 1.5)
})
par(mar=c(0,0,0,0))
plot.new()
legend("topright",
legend = c(paste0("c",current_curves),paste0('motif_',q)),
col = c(rainbow(length(current_curves)),'red'),lwd=c(rep(1,length(current_curves)),4), lty = 1,cex=0.75)
}
layout(matrix(1:2,ncol=2,byrow=TRUE),widths=c(5,1))
## Plot only the motif -----
for(q in 1:length(list_of_recommendations_ordered)){
par(mar = c(5, 4, 2, 2) + 0.1)
temp_motif <- list_of_recommendations_ordered[[q]]
current_curves <- as.numeric(gsub("[^0-9]", "", substr(temp_motif, 1, 2)))
plot_me <- t(window_data[temp_motif,])
motifMean <- rowMeans(plot_me)
title <- paste0("Number of occurrences: ", dim(plot_me)[2],
" - adj fMSR: ", vec_of_scores_ordered[q] %>% round(3))
matplot(plot_me, ylab='', xlab='',type='l',
col=seq_len(length(temp_motif))+1,lwd=2,lty=5, main = title)
lines(motifMean,col='black',lwd=5,lty=1)
par(mar=c(0,0,0,0))
plot.new()
legend('topright',
legend=c(paste0('c',current_curves),'motif center'),
col=c(seq_len(length(temp_motif))+1,'black'),lwd=c(rep(2,length(temp_motif)),5),lty=c(rep(5,length(temp_motif)),1),
xpd=TRUE,cex=0.75)
#box(col="grey40", lwd = 2)
#axis(1, at=1:portion_len, labels = 1:portion_len, col="grey40", col.ticks="grey40", col.axis="grey60", cex.axis=1.5)
#axis(2, col="grey40", col.ticks="grey40", col.axis="grey60", cex.axis=1.5)
}
dev.off()
}
pb$update(1.0)
resFunBi = list(list_of_recommendations_ordered = list_of_recommendations_ordered,
vec_of_scores_ordered = vec_of_scores_ordered)
return(resFunBi)
} else {
stop('\'method\' not found. It must be choosen between \'probKMA\' and \'funBIalign\'')
}
}
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.