View source: R/discoverMotifs.R
| discoverMotifs | R Documentation |
The 'discoverMotifs' function facilitates the discovery of recurring patterns, or motifs, within functional data by employing two sophisticated algorithms:
ProbKMA (Probabilistic K-means with Local Alignment) and funBIalign.
These algorithms are designed to identify and cluster functional motifs across multiple curves, leveraging advanced clustering and alignment techniques to handle complex data structures.
ProbKMA integrates probabilistic clustering with local alignment strategies, enabling the detection of motifs that exhibit variability in both shape and position across different curves.
This method is particularly adept at handling noisy data and motifs that may appear at varying scales or locations within the curves.
On the other hand, funBIalign utilizes hierarchical clustering based on mean squared residue scores to uncover motifs.
This approach effectively captures the additive nature of functional motifs, considering both portion-specific adjustments and time-varying components to accurately identify recurring patterns.
'discoverMotifs' exposes both algorithms through a single interface.
Users can choose the probabilistic, alignment-focused ProbKMA or the hierarchical, residue-based funBIalign according to their data characteristics and analytical requirements.
discoverMotifs(
Y0,
method,
stopCriterion,
name,
plot,
probKMA_options = list(),
funBIalign_options = list(portion_len = NULL, min_card = NULL, cut_off = NULL),
worker_number = NULL
)
| Y0 | A list containing N vectors (for univariate curves) or N matrices (for multivariate curves) representing the functional data. |
| method | A character string specifying the motif discovery algorithm to use. Acceptable values are "ProbKMA" for Probabilistic K-means with Local Alignment and "funBIalign" for Functional Bi-directional Alignment. |
| stopCriterion | A character string indicating the convergence criterion for the selected algorithm. |
| name | A character string specifying the name of the output directory where results will be saved. |
| plot | A logical value indicating whether to generate and save plots of the discovered motifs and clustering results. |
| probKMA_options | A list of options specific to the ProbKMA algorithm. |
| funBIalign_options | A list of options specific to the funBIalign algorithm. |
| worker_number | An integer specifying the number of CPU cores to utilize for parallel computations. |
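As a rough illustration of the shape expected for Y0, the sketch below builds toy inputs in base R. The object names, the helper is_valid_Y0, and the orientation of the matrices (rows as grid points, columns as dimensions) are assumptions made for this example, not part of the package.

```r
# Hypothetical toy data: three univariate curves of different lengths
set.seed(1)
Y0_univariate <- list(
  sin(seq(0, 2 * pi, length.out = 100)),
  cos(seq(0, 2 * pi, length.out = 120)),
  rnorm(80)
)

# Hypothetical toy data: two bivariate curves
# (assumed layout: rows = grid points, columns = dimensions)
Y0_multivariate <- list(
  cbind(sin(seq(0, 1, length.out = 50)), cos(seq(0, 1, length.out = 50))),
  cbind(rnorm(60), rnorm(60))
)

# Illustrative structural check (not a package function):
# every element must be a numeric vector or a numeric matrix
is_valid_Y0 <- function(Y0) {
  is.list(Y0) && length(Y0) > 0 &&
    all(vapply(
      Y0,
      function(y) is.numeric(y) && (is.vector(y) || is.matrix(y)),
      logical(1)
    ))
}

is_valid_Y0(Y0_univariate)
is_valid_Y0(Y0_multivariate)
```

A check of this kind can catch malformed inputs, such as a data frame or a character vector, before a long-running call is started.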
The 'discoverMotifs' function dynamically switches between two advanced motif discovery algorithms based on the user’s specification. Each algorithm employs distinct strategies to identify and cluster motifs within functional data, offering flexibility and adaptability to various analytical scenarios.
A list containing the discovered motifs and their corresponding statistics, tailored to the selected method:
motifs: A list of identified motifs, each containing the motif's representative curve, membership probabilities, and alignment information.
statistics: Detailed statistics for each motif, including measures such as silhouette scores, variance explained, and other relevant metrics that quantify the quality and significance of the discovered motifs.
parameters: The final parameters and configurations used during the motif discovery process, providing transparency and facilitating reproducibility of the results.
plots: If plot = TRUE, this component contains the generated plots visualizing the motifs and their distribution across the functional data.
ProbKMA is inspired by methodologies prevalent in bioinformatics, particularly those involving local alignment techniques extended from high-similarity seeds.
This algorithm combines fuzzy clustering approaches with local alignment strategies to effectively minimize a generalized least squares functional.
The minimization process can incorporate both the levels and derivatives of the curves through a Sobolev-based distance metric, enhancing the algorithm's sensitivity to both shape and rate changes in the data.
Throughout its iterative process, ProbKMA refines motif centers, membership probabilities, and alignment shifts, making it highly effective for capturing complex motif structures and motifs distributed across multiple curves.
This ensures that the discovered motifs are both representative and robust against variations and noise within the functional data.
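The Sobolev-type distance mentioned above can be pictured as a weighted mix of a level term and a derivative term. The following is a minimal sketch, assuming two equal-length segments sampled on the same uniform grid and a convex combination with weight alpha; the function name sobolev_diss and the use of a plain mean of squared differences are illustrative choices, not the package's internal implementation.

```r
# Illustrative Sobolev-type dissimilarity (not the package's internal code).
# y0, v0: levels of a curve segment and of a motif on the same grid
# y1, v1: their derivatives (only needed when alpha > 0)
sobolev_diss <- function(y0, v0, y1 = NULL, v1 = NULL, alpha = 0) {
  stopifnot(length(y0) == length(v0), alpha >= 0, alpha <= 1)
  d0 <- mean((y0 - v0)^2)            # level term, in the spirit of d0_L2
  if (alpha == 0) return(d0)
  stopifnot(!is.null(y1), !is.null(v1), length(y1) == length(v1))
  d1 <- mean((y1 - v1)^2)            # derivative term, in the spirit of d1_L2
  (1 - alpha) * d0 + alpha * d1      # alpha = 0: levels only; alpha = 1: derivatives only
}

# Two segments with the same shape but a vertical offset of 0.1
x  <- seq(0, 1, length.out = 101)
y0 <- sin(2 * pi * x)
y1 <- 2 * pi * cos(2 * pi * x)
v0 <- y0 + 0.1
v1 <- y1

d_levels <- sobolev_diss(y0, v0, alpha = 0)          # sensitive to the offset
d_mixed  <- sobolev_diss(y0, v0, y1, v1, alpha = 0.5)
d_deriv  <- sobolev_diss(y0, v0, y1, v1, alpha = 1)  # ignores the offset
```

In this toy case the derivative-only distance vanishes because a vertical shift does not change the derivative, which is why mixing the two terms changes how strongly offsets are penalized.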
funBIalign models functional motifs as an additive combination of motif means, portion-specific adjustments, and time-varying components.
The algorithm constructs a hierarchical dendrogram utilizing the generalized mean squared residue score (fMSR) to identify candidate motifs across curves.
A critical aspect of funBIalign is its post-processing step, which filters out redundant motifs and refines the final selection to ensure that only the most significant and representative motifs are retained.
This hierarchical approach allows for a nuanced identification of motifs, capturing both broad and subtle patterns within the data.
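To make the residue-based score concrete, the sketch below computes the classical mean squared residue of a matrix whose rows are curve portions and whose columns are grid points. It is a discrete stand-in for the functional score described above, written under that assumption; the name msr_score is hypothetical and the code is not taken from the package.

```r
# Illustrative mean squared residue of a portions-by-time matrix
# (a discrete analogue of the score described above; not package code).
msr_score <- function(X) {
  stopifnot(is.matrix(X), is.numeric(X))
  # residue_ij = x_ij - row mean_i - column mean_j + overall mean
  residue <- sweep(sweep(X, 1, rowMeans(X)), 2, colMeans(X)) + mean(X)
  mean(residue^2)
}

# A perfectly additive set of portions: common shape plus a per-portion shift
shape    <- sin(seq(0, pi, length.out = 20))
additive <- outer(c(0, 1, 3), shape, "+")

# The same portions with noise added
set.seed(42)
noisy <- additive + matrix(rnorm(length(additive), sd = 0.2), nrow = nrow(additive))

score_additive <- msr_score(additive)  # numerically zero
score_noisy    <- msr_score(noisy)     # strictly positive
```

Portions that differ only by a constant shift score essentially zero, which matches the additive view of a motif: a shared time-varying shape plus portion-specific adjustments.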
The following parameters are common to both ProbKMA and funBIalign algorithms:
Y0: A list containing N vectors (for univariate curves) or N matrices (for multivariate curves) representing the functional data. Each curve is evaluated on a uniform grid, ensuring consistency across the dataset.
method: A character string specifying the motif discovery algorithm to use. Acceptable values are "ProbKMA" for Probabilistic K-means with Local Alignment and "funBIalign" for Functional Bi-directional Alignment.
stopCriterion: A character string indicating the convergence criterion for the selected algorithm. For ProbKMA, options include "max", "mean", or "quantile" based on the Bhattacharyya distance between memberships in successive iterations. For funBIalign, options are "fMSR" (functional Mean Squared Residue) or "Variance" to guide the ranking of motifs.
name: A character string specifying the name of the output directory where results will be saved. This facilitates organized storage and easy retrieval of analysis results.
plot: A logical value indicating whether to generate and save plots of the discovered motifs and clustering results. When set to TRUE, visualizations are produced to aid in the qualitative assessment of the motif discovery process.
worker_number: An integer specifying the number of CPU cores to utilize for parallel computations. By default, the function uses the total number of available cores minus one, optimizing computational efficiency without overloading the system.
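The ProbKMA stop criteria compare membership matrices from successive iterations. A minimal sketch of that idea follows, assuming each row of the membership matrix holds one curve's probabilities over the K clusters and sums to one. The function name bhattacharyya_change and its arguments are hypothetical, and the package's own computation may differ in detail.

```r
# Illustrative convergence measure between two membership matrices (N x K).
# Assumption: rows are curves, columns are clusters, rows sum to 1.
bhattacharyya_change <- function(P_old, P_new,
                                 criterion = c("max", "mean", "quantile"),
                                 prob = 0.25) {
  criterion <- match.arg(criterion)
  stopifnot(identical(dim(P_old), dim(P_new)))
  bc <- rowSums(sqrt(P_old * P_new))   # Bhattacharyya coefficient per curve
  d  <- -log(pmin(bc, 1))              # distance; pmin guards against rounding above 1
  switch(criterion,
    max      = max(d),
    mean     = mean(d),
    quantile = unname(stats::quantile(d, prob))
  )
}

P_old <- rbind(c(0.5, 0.5), c(0.9, 0.1), c(0.2, 0.8))
P_new <- rbind(c(0.6, 0.4), c(0.9, 0.1), c(0.1, 0.9))

change_max  <- bhattacharyya_change(P_old, P_new, "max")
change_mean <- bhattacharyya_change(P_old, P_new, "mean")
no_change   <- bhattacharyya_change(P_old, P_old, "max")

# An iteration loop would stop once the chosen summary falls below a tolerance
converged <- change_max < 1e-8
```

The "max" summary is the strictest of the three, since every curve's memberships must have stabilized before it drops below the tolerance.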
The following parameters are specific to the ProbKMA algorithm:
K: An integer or vector specifying the number of motifs to be discovered. It can be a single integer for uniform motif discovery or a vector for specifying different numbers of motifs.
c: An integer or vector indicating the minimum motif lengths. This ensures that each discovered motif meets a specified minimum length requirement, maintaining the integrity of motif structures.
c_max: An integer or vector specifying the maximum motif lengths, allowing control over the upper bounds of motif sizes to prevent excessively long motifs.
diss: A character string defining the dissimilarity measure to use. Possible values include "d0_L2", "d1_L2", and "d0_d1_L2", which determine how the algorithm quantifies differences between motifs based on level and derivative information.
alpha: A numeric value between 0 and 1 that serves as a weight parameter between d0_L2 and d1_L2 when using d0_d1_L2. An alpha of 0 emphasizes d0_L2, while an alpha of 1 emphasizes d1_L2, allowing for balanced consideration of both metrics.
w: A numeric vector specifying the weight for the dissimilarity index across different dimensions. All values must be positive, enabling the algorithm to prioritize certain dimensions over others based on their relative importance.
m: A numeric value greater than 1 that acts as the weighting exponent in the least-squares functional. This parameter influences the sensitivity of the algorithm to differences in motif alignment and membership probabilities.
iter_max: An integer specifying the maximum number of iterations allowed for the algorithm to converge. This prevents excessive computation time by limiting the number of optimization steps.
quantile: A numeric value representing the quantile probability used when stopCriterion is set to "quantile". This determines the threshold for convergence based on the distribution of Bhattacharyya distances.
tol: A numeric value specifying the tolerance level for convergence. The algorithm stops iterating if the change in the stop criterion falls below this threshold, ensuring precise and stable convergence.
iter4elong: An integer indicating the number of iterations after which motif elongation is performed. If set to a value greater than iter_max, no elongation is performed. Motif elongation allows the algorithm to extend motifs to better fit the data.
tol4elong: A numeric value defining the tolerance on the Bhattacharyya distance for motif elongation. This parameter controls how much the objective function can increase during elongation, ensuring that motif extensions do not degrade the overall fit.
max_elong: A numeric value representing the maximum elongation allowed in a single iteration, expressed as a percentage of the motif length. This prevents excessive extension of motifs in any single step.
trials_elong: An integer specifying the number of elongation trials (equispaced) on each side of the motif in a single iteration. Multiple trials enhance the robustness of motif elongation by exploring various extension possibilities.
deltaJK_elong: A numeric value indicating the maximum relative increase in the objective function permitted during motif elongation. This ensures that elongation steps contribute positively to the motif fitting process.
max_gap: A numeric value defining the maximum gap allowed in each alignment as a percentage of the motif length. This parameter controls the allowable discontinuity between aligned motifs, maintaining coherence in motif placement.
iter4clean: An integer specifying the number of iterations after which motif cleaning is performed. If set to a value greater than iter_max, no cleaning is performed. Motif cleaning removes redundant or poorly fitting motifs to refine the final motif set.
tol4clean: A numeric value representing the tolerance on the Bhattacharyya distance for motif cleaning. This parameter determines the threshold for identifying and removing redundant motifs during the cleaning process.
quantile4clean: A numeric value specifying the dissimilarity quantile used for motif cleaning. This quantile determines which motifs are considered sufficiently dissimilar to be retained in the final set.
return_options: A logical value indicating whether to return the options passed to the ProbKMA method. When set to TRUE, users receive detailed information about the algorithm's configuration, facilitating transparency and reproducibility.
Y1: A list of derivative curves used if the dissimilarity measure "d0_d1_L2" is selected. These derivatives enhance the algorithm's ability to capture both shape and rate changes in the functional data.
P0: An initial membership matrix (N x K), where N is the number of curves and K is the number of clusters. If set to NULL, a random matrix is generated, initiating the probabilistic clustering process.
S0: An initial shift warping matrix (N x K). If set to NULL, a random matrix is generated to initialize the alignment process, allowing motifs to adapt to variations in the data.
n_subcurves: An integer specifying the number of splitting subcurves used when the number of curves is equal to one. This parameter allows the algorithm to handle single-curve datasets by dividing them into manageable segments for motif discovery.
sil_threshold: A numeric value representing the threshold to filter candidate motifs based on their silhouette scores. This ensures that only motifs with sufficient clustering quality are retained in the final results.
set_seed: A logical value indicating whether to set a random seed for reproducibility. When set to TRUE, the function initializes the random number generator to ensure consistent results across multiple runs.
seed: An integer specifying the random seed used for initialization when set_seed is TRUE. This parameter guarantees reproducibility of the clustering and alignment processes.
exe_print: A logical value determining whether to print execution details for each iteration. When set to TRUE, users receive real-time feedback on the algorithm's progress, aiding in monitoring and debugging.
V_init: A list of motif sets used as specific initializations for clustering instead of random ones. If 'n_init' is greater than the number of motif sets given in 'V_init', the remaining initializations are generated randomly. For example, if 'n_init = 10' but only 5 motif sets are given in 'V_init', the algorithm uses these 5 initializations and generates 5 more at random.
transformed: A logical value indicating whether to normalize the curve segments to the interval [0,1] before applying the dissimilarity measure. Setting 'transformed = TRUE' scales each curve segment between 0 and 1, so that motif occurrences with the same shape but different amplitudes can be recognized as the same motif.
n_init_motif: The number of initial motif sets from 'V_init' to be used directly as starting points in clustering. If 'n_init_motif' is larger than the number of motif sets provided in 'V_init', the missing initializations are generated randomly. For example, if 'n_init = 10' and 'n_init_motif = 5' with only 3 motif sets in 'V_init', the algorithm will use these 3 sets and generate 7 additional random initializations.
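The effect of the 'transformed' option can be illustrated with a simple min-max rescaling of a curve segment. This is a sketch of the idea only: the helper name minmax_segment and the convention of mapping a constant segment to zeros are assumptions made for this example, not the package's behavior.

```r
# Illustrative min-max rescaling of one curve segment to [0, 1]
# (hypothetical helper; constant segments are mapped to 0 by convention here).
minmax_segment <- function(y) {
  rng <- range(y)
  if (rng[1] == rng[2]) return(rep(0, length(y)))
  (y - rng[1]) / (rng[2] - rng[1])
}

# Two occurrences of the same shape with different amplitude and offset
x        <- seq(0, 1, length.out = 50)
small    <- sin(pi * x)
large    <- 5 * sin(pi * x) + 2

# Raw segments are far apart, rescaled segments coincide
raw_gap    <- mean((small - large)^2)
scaled_gap <- mean((minmax_segment(small) - minmax_segment(large))^2)
```

After rescaling, the two occurrences are indistinguishable, which is the situation in which normalizing before computing the dissimilarity helps group them under one motif.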
The following parameters are specific to the funBIalign algorithm:
portion_len: An integer specifying the length of curve portions to align. This parameter controls the granularity of alignment, allowing the algorithm to focus on specific segments of the curves for motif discovery.
min_card: An integer representing the minimum cardinality of motifs, i.e., the minimum number of motif occurrences required for a motif to be considered valid. This ensures that only motifs with sufficient representation across the dataset are retained.
cut_off: A double that specifies the number of top-ranked motifs to keep based on the ranking criteria, facilitating focused visualization of the most significant motifs. In particular, all motifs that rank below the cut_off are retained.
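To give a sense of what portion_len controls, the sketch below cuts a single curve into overlapping portions of a fixed length. It assumes sliding windows that advance one grid point at a time, which is an assumption for illustration; the helper extract_portions is hypothetical and does not reproduce the package's internals.

```r
# Illustrative extraction of all portions of length portion_len from one curve,
# assuming windows that slide by one grid point (hypothetical helper).
extract_portions <- function(y, portion_len) {
  n <- length(y)
  stopifnot(portion_len >= 1, portion_len <= n)
  starts <- seq_len(n - portion_len + 1)
  # index matrix: one row per portion, one column per position in the portion
  idx <- outer(starts, seq_len(portion_len) - 1, "+")
  matrix(y[idx], nrow = length(starts))
}

y        <- as.numeric(1:10)
portions <- extract_portions(y, portion_len = 4)

dim(portions)     # one row per window, portion_len columns
portions[1, ]     # first window
```

Under this assumption a curve of length n yields n - portion_len + 1 candidate portions, so longer portions give fewer but more specific candidates.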
ProbKMA:
Probabilistic K-means with Local Alignment
funBIalign:
Hierarchical Clustering with Mean Squared Residue Scores.
# Example 1: Discover motifs using ProbKMA
if (requireNamespace("RcppArmadillo", quietly = TRUE)) {
  # Define dissimilarity measure and weight parameter
  diss <- 'd0_d1_L2'
  alpha <- 0.5
  # Define number of motifs and their minimum lengths
  K <- c(2, 3)
  c <- c(61, 51)
  n_init <- 3
  # Load simulated data
  data("simulated200", package = "funMoDisco")
  # Write results to a temporary directory
  tmp_path <- tempdir()
  # Perform motif discovery using ProbKMA
  results <- funMoDisco::discoverMotifs(
    Y0 = simulated200$Y0,
    method = "ProbKMA",
    stopCriterion = "max",
    name = tmp_path,
    plot = FALSE,
    probKMA_options = list(
      Y1 = simulated200$Y1,
      K = K,
      c = c,
      n_init = n_init,
      diss = diss,
      alpha = alpha
    ),
    worker_number = 1
  )
  # Example 2: Discover motifs using funBIalign
  results_funbialign <- funMoDisco::discoverMotifs(
    Y0 = simulated200$Y0,
    method = "funBIalign",
    stopCriterion = "fMSR",
    name = tmp_path,
    plot = FALSE,
    funBIalign_options = list(
      portion_len = 60,
      min_card = 3,
      cut_off = 1.0
    ),
    worker_number = 1
  )
}