discoverMotifs: Functional Motif Discovery

View source: R/discoverMotifs.R

discoverMotifs    R Documentation

Functional Motif Discovery

Description

The 'discoverMotifs' function discovers recurring patterns, or motifs, in functional data using one of two algorithms: ProbKMA (Probabilistic K-means with Local Alignment) or funBIalign. Both algorithms identify and cluster functional motifs across multiple curves by combining clustering with alignment.

ProbKMA integrates probabilistic clustering with local alignment strategies, enabling the detection of motifs that exhibit variability in both shape and position across different curves. This method is particularly adept at handling noisy data and motifs that may appear at varying scales or locations within the curves.

On the other hand, funBIalign utilizes hierarchical clustering based on mean squared residue scores to uncover motifs. This approach effectively captures the additive nature of functional motifs, considering both portion-specific adjustments and time-varying components to accurately identify recurring patterns.

Because both clustering paradigms are exposed through a single interface, users can choose the method that suits their data: the probabilistic, alignment-based ProbKMA or the hierarchical, residue-based funBIalign.

Usage

discoverMotifs(
  Y0,
  method,
  stopCriterion,
  name,
  plot,
  probKMA_options = list(),
  funBIalign_options = list(portion_len = NULL, min_card = NULL, cut_off = NULL),
  worker_number = NULL
)

Arguments

Y0

A list containing N vectors (for univariate curves) or N matrices (for multivariate curves) representing the functional data.

method

A character string specifying the motif discovery algorithm to use. Acceptable values are "ProbKMA" for Probabilistic K-means with Local Alignment and "funBIalign" for Functional Bi-directional Alignment.

stopCriterion

A character string indicating the convergence criterion for the selected algorithm.

name

A character string specifying the name of the output directory where results will be saved.

plot

A logical value indicating whether to generate and save plots of the discovered motifs and clustering results.

probKMA_options

A list of options specific to the ProbKMA algorithm.

funBIalign_options

A list of options specific to the funBIalign algorithm.

worker_number

An integer specifying the number of CPU cores to utilize for parallel computations.

Details

The 'discoverMotifs' function dispatches to one of two motif discovery algorithms according to the 'method' argument. Each algorithm uses a different strategy to identify and cluster motifs within functional data.

Value

A list containing the discovered motifs and their corresponding statistics, tailored to the selected method:

motifs

A list of identified motifs, each containing the motif's representative curve, membership probabilities, and alignment information.

statistics

Detailed statistics for each motif, including measures such as silhouette scores, variance explained, and other relevant metrics that quantify the quality and significance of the discovered motifs.

parameters

The final parameters and configurations used during the motif discovery process, providing transparency and facilitating reproducibility of the results.

plots

If plot = TRUE, this component contains the generated plots visualizing the motifs and their distribution across the functional data.

Theoretical Background for ProbKMA

ProbKMA is inspired by methodologies prevalent in bioinformatics, particularly those involving local alignment techniques extended from high-similarity seeds. This algorithm combines fuzzy clustering approaches with local alignment strategies to effectively minimize a generalized least squares functional. The minimization process can incorporate both the levels and derivatives of the curves through a Sobolev-based distance metric, enhancing the algorithm's sensitivity to both shape and rate changes in the data.

Throughout its iterative process, ProbKMA refines motif centers, membership probabilities, and alignment shifts, making it highly effective for capturing complex motif structures and motifs distributed across multiple curves. This ensures that the discovered motifs are both representative and robust against variations and noise within the functional data.

Theoretical Background for funBIalign

funBIalign models functional motifs as an additive combination of motif means, portion-specific adjustments, and time-varying components. The algorithm constructs a hierarchical dendrogram utilizing the generalized mean squared residue score (fMSR) to identify candidate motifs across curves.

A critical aspect of funBIalign is its post-processing step, which filters out redundant motifs and refines the final selection to ensure that only the most significant and representative motifs are retained. This hierarchical approach allows for a nuanced identification of motifs, capturing both broad and subtle patterns within the data.
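To make the mean squared residue idea concrete, the following base R sketch computes a classical (Cheng and Church style) mean squared residue on a matrix whose rows are aligned curve portions. It is an illustration under stated assumptions, not the package's fMSR implementation: the function name msr and the toy data are hypothetical. A set of portions that differ only by a portion-specific vertical shift scores (numerically) zero, which is the additive structure described above.

```r
# Classical mean squared residue of a matrix X (rows = portions, columns = grid points).
# Hypothetical helper for illustration; not the funMoDisco implementation of fMSR.
msr <- function(X) {
  # Residue: entry minus its row mean, minus its column mean, plus the overall mean
  res <- X - outer(rowMeans(X), colMeans(X), "+") + mean(X)
  mean(res^2)
}

grid <- seq(0, 1, length.out = 60)
shape <- sin(2 * pi * grid)          # common time-varying component
shifts <- c(-1, 0, 2.5)              # portion-specific adjustments

# Purely additive motif: every portion is the same shape plus its own shift
X_additive <- outer(shifts, shape, "+")

# Same portions with noise added, breaking the additive structure
set.seed(42)
X_noisy <- X_additive + matrix(rnorm(length(X_additive), sd = 0.2), nrow = 3)

msr_additive <- msr(X_additive)      # zero up to floating-point error
msr_noisy <- msr(X_noisy)            # strictly larger
```

Lower scores indicate portions that fit the additive model better, which is why a residue-based score can rank candidate motifs.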

Common Parameters

The following parameters are common to both ProbKMA and funBIalign algorithms:

Y0

A list containing N vectors (for univariate curves) or N matrices (for multivariate curves) representing the functional data. Each curve is evaluated on a uniform grid, ensuring consistency across the dataset.
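As a minimal sketch of the expected input shape, the base R code below builds toy Y0 lists for the univariate and multivariate cases. The data are made up, and the layout of the matrices (one row per grid point, one column per dimension) is an assumption that is not stated on this page.

```r
set.seed(1)
grid <- seq(0, 2 * pi, length.out = 100)

# Univariate case: a list of N = 3 numeric vectors, one per curve
Y0_univariate <- list(
  sin(grid),
  cos(grid),
  sin(grid) + rnorm(length(grid), sd = 0.1)
)

# Multivariate case: a list of N = 3 matrices.
# Assumed layout: rows are grid points, columns are the d = 2 dimensions.
Y0_multivariate <- lapply(1:3, function(i) {
  cbind(sin(grid + i), cos(grid + i))
})
```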

method

A character string specifying the motif discovery algorithm to use. Acceptable values are "ProbKMA" for Probabilistic K-means with Local Alignment and "funBIalign" for Functional Bi-directional Alignment.

stopCriterion

A character string indicating the convergence criterion for the selected algorithm. For ProbKMA, options include "max", "mean", or "quantile" based on the Bhattacharyya distance between memberships in successive iterations. For funBIalign, options are "fMRS" (functional Mean Squared Residue) or "Variance" to guide the ranking of motifs.
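The ProbKMA criteria can be illustrated with a short base R sketch. It assumes the standard Bhattacharyya distance between two discrete distributions, computed per curve on the membership rows, and then summarised by "max", "mean", or "quantile". The matrices, the helper summarise_stop, and the tolerance value are hypothetical; the package internals may differ.

```r
# Hypothetical memberships (N = 4 curves, K = 2 motifs) at two successive iterations.
# Each row sums to 1.
P_old <- matrix(c(0.90, 0.10,
                  0.20, 0.80,
                  0.50, 0.50,
                  0.70, 0.30), ncol = 2, byrow = TRUE)
P_new <- matrix(c(0.95, 0.05,
                  0.10, 0.90,
                  0.50, 0.50,
                  0.60, 0.40), ncol = 2, byrow = TRUE)

# Bhattacharyya distance per curve: -log of the Bhattacharyya coefficient
bc_dist <- -log(rowSums(sqrt(P_old * P_new)))

# Summarise the per-curve distances according to the chosen criterion
summarise_stop <- function(d, stopCriterion, prob = 0.25) {
  switch(stopCriterion,
         max = max(d),
         mean = mean(d),
         quantile = unname(quantile(d, probs = prob)),
         stop("unknown stopCriterion"))
}

tol <- 1e-8                                    # illustrative tolerance
converged <- summarise_stop(bc_dist, "max") < tol
```

The "max" criterion is the strictest of the three, since it requires every curve's memberships to have stabilised.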

name

A character string specifying the name of the output directory where results will be saved. This facilitates organized storage and easy retrieval of analysis results.

plot

A logical value indicating whether to generate and save plots of the discovered motifs and clustering results. When set to TRUE, visualizations are produced to aid in the qualitative assessment of the motif discovery process.

worker_number

An integer specifying the number of CPU cores to utilize for parallel computations. By default, the function uses the total number of available cores minus one, optimizing computational efficiency without overloading the system.
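The default described above can be reproduced explicitly with the base 'parallel' package, for example to pass the same value to other code. This is a sketch of the rule as stated, not the package's internal code; the fallback to one worker is an added safeguard.

```r
library(parallel)

available <- detectCores()

# detectCores() may return NA on some platforms; fall back to a single worker.
# Otherwise leave one core free, but never go below one worker.
worker_number <- if (is.na(available)) 1L else max(1L, available - 1L)
```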

ProbKMA Options

The following parameters are specific to the ProbKMA algorithm:

K

An integer or vector specifying the number of motifs to be discovered. It can be a single integer for uniform motif discovery or a vector for specifying different numbers of motifs.

c

An integer or vector indicating the minimum motif lengths. This ensures that each discovered motif meets a specified minimum length requirement, maintaining the integrity of motif structures.

c_max

An integer or vector specifying the maximum motif lengths, allowing control over the upper bounds of motif sizes to prevent excessively long motifs.

diss

A character string defining the dissimilarity measure to use. Possible values include "d0_L2", "d1_L2", and "d0_d1_L2", which determine how the algorithm quantifies differences between motifs based on level and derivative information.

alpha

A numeric value between 0 and 1 that weights d0_L2 against d1_L2 when diss is "d0_d1_L2". With alpha = 0 the dissimilarity reduces to d0_L2 (levels only); with alpha = 1 it reduces to d1_L2 (derivatives only); intermediate values blend the two.
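The role of alpha can be sketched in base R. The code assumes the combined dissimilarity is the convex combination (1 - alpha) * d0_L2 + alpha * d1_L2, which matches the two endpoints described above, and uses a grid-averaged squared L2 distance as a stand-in. The helper names l2_sq and sobolev_diss are hypothetical, and the exact normalisation used by the package may differ.

```r
# Squared L2 distance between two equal-length discretised portions,
# averaged over the grid (assumed normalisation)
l2_sq <- function(a, b) mean((a - b)^2)

# Hypothetical blend of level (d0) and derivative (d1) distances
sobolev_diss <- function(y0, v0, y1, v1, alpha) {
  stopifnot(alpha >= 0, alpha <= 1)
  (1 - alpha) * l2_sq(y0, v0) + alpha * l2_sq(y1, v1)
}

x <- seq(0, 1, length.out = 101)
y0 <- sin(2 * pi * x)                 # curve levels
y1 <- 2 * pi * cos(2 * pi * x)        # curve derivative
v0 <- y0 + 0.1                        # motif: same shape, shifted up by 0.1
v1 <- y1                              # a vertical shift leaves the derivative unchanged

d_levels <- sobolev_diss(y0, v0, y1, v1, alpha = 0)    # sees the shift
d_derivs <- sobolev_diss(y0, v0, y1, v1, alpha = 1)    # blind to the shift
d_blend  <- sobolev_diss(y0, v0, y1, v1, alpha = 0.5)  # halfway between
```

In this toy case a larger alpha makes the dissimilarity less sensitive to vertical offsets and more sensitive to differences in slope.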

w

A numeric vector specifying the weight for the dissimilarity index across different dimensions. All values must be positive, enabling the algorithm to prioritize certain dimensions over others based on their relative importance.

m

A numeric value greater than 1 that acts as the weighting exponent in the least-squares functional. It controls how fuzzy the membership probabilities are.
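To give a feel for the weighting exponent, the sketch below uses the textbook fuzzy c-means membership update, in which memberships are proportional to the dissimilarity raised to the power -1/(m - 1). This is an assumption made for illustration; whether ProbKMA uses exactly this update is not stated on this page. The helper fuzzy_memberships and the dissimilarity matrix are hypothetical.

```r
# Textbook fuzzy c-means memberships from a matrix of positive dissimilarities
# D (rows = curves, columns = motifs). Illustrative only.
fuzzy_memberships <- function(D, m) {
  stopifnot(m > 1, all(D > 0))
  W <- D^(-1 / (m - 1))
  W / rowSums(W)               # normalise so that each row sums to 1
}

D <- matrix(c(0.1, 0.4,
              0.3, 0.3), ncol = 2, byrow = TRUE)

P_soft  <- fuzzy_memberships(D, m = 2)    # softer memberships
P_sharp <- fuzzy_memberships(D, m = 1.2)  # close to a hard assignment
```

Values of m close to 1 push memberships toward 0 or 1, while larger values spread them more evenly across motifs.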

iter_max

An integer specifying the maximum number of iterations allowed for the algorithm to converge. This prevents excessive computation time by limiting the number of optimization steps.

quantile

A numeric value representing the quantile probability used when stopCriterion is set to "quantile". This determines the threshold for convergence based on the distribution of Bhattacharyya distances.

tol

A numeric value specifying the tolerance level for convergence. The algorithm stops iterating if the change in the stop criterion falls below this threshold, ensuring precise and stable convergence.

iter4elong

An integer indicating the number of iterations after which motif elongation is performed. If set to a value greater than iter_max, no elongation is performed. Motif elongation allows the algorithm to extend motifs to better fit the data.

tol4elong

A numeric value defining the tolerance on the Bhattacharyya distance for motif elongation. Elongation is attempted only once the Bhattacharyya distance between memberships in successive iterations, summarised according to stopCriterion, falls below this tolerance.

max_elong

A numeric value representing the maximum elongation allowed in a single iteration, expressed as a percentage of the motif length. This prevents excessive extension of motifs in any single step.

trials_elong

An integer specifying the number of elongation trials (equispaced) on each side of the motif in a single iteration. Multiple trials enhance the robustness of motif elongation by exploring various extension possibilities.

deltaJK_elong

A numeric value indicating the maximum relative increase in the objective function permitted during motif elongation. An elongation that increases the objective function by more than this relative amount is rejected.

max_gap

A numeric value defining the maximum gap allowed in each alignment as a percentage of the motif length. This parameter controls the allowable discontinuity between aligned motifs, maintaining coherence in motif placement.

iter4clean

An integer specifying the number of iterations after which motif cleaning is performed. If set to a value greater than iter_max, no cleaning is performed. Motif cleaning removes redundant or poorly fitting motifs to refine the final motif set.

tol4clean

A numeric value representing the tolerance on the Bhattacharyya distance for motif cleaning. This parameter determines the threshold for identifying and removing redundant motifs during the cleaning process.

quantile4clean

A numeric value specifying the dissimilarity quantile used for motif cleaning. This quantile determines which motifs are considered sufficiently dissimilar to be retained in the final set.

return_options

A logical value indicating whether to return the options passed to the ProbKMA method. When set to TRUE, users receive detailed information about the algorithm's configuration, facilitating transparency and reproducibility.

Y1

A list of derivative curves, with the same structure as Y0, used when the dissimilarity measure involves derivatives (for example "d0_d1_L2"). These derivatives allow the algorithm to capture both shape and rate changes in the functional data.

P0

An initial membership matrix (N x K), where N is the number of curves and K is the number of clusters. If set to NULL, a random matrix is generated, initiating the probabilistic clustering process.

S0

An initial shift warping matrix (N x K). If set to NULL, a random matrix is generated to initialize the alignment process, allowing motifs to adapt to variations in the data.

n_subcurves

An integer specifying the number of splitting subcurves used when the number of curves is equal to one. This parameter allows the algorithm to handle single-curve datasets by dividing them into manageable segments for motif discovery.

sil_threshold

A numeric value representing the threshold to filter candidate motifs based on their silhouette scores. This ensures that only motifs with sufficient clustering quality are retained in the final results.

set_seed

A logical value indicating whether to set a random seed for reproducibility. When set to TRUE, the function initializes the random number generator to ensure consistent results across multiple runs.

seed

An integer specifying the random seed used for initialization when set_seed is TRUE. This parameter guarantees reproducibility of the clustering and alignment processes.

exe_print

A logical value determining whether to print execution details for each iteration. When set to TRUE, users receive real-time feedback on the algorithm's progress, aiding in monitoring and debugging.

V_init

A list of motif sets used as starting points for clustering in place of random initializations. If 'n_init' is greater than the number of motif sets given in 'V_init', the remaining initializations are generated randomly. For example, if 'n_init = 10' but only 5 motif sets are given in 'V_init', the algorithm uses these 5 initializations and generates 5 more at random.

transformed

A logical value indicating whether to normalize the curve segments to the interval [0,1] before applying the dissimilarity measure. Setting 'transformed = TRUE' scales each curve segment between 0 and 1, which allows for the identification of motifs with consistent shapes but different amplitudes. This normalization is useful for cases where motif occurrences may vary in amplitude but have similar shapes, enabling better pattern recognition across diverse data scales.
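The effect of this normalization can be seen with a small base R sketch that applies min-max scaling to two segments with the same shape but different amplitude and offset. The helper rescale01 is hypothetical, and the handling of constant segments is an assumption added to avoid division by zero.

```r
# Min-max scaling of a curve segment to [0, 1]; illustrative helper
rescale01 <- function(segment) {
  rng <- range(segment)
  # Constant segment: return zeros rather than dividing by zero (assumption)
  if (rng[1] == rng[2]) return(rep(0, length(segment)))
  (segment - rng[1]) / (rng[2] - rng[1])
}

x <- seq(0, 1, length.out = 50)
small <- sin(2 * pi * x)              # amplitude 1
large <- 10 * sin(2 * pi * x) + 3     # amplitude 10, shifted by 3

# After rescaling, the two segments coincide up to floating-point error
same_shape <- isTRUE(all.equal(rescale01(small), rescale01(large)))
```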

n_init_motif

The number of initial motif sets from 'V_init' to be used directly as starting points in clustering. If 'n_init_motif' is larger than the number of motif sets provided in 'V_init', all provided sets are used and the remaining initializations are generated randomly to reach 'n_init'. For example, if 'n_init = 10' and 'n_init_motif = 5' with only 3 motif sets in 'V_init', the algorithm uses these 3 sets and generates 7 additional random initializations.
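The bookkeeping in the two examples for 'V_init' and 'n_init_motif' can be written as a short base R helper. The function count_inits is hypothetical and only reproduces the arithmetic described in the text; it does not call the package.

```r
# Split n_init initializations into those taken from V_init and those drawn at random.
# n_available is the number of motif sets supplied in V_init.
count_inits <- function(n_init, n_available, n_init_motif = n_available) {
  from_V_init <- min(n_init_motif, n_available, n_init)
  c(from_V_init = from_V_init, random = n_init - from_V_init)
}

case_a <- count_inits(n_init = 10, n_available = 5)                    # 5 supplied, 5 random
case_b <- count_inits(n_init = 10, n_available = 3, n_init_motif = 5)  # 3 supplied, 7 random
```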

funBIalign Options

The following parameters are specific to the funBIalign algorithm:

portion_len

An integer specifying the length of curve portions to align. This parameter controls the granularity of alignment, allowing the algorithm to focus on specific segments of the curves for motif discovery.

min_card

An integer representing the minimum cardinality of motifs, i.e., the minimum number of motif occurrences required for a motif to be considered valid. This ensures that only motifs with sufficient representation across the dataset are retained.

cut_off

A double giving the cut-off applied to the motif ranking. Only motifs whose rank falls within the cut_off are retained, which focuses visualization on the most significant motifs.

See Also

ProbKMA: Probabilistic K-means with Local Alignment
funBIalign: Hierarchical Clustering with Mean Squared Residue Scores

Examples


# Example 1: Discover motifs using ProbKMA

# Define dissimilarity measure and weight parameter
diss <- 'd0_d1_L2'
alpha <- 0.5

# Define number of motifs and their minimum lengths
K <- c(2, 3)
c <- c(61, 51)
n_init <- 10

# Load simulated data
data("simulated200")

# Perform motif discovery using ProbKMA
results <- funMoDisco::discoverMotifs(
  Y0 = simulated200$Y0,
  method = "ProbKMA",
  stopCriterion = "max",
  name = tempdir(),
  plot = TRUE,
  probKMA_options = list(
    Y1 = simulated200$Y1,
    K = K,
    c = c,
    n_init = n_init,
    diss = diss,
    alpha = alpha
  ),
  worker_number = NULL
)

# Modify silhouette threshold and re-run post-processing
results <- funMoDisco::discoverMotifs(
  Y0 = simulated200$Y0,
  method = "ProbKMA",
  stopCriterion = "max",
  name = tempdir(),
  plot = TRUE,
  probKMA_options = list(
    Y1 = simulated200$Y1,
    K = K,
    c = c,
    n_init = n_init,
    diss = diss,
    alpha = alpha,
    sil_threshold = 0.5
  ),
  worker_number = NULL
)

# Example 2: Discover motifs using funBIalign
results_funbialign <- funMoDisco::discoverMotifs(
  Y0 = simulated200$Y0,
  method = "funBIalign",
  stopCriterion = 'Variance',
  name = tempdir(),
  plot = TRUE,
  funBIalign_options = list(
    portion_len = 60,
    min_card = 3,
    cut_off = 1.0
  )
)



funMoDisco documentation built on April 16, 2025, 1:10 a.m.