motifs_search: Motif Search in Curves

motifs_search R Documentation

Description

The 'motifs_search' function identifies and ranks motifs within a set of curves based on their frequencies and dissimilarity measures. It takes the candidate motifs grouped by hierarchical clustering, selects the best motifs within each cluster, and determines their occurrences in the original curves. The function supports parallel processing, works with each of the supported dissimilarity metrics, and lets the user tune the motif selection criteria.

Usage

motifs_search(
  cluster_candidate_motifs_results,
  R_all = cluster_candidate_motifs_results$R_all,
  R_m = NULL,
  different_R_m_finding = FALSE,
  R_m_finding = NULL,
  use_real_occurrences = FALSE,
  length_diff = Inf,
  worker_number = NULL
)

Arguments

cluster_candidate_motifs_results

A list containing the output from the 'cluster_candidate_motifs' function. This list must include elements such as:

Y0

A list of matrices representing the original curves.

Y1

A list of matrices representing the derivatives of the curves (if applicable).

V0_clean

A list of candidate motifs derived from 'Y0'.

V1_clean

A list of candidate motifs derived from 'Y1' (if applicable).

D_clean

A matrix of dissimilarity measures between motifs and curves.

P_clean

A matrix indicating positive matches (e.g., presence of motifs in curves).

hclust_res

A hierarchical clustering object obtained from 'hclust'.

R_all

A numeric value representing the global radius used for dendrogram cutting.

w

A numeric vector of weights for the dissimilarity index across different dimensions.

transformed

A logical value indicating whether curve segments are normalized to the interval [0,1] before the dissimilarity measure is applied. With 'transformed = TRUE', each curve segment is rescaled between 0 and 1, so that motifs with the same shape but different amplitudes can be identified as occurrences of the same pattern.

max_gap

A numeric value defining the maximum allowed gap in distances for cluster separation.

k_knn

An integer specifying the number of neighbors for K-Nearest Neighbors classification.

votes_knn_Rm

A numeric value defining the probability threshold for KNN-based radius determination.

c

A numeric vector specifying the minimum number of overlapping elements required for motif validation.
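The effect of 'transformed = TRUE' can be illustrated with a minimal base R sketch. The helper 'rescale01' is hypothetical and not part of funMoDisco; it assumes a per-dimension min-max rescaling, which may differ in detail from what the package does internally.

```r
# Hypothetical helper (not part of funMoDisco): rescale each column
# (dimension) of a curve segment to the interval [0, 1].
rescale01 <- function(segment) {
  apply(segment, 2, function(x) {
    rng <- range(x, na.rm = TRUE)
    if (rng[1] == rng[2]) return(rep(0, length(x)))  # constant segment: avoid 0/0
    (x - rng[1]) / (rng[2] - rng[1])
  })
}

t <- seq(0, 1, length.out = 50)
small <- matrix(sin(2 * pi * t), ncol = 1)           # amplitude 1
large <- matrix(5 * sin(2 * pi * t) + 10, ncol = 1)  # amplitude 5, shifted upward

# After rescaling, the two segments coincide up to floating-point error,
# so an L2-type distance between them is close to zero.
max_abs_diff <- max(abs(rescale01(small) - rescale01(large)))
```

Without the rescaling, the same two segments would be far apart in any amplitude-sensitive distance, even though their shapes are identical.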

R_all

A numeric value representing the global radius used to cut the dendrogram, ensuring that clusters are at least twice this radius apart. This parameter defines the grouping of motifs into clusters.

R_m

A numeric vector containing group-specific radii used to identify motif occurrences within each cluster. The length of this vector must match the number of clusters obtained by cutting the dendrogram at a height of '2 * R_all'. If 'NULL', the function automatically determines 'R_m' for each group based on the distances between motifs within the same cluster and all curves.

different_R_m_finding

A logical value indicating whether to use a different radius ('R_m_finding') for finding motif occurrences compared to the initial radius ('R_m'). If 'TRUE', 'R_m_finding' is used; otherwise, 'R_m' is employed. This allows for separate tuning of motif occurrence detection.

R_m_finding

A numeric vector containing group-specific radii used specifically for finding motif occurrences when 'different_R_m_finding' is set to 'TRUE'. The length of this vector must match the number of clusters obtained by cutting the dendrogram at a height of '2 * R_all'. If 'NULL', 'R_m_finding' is determined automatically for each group based on distances between motifs within the same cluster and all curves.

use_real_occurrences

A logical value indicating whether to compute real occurrences of candidate motifs within the curves. If 'TRUE', the function calculates actual frequencies and mean dissimilarities for motif selection, providing more accurate results at the cost of increased computation time. If 'FALSE', it uses approximate frequencies and mean dissimilarities for faster execution. Defaults to 'FALSE'.

length_diff

A numeric value specifying the minimum percentage difference in length required among motifs within the same group to retain multiple motifs. This parameter ensures diversity in motif selection by preventing motifs of similar lengths from being selected simultaneously. It is defined as a percentage relative to the length of the most frequent motif. Defaults to 'Inf', meaning no additional motifs are selected based on length differences.

worker_number

An integer indicating the number of CPU cores to use for parallel processing. If 'NULL' (the default), the function uses one less than the total number of available cores ('detectCores() - 1'). Setting 'worker_number = 1' forces the function to run sequentially without parallelization.
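The documented handling of 'worker_number' can be sketched as follows. The helper 'resolve_workers' is hypothetical; clamping the result to the range from 1 to the number of available cores is an assumption added for safety, not documented behavior of the package.

```r
library(parallel)

# Hypothetical helper mirroring the documented default; the package's
# internal logic may differ.
resolve_workers <- function(worker_number = NULL) {
  available <- parallel::detectCores()
  if (is.na(available)) available <- 1L  # detectCores() can return NA
  if (is.null(worker_number)) {
    worker_number <- available - 1L      # documented default: all cores but one
  }
  # Assumption: never use fewer than 1 worker or more than the available cores.
  max(1L, min(as.integer(worker_number), available))
}

n_workers <- resolve_workers(NULL)             # depends on the machine
run_sequentially <- resolve_workers(1) == 1L   # 'worker_number = 1' means no parallelism
```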

Details

The 'motifs_search' function operates through the following steps:

  1. **Parallelization Setup**: Determines the number of worker cores to use based on 'worker_number'. If 'worker_number > 1', it initializes a cluster for parallel processing.

  2. **Input Preparation**: Depending on the dissimilarity metric ('d0_L2', 'd1_L2', or 'd0_d1_L2'), it prepares the data structures 'Y' and 'V' for processing.

  3. **Dendrogram Cutting**: Cuts the hierarchical clustering dendrogram at a height of '2 * R_all' to define clusters of motifs.

  4. **Radius Determination**: If 'R_m' or 'R_m_finding' is not provided, the function calculates these radii for each cluster based on motif distances and K-Nearest Neighbors (KNN) classification.

  5. **Candidate Motif Selection**: Depending on 'use_real_occurrences', the function either computes real occurrences and uses actual frequencies and mean dissimilarities to select motifs, or it uses approximate measures for faster processing.

  6. **Motif Filtering**: Within each cluster, motifs are ranked based on their frequency and mean dissimilarity. Additional motifs can be selected if their lengths differ sufficiently from the most frequent motif, as defined by 'length_diff'.

  7. **Output Compilation**: The selected motifs and their associated properties are compiled into a comprehensive list for further analysis or visualization.
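Steps 3 and 6 above can be sketched in base R on toy data. Everything below is illustrative: the data, the 'cand' table, and the helper 'select_by_length' are hypothetical, and 'length_diff' is assumed to be a fraction of the top-ranked motif's length. The package's actual selection rule may be more involved.

```r
# Toy stand-in for candidate motifs: six points forming three tight pairs.
x <- c(0, 0.1, 5, 5.1, 10, 10.1)
hclust_res <- hclust(dist(x), method = "average")

# Step 3: cut the dendrogram at height 2 * R_all.
R_all <- 0.5
groups <- cutree(hclust_res, h = 2 * R_all)
n_groups <- length(unique(groups))  # a user-supplied R_m needs this length

# Step 6 (made-up values for one group): rank by frequency (descending),
# breaking ties by mean dissimilarity (ascending).
cand <- data.frame(
  id        = c("a", "b", "c"),
  length    = c(40, 42, 80),
  frequency = c(7, 7, 5),
  mean_diss = c(0.30, 0.20, 0.25),
  stringsAsFactors = FALSE
)
ranked <- cand[order(-cand$frequency, cand$mean_diss), ]

# Keep the top motif, plus any motif whose length differs from it by more
# than length_diff (assumption: relative to the top motif's length).
select_by_length <- function(ranked, length_diff = Inf) {
  top_len <- ranked$length[1]
  rel_diff <- abs(ranked$length - top_len) / top_len
  keep <- rel_diff > length_diff
  keep[1] <- TRUE
  ranked$id[keep]
}

only_top <- select_by_length(ranked)        # default Inf: only the top motif
with_long <- select_by_length(ranked, 0.5)  # also keeps the much longer motif
```

With the default 'length_diff = Inf' no relative difference can exceed the threshold, which is consistent with the documented behavior that no additional motifs are selected.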

Value

A list containing:

V0

A list of selected motifs derived from 'Y0'.

V1

A list of selected motifs derived from 'Y1' (if applicable).

V_length

A numeric vector representing the real lengths of the selected motifs.

V_occurrences

A list detailing the occurrences of each selected motif within the curves.

V_frequencies

A numeric vector indicating the real frequencies of each selected motif.

V_mean_diss

A numeric vector representing the average dissimilarity of each selected motif.

Y0

A list of matrices corresponding to the original curves, as provided in 'cluster_candidate_motifs_results'.

Y1

A list of matrices corresponding to the derivatives of the curves (if applicable), as provided in 'cluster_candidate_motifs_results'.

R_motifs

A numeric vector containing the radii associated with each selected motif.
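As a small illustration of how the per-motif vectors in the returned list line up, the sketch below tabulates them. The object 'res' is a hypothetical stand-in built by hand: only the element names come from this page, the values are made up, and the sketch assumes each vector holds one entry per selected motif.

```r
# Hypothetical stand-in for the list returned by motifs_search(); the
# values are invented for illustration.
res <- list(
  V_length      = c(40, 80),
  V_frequencies = c(7, 5),
  V_mean_diss   = c(0.20, 0.25),
  R_motifs      = c(0.6, 0.9)
)

# One row per selected motif.
summary_tab <- data.frame(
  motif     = seq_along(res$V_length),
  length    = res$V_length,
  frequency = res$V_frequencies,
  mean_diss = res$V_mean_diss,
  radius    = res$R_motifs
)

# Most frequent motifs first.
summary_tab <- summary_tab[order(-summary_tab$frequency), ]
```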


funMoDisco documentation built on April 16, 2025, 1:10 a.m.