gsstda | R Documentation |
Gene Structure Survival using Topological Data Analysis. This function implements an analysis for expression array data based on the Progression Analysis of Disease developed by Nicolau et al. (doi: 10.1073/pnas.1102826108), which condenses the information contained in an expression matrix into a combinatorial graph. The novelty is that survival information is integrated into the analysis.
The analysis consists of three parts: preprocessing of the data, gene selection together with the filter function, and the mapper algorithm. The preprocessing is the Disease Specific Genomic Analysis (proposed by Nicolau et al.), which uses linear models to remove the part of the data that is considered "healthy" and keeps only the component that is due to the disease. The genes are then selected according to their variability and their relationship to survival, and the value of the filter function for each patient is calculated taking into account the survival associated with each gene. Finally, the mapper algorithm is applied to the disease component matrix and the filter function values to obtain a combinatorial graph.
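The gene selection step can be illustrated with a minimal base R sketch. It assumes each gene already has a single numeric score in which positive values indicate worse survival prognosis and negative values better prognosis, and it mirrors the "Abs" and "Top_Bot" options of gen_select_type. The helper select_genes, the toy scores and the rounding rule are hypothetical and are not taken from the package source.

```r
# Hypothetical helper (not part of GSSTDA): choose gene names from a named
# numeric score vector, following the "Abs" / "Top_Bot" idea.
select_genes <- function(gene_score, percent_gen_select = 10,
                         gen_select_type = c("Top_Bot", "Abs")) {
  gen_select_type <- match.arg(gen_select_type)
  stopifnot(is.numeric(gene_score), !is.null(names(gene_score)),
            percent_gen_select > 0, percent_gen_select <= 100)

  # Number of genes to keep; rounding up is an assumption of this sketch.
  n_select <- ceiling(length(gene_score) * percent_gen_select / 100)

  if (gen_select_type == "Abs") {
    # Genes with the largest absolute score.
    ranked <- names(sort(abs(gene_score), decreasing = TRUE))
    return(ranked[seq_len(n_select)])
  }

  # "Top_Bot": half from the highest scores, half from the lowest scores.
  n_top <- ceiling(n_select / 2)
  n_bot <- n_select - n_top
  ranked <- names(sort(gene_score, decreasing = TRUE))
  c(head(ranked, n_top), tail(ranked, n_bot))
}

# Toy scores for 20 made-up genes (gene1 is the most negative, gene20 the most positive).
gene_score <- setNames(c(-10:-1, 1:10) / 2, paste0("gene", 1:20))

top_bot <- select_genes(gene_score, percent_gen_select = 20, gen_select_type = "Top_Bot")
abs_sel <- select_genes(gene_score, percent_gen_select = 20, gen_select_type = "Abs")
```

With symmetric toy scores both options pick the same four genes; they differ when the score distribution is skewed, because "Abs" may then draw all genes from one tail.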
gsstda(
full_data,
survival_time,
survival_event,
case_tag,
control_tag = NA,
gamma = NA,
gen_select_type = "Top_Bot",
percent_gen_select = 10,
num_intervals = 5,
percent_overlap = 40,
distance_type = "correlation",
clustering_type = "hierarchical",
num_bins_when_clustering = 10,
linkage_type = "single",
optimal_clustering_mode = NA,
silhouette_threshold = 0.25,
na.rm = TRUE
)
full_data |
Input matrix whose columns correspond to the patients and rows to the genes. |
survival_time |
Numerical vector of the same length as the number of
columns of |
survival_event |
Numerical vector of the same length as the number of
columns of |
case_tag |
Character vector of the same length as the number of
columns of |
control_tag |
Tag of the healthy sample, e.g. "T". |
gamma |
A parameter that indicates the magnitude of the noise assumed in
the flat data matrix for the generation of the Healthy State Model. If it
takes the value |
gen_select_type |
How to select the genes to be used in the mapper. With the "Abs" option, the genes with the highest absolute value are chosen. With the "Top_Bot" option, half of the selected genes are those with the highest value (positive value, i.e. worst survival prognosis) and the other half are those with the lowest value (negative value, i.e. best prognosis). "Top_Bot" default option. |
percent_gen_select |
Percentage (from zero to one hundred) of genes to be selected to be used in mapper. 10 default option. |
num_intervals |
Parameter for the mapper algorithm. Number of intervals used to create the first sample partition based on filtering values. 5 default option. |
percent_overlap |
Parameter for the mapper algorithm. Overlap between consecutive intervals, expressed as a percentage. 40 default option. |
distance_type |
Parameter for the mapper algorithm. Type of distance to be used for clustering. Choose between correlation ("correlation") and euclidean ("euclidean"). "correlation" default option. |
clustering_type |
Parameter for the mapper algorithm. Type of clustering method. Choose between the "hierarchical" and "PAM" ("partitioning around medoids") options. "hierarchical" default option. |
num_bins_when_clustering |
Parameter for the mapper algorithm. Number of bins used to build the histogram employed by the standard method for finding the optimal number of clusters. Not necessary if optimal_clustering_mode is "silhouette" or clustering_type is "PAM". 10 default option. |
linkage_type |
Parameter for the mapper algorithm. Linkage criteria used in hierarchical clustering. Choose between "single" for single-linkage clustering, "complete" for complete-linkage clustering or "average" for average linkage clustering (or UPGMA). Only necessary for hierarchical clustering. "single" default option. |
optimal_clustering_mode |
Method for selecting the optimal number of clusters. It is only necessary if the chosen type of algorithm is hierarchical. In this case, choose between "standard" (the method used in the original mapper article) or "silhouette". In the case of the PAM algorithm, the method will always be "silhouette". |
silhouette_threshold |
Minimum value of |
na.rm |
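The roles of num_intervals and percent_overlap can be illustrated with a small base R sketch of an overlapping interval cover of the filter values. It uses a common mapper construction in which all intervals have equal width and each one shares percent_overlap percent of its width with its neighbour. This is an assumption: the function may compute its intervals differently, and cover_intervals and assign_to_intervals are hypothetical helpers, not package functions.

```r
# Hypothetical helper: equal-width intervals covering the range of the filter
# values, where consecutive intervals share `percent_overlap` % of their width.
cover_intervals <- function(filter_values, num_intervals = 5, percent_overlap = 40) {
  stopifnot(is.numeric(filter_values), num_intervals >= 1,
            percent_overlap >= 0, percent_overlap < 100)
  overlap <- percent_overlap / 100
  value_range <- range(filter_values, na.rm = TRUE)

  # Width chosen so that the last interval ends exactly at the maximum.
  width <- diff(value_range) / (num_intervals - (num_intervals - 1) * overlap)
  step <- width * (1 - overlap)

  lower <- value_range[1] + (seq_len(num_intervals) - 1) * step
  upper <- lower + width
  upper[num_intervals] <- value_range[2]  # guard against floating-point drift
  data.frame(lower = lower, upper = upper)
}

# Hypothetical helper: indices of the samples that fall in each interval.
assign_to_intervals <- function(filter_values, intervals) {
  lapply(seq_len(nrow(intervals)), function(i) {
    which(filter_values >= intervals$lower[i] & filter_values <= intervals$upper[i])
  })
}

# Toy filter values for 11 made-up samples.
filter_values <- c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
intervals <- cover_intervals(filter_values, num_intervals = 4, percent_overlap = 50)
samples_in_level <- assign_to_intervals(filter_values, intervals)
```

Because the intervals overlap, a sample can belong to two levels. In mapper, those shared samples are what create edges between nodes from neighbouring intervals.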
|
A gsstda object. It contains: the matrix with the normal space (normal_space), the matrix of the disease components (matrix_disease_component), a matrix with the results of the application of proportional hazard models for each gene (cox_all_matrix), the genes selected for mapper (genes_disease_componen), the matrix of the disease components with information from these genes only (genes_disease_component) and a mapper_obj object. This mapper_obj object contains the
values of the intervals (interval_data), the samples included in each
interval (sample_in_level), information about the cluster to which the
individuals in each interval belong (clustering_all_levels), a list including
the individuals contained in each detected node (node_samples), their size
(node_sizes), the average of the filter function values of the individuals
of each node (node_average_filt) and the adjacency matrix linking the nodes
(adj_matrix). Moreover, information is provided on the number of nodes,
the average node size, the standard deviation of the node size, the number
of connections between nodes, the proportion of connections to all possible
connections and the number of ramifications.
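The graph summaries listed above can be recomputed from an adjacency matrix and a vector of node sizes. The base R sketch below does this on toy data. The matrix, the node names and the sizes are made up, and the sketch assumes an undirected graph in which each connection is counted once. It does not call gsstda, and it does not attempt the number of ramifications because its definition is not given here.

```r
# Toy adjacency matrix for four hypothetical nodes (1 = connected).
node_names <- paste0("Node_", 1:4)
adj_matrix <- matrix(0, nrow = 4, ncol = 4,
                     dimnames = list(node_names, node_names))
adj_matrix["Node_1", "Node_2"] <- 1
adj_matrix["Node_2", "Node_1"] <- 1
adj_matrix["Node_2", "Node_3"] <- 1
adj_matrix["Node_3", "Node_2"] <- 1

# Toy node sizes (number of individuals per node).
node_sizes <- c(Node_1 = 5, Node_2 = 8, Node_3 = 3, Node_4 = 2)

num_nodes <- nrow(adj_matrix)

# Use only the strict upper triangle so that each undirected connection is
# counted once and any diagonal entries are ignored.
num_connections <- sum(adj_matrix[upper.tri(adj_matrix)] != 0)
possible_connections <- num_nodes * (num_nodes - 1) / 2

graph_summary <- list(
  num_nodes = num_nodes,
  average_node_size = mean(node_sizes),
  sd_node_size = sd(node_sizes),
  num_connections = num_connections,
  proportion_connections = num_connections / possible_connections
)
```

Restricting the count to the upper triangle keeps the result the same whether the adjacency matrix is stored symmetrically or as an upper triangular matrix.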
gsstda_object <- gsstda(full_data, survival_time, survival_event, case_tag,
                        gamma = NA,
                        gen_select_type = "Top_Bot", percent_gen_select = 10,
                        num_intervals = 4, percent_overlap = 50,
                        distance_type = "euclidean", num_bins_when_clustering = 8,
                        clustering_type = "hierarchical", linkage_type = "single")