SubtypingOmicsData: Subtyping multi-omics data

Description

Perform subtyping using multiple types of data

Usage

SubtypingOmicsData(
  dataList,
  kMin = 2,
  kMax = 5,
  k = NULL,
  agreementCutoff = 0.5,
  ncore = 1,
  verbose = T,
  sampledSetSize = 2000,
  knn.k = NULL,
  ...
)

Arguments

dataList

A list of data matrices. Each matrix represents one data type, where the rows are items and the columns are features. All matrices must contain the same set of items.
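A minimal sketch of preparing such a list, assuming the items are identified by row names. The matrices, sample names, and feature names below are hypothetical toy data, and aligning the row order is a precaution taken here rather than a documented requirement.

```r
# Hypothetical toy data: 6 samples measured on two data types.
set.seed(1)
samples <- paste0("sample", 1:6)
expr <- matrix(rnorm(6 * 4), nrow = 6,
               dimnames = list(samples, paste0("gene", 1:4)))
# The second matrix lists the same samples in reverse order.
methy <- matrix(rnorm(6 * 3), nrow = 6,
                dimnames = list(rev(samples), paste0("cpg", 1:3)))

# Keep only the shared samples and put them in the same order everywhere.
common <- Reduce(intersect, list(rownames(expr), rownames(methy)))
dataList <- list(
  GE = expr[common, , drop = FALSE],
  ME = methy[common, , drop = FALSE]
)

# Sanity check: every matrix has identical row names.
stopifnot(all(sapply(dataList, function(m) identical(rownames(m), common))))
```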

kMin

The minimum number of clusters considered when automatically detecting the number of clusters in PerturbationClustering. This parameter is passed to PerturbationClustering and does not affect the final number of clusters in SubtypingOmicsData. Default value is 2.

kMax

The maximum number of clusters considered when automatically detecting the number of clusters in PerturbationClustering. This parameter is passed to PerturbationClustering and does not affect the final number of clusters in SubtypingOmicsData. Default value is 5.

k

The number of clusters. If k is set then kMin and kMax will be ignored.

agreementCutoff

The agreement threshold required to be considered consistent. Default value is 0.5.

ncore

Number of cores that the algorithm should use. Default value is 1.

verbose

Set it to TRUE or FALSE to get more or fewer details, respectively.

sampledSetSize

The number of samples used in the sampling process when the dataset is large. Default value is 2000.

knn.k

The value of k for the k-nearest neighbors algorithm. If knn.k is not set, the elbow method is used to calculate k.

...

These arguments are passed to the PerturbationClustering algorithm. See Details for more information.

Details

SubtypingOmicsData implements multi-omics subtyping based on the perturbation clustering algorithm of Nguyen et al. (2017), Nguyen et al. (2019), and Nguyen et al. (2021). The input is a list of data matrices, where each matrix holds the molecular measurements of one data type. The input matrices must have the same number of rows (the same set of samples). SubtypingOmicsData aims to find the optimal number of subtypes and the cluster membership of each sample from the integrated input data dataList, in two processing stages:

1. Stage I: The algorithm first partitions each data type using the function PerturbationClustering. It then merges the connectivities across data types into similarity matrices. Both kmeans and a similarity-based clustering algorithm, partitioning around medoids (pam), are used to partition the resulting similarity matrices. The algorithm returns the partitioning that agrees the most with the individual data types.
2. Stage II: The algorithm attempts to split each discovered group if there is a strong agreement between data types, or if the subtyping in Stage I is very unbalanced.
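The merging step of Stage I can be illustrated with a small sketch. It is not the package's implementation: the cluster labels are hypothetical, averaging the connectivity matrices is an assumed way of merging them, and hierarchical clustering from base R stands in for the kmeans and pam partitioning named above.

```r
# Hypothetical cluster labels for 6 samples from three data types.
labelsGE <- c(1, 1, 1, 2, 2, 2)
labelsME <- c(1, 1, 1, 2, 2, 2)
labelsMI <- c(1, 1, 2, 2, 2, 2)

# Connectivity matrix: entry (i, j) is 1 if samples i and j share a cluster.
connectivity <- function(labels) outer(labels, labels, "==") * 1

# Assumption: merge by averaging the connectivities across data types.
similarity <- (connectivity(labelsGE) +
               connectivity(labelsME) +
               connectivity(labelsMI)) / 3

# Stand-in partitioning step: average-linkage hierarchical clustering
# on the dissimilarity 1 - similarity, cut into two groups.
tree <- hclust(as.dist(1 - similarity), method = "average")
groups <- cutree(tree, k = 2)
print(groups)
```

Sample 3 is grouped with samples 1 and 2 in two of the three data types, so its merged similarity to them (2/3) exceeds its similarity to samples 4 to 6 (1/3), and it ends up in the first group.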

When clustering a large number of samples, this function uses a subsampling technique to reduce the computational complexity. The technique is controlled by the two parameters sampledSetSize and knn.k. Please consult Nguyen et al. (2021) for details.
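The general idea of clustering a subset and then labeling the remaining samples by their nearest neighbors can be sketched as follows. This is a rough base-R illustration under stated assumptions, not the procedure of Nguyen et al. (2021): the data are simulated, kmeans stands in for the clustering of the subset, and the variables subsetSize and knnK merely play the roles of sampledSetSize and knn.k.

```r
set.seed(42)
# Hypothetical data: 300 samples in two well-separated groups, 5 features.
x <- rbind(matrix(rnorm(150 * 5, mean = 0), ncol = 5),
           matrix(rnorm(150 * 5, mean = 6), ncol = 5))

subsetSize <- 60  # plays the role of sampledSetSize
knnK <- 5         # plays the role of knn.k

# 1. Cluster only a random subset of the samples.
idx <- sample(nrow(x), subsetSize)
subsetLabels <- kmeans(x[idx, ], centers = 2, nstart = 10)$cluster

# 2. Give each remaining sample the majority label of its k nearest
#    neighbors (Euclidean distance) within the clustered subset.
assignByKnn <- function(train, labels, query, k) {
  apply(query, 1, function(q) {
    d <- sqrt(colSums((t(train) - q)^2))
    nearest <- order(d)[seq_len(k)]
    as.integer(names(which.max(table(labels[nearest]))))
  })
}

labels <- integer(nrow(x))
labels[idx] <- subsetLabels
labels[-idx] <- assignByKnn(x[idx, ], subsetLabels,
                            x[-idx, , drop = FALSE], knnK)
table(labels)
```

Only the subset is clustered directly, so the expensive step scales with subsetSize rather than with the total number of samples.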

Value

SubtypingOmicsData returns a list with at least the following components:

cluster1

A vector of labels indicating the cluster to which each sample is allocated in Stage I

cluster2

A vector of labels indicating the cluster to which each sample is allocated in Stage II

dataTypeResult

A list of results for the individual data types. Each element of the list is the result of PerturbationClustering for the corresponding data matrix provided in dataList.
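A short sketch of how the two label vectors might be compared. The result object below is a mock standing in for the list returned by SubtypingOmicsData, and its label values are invented for illustration; the actual label format may differ.

```r
# Mock result; the label values are hypothetical.
result <- list(
  cluster1 = c(1, 1, 1, 1, 2, 2),
  cluster2 = c("1-1", "1-1", "1-2", "1-2", "2", "2")
)

# Group sizes at each stage.
table(result$cluster1)
table(result$cluster2)

# Cross-tabulate the stages to see how Stage II refines Stage I.
splitTable <- table(stageI = result$cluster1, stageII = result$cluster2)
print(splitTable)

# Stage I groups that Stage II divided into more than one subgroup.
splitGroups <- rownames(splitTable)[rowSums(splitTable > 0) > 1]
print(splitGroups)
```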

References

1. H Nguyen, S Shrestha, S Draghici, T Nguyen. PINSPlus: a tool for tumor subtype discovery in integrated genomic data. Bioinformatics, 35(16):2843-2846, 2019.

2. T Nguyen, R Tagett, D Diaz, S Draghici. A novel method for data integration and disease subtyping. Genome Research, 27(12):2025-2039, 2017.

3. T Nguyen. Horizontal and vertical integration of bio-molecular data. PhD thesis, Wayne State University, 2017.

4. H Nguyen, D Tran, B Tran, M Roy, A Cassell, S Dascalu, S Draghici, T Nguyen. SMRT: Randomized Data Transformation for Cancer Subtyping and Big Data Analysis. Frontiers in Oncology, 2021.

See Also

PerturbationClustering

Examples

# Load the package and the kidney renal clear cell carcinoma (KIRC) data
library(PINSPlus)
data(KIRC)

# Perform subtyping on the multi-omics data
dataList <- list(as.matrix(KIRC$GE), as.matrix(KIRC$ME), as.matrix(KIRC$MI))
names(dataList) <- c("GE", "ME", "MI")
result <- SubtypingOmicsData(dataList = dataList)

# Change the PerturbationClustering algorithm's arguments
result <- SubtypingOmicsData(
    dataList = dataList,
    clusteringMethod = "kmeans",
    clusteringOptions = list(nstart = 50)
)

# Plot the Kaplan-Meier curves and calculate the Cox p-value
library(survival)
cluster1 <- result$cluster1
cluster2 <- result$cluster2

# One color per Stage II group: groups that also exist in Stage I keep their
# label as the color index; new groups get indices above max(cluster1).
# The whole sum stays on one line so that max(cluster1) is actually added.
shared <- intersect(unique(cluster2), unique(cluster1))
newGroups <- setdiff(unique(cluster2), unique(cluster1))
a <- shared
names(a) <- shared
a[newGroups] <- seq_along(newGroups) + max(cluster1)
colors <- a[levels(factor(cluster2))]

# Cox model for the p-value and Kaplan-Meier fit for the curves
coxFit <- coxph(
    Surv(time = Survival, event = Death) ~ as.factor(cluster2),
    data = KIRC$survival,
    ties = "exact"
)
mfit <- survfit(Surv(Survival, Death == 1) ~ as.factor(cluster2), data = KIRC$survival)
plot(
    mfit, col = colors,
    main = "Survival curves for KIRC, level 2",
    xlab = "Days", ylab = "Survival", lwd = 2
)
legend(
    "bottomright",
    legend = paste(
        "Cox p-value: ",
        round(summary(coxFit)$sctest[3], digits = 5),
        sep = ""
    )
)
legend(
    "bottomleft",
    fill = colors,
    legend = paste(
        "Group ",
        levels(factor(cluster2)), ": ", table(cluster2)[levels(factor(cluster2))],
        sep = ""
    )
)

PINSPlus documentation built on Dec. 15, 2021, 1:10 a.m.