knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
Many numerical analyses are invalid for nominal data: the mode is the only meaningful measure of central tendency, and frequency-based tests, such as the Chi-square test, are the most common statistical analyses that make sense.
The NIMAA package [@nimaa] provides a comprehensive pipeline for nominal data mining, which can effectively find relationships among the labels of each nominal variable based on pairwise associations with other nominal variables. You can also check for updates here: NIMAA.
It uses bipartite graphs to show how two different types of data are linked together, and stores them in an incidence matrix for further network analysis. Finding large submatrices with non-missing values and predicting edges are further ways NIMAA explores local and global similarities among the labels of nominal variables.
Then, using a variety of network projection methods, two unipartite graphs are constructed from the given submatrix. NIMAA provides several options for clustering the projected networks and selecting the best result based on internal measures and external prior knowledge (ground truth). When weighted bipartite networks are considered, the best clustering result serves as the benchmark for edge prediction analysis: it is used to determine which imputation method best predicts the edge weights of the bipartite network, by comparing how similar the clustering results are before and after imputation. Through edge prediction analysis, NIMAA extracts more information from the whole dataset despite the missing values.
library(NIMAA)
In this section, we demonstrate how to perform a NIMAA analysis on a weighted bipartite network using the beatAML dataset. beatAML is one of four datasets included in the NIMAA package [@nimaa]. This dataset has three columns: the first two contain nominal variables, while the third contains a numerical variable.
knitr::kable(NIMAA::beatAML[1:10,], caption='The first ten rows of beatAML dataset')
Read the data from the package:
# read the data
beatAML_data <- NIMAA::beatAML
The plotIncMatrix() function prints some information about the incidence matrix derived from the input data, such as its dimensions and the proportion of missing values, as well as an image of the matrix. It also returns the incidence matrix object.
NB: To keep the size of the vignette small enough for CRAN rules, we won't output the interactive figure here.
beatAML_incidence_matrix <- plotIncMatrix(
  x = beatAML_data,        # original data with 3 columns
  index_nominal = c(2, 1), # the first two columns are nominal data
  index_numeric = 3,       # the third column is numeric data
  print_skim = FALSE,      # if you want to check the skim output, set this as TRUE
  plot_weight = TRUE,      # plot the weighted incidence matrix
  verbose = FALSE          # do NOT save the figures to a local folder
)
knitr::include_graphics("patient_id-inhibitor.png")
Given that we have the incidence matrix, we can easily reconstruct the corresponding bipartite network. The NIMAA package offers two options for visualizing the bipartite network: static or interactive plots.
The plotBipartite() function customizes the corresponding bipartite network visualization based on the igraph package [@igraph] and returns the igraph object.
bipartGraph <- plotBipartite(inc_mat = beatAML_incidence_matrix, vertex.label.display = TRUE)

# show the igraph object
bipartGraph
The plotBipartiteInteractive() function generates a customized interactive bipartite network visualization based on the visNetwork package [@vis].
NB: To keep the size of the vignette small enough, we do not output the interactive figure here. Instead, we show a screenshot of part of the beatAML dataset.
plotBipartiteInteractive(inc_mat = beatAML_incidence_matrix)
knitr::include_graphics("interactiveplot.png")
The NIMAA package contains a function called analyseNetwork() that provides more details about the network topology and common centrality measures for vertices and edges.
analysis_result <- analyseNetwork(bipartGraph)

# showing the general measures for network topology
analysis_result$general_stats
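For orientation, a couple of common vertex centralities can also be computed directly with igraph on the same graph object; this is only a cross-check for intuition, not part of analyseNetwork() itself:

# a minimal cross-check with igraph (not part of analyseNetwork itself)
head(sort(igraph::degree(bipartGraph), decreasing = TRUE))    # highest-degree vertices
head(sort(igraph::strength(bipartGraph), decreasing = TRUE))  # weighted degree (vertex strength)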
In the case of a weighted bipartite network, the part of the dataset with the fewest missing values should be used for the next steps, to avoid the sensitivity problems of clustering-based methods. The extractSubMatrix() function extracts submatrices that contain no missing values, or that contain a certain percentage of missing values (not available for the element-max matrix), depending on the arguments. The result is also shown as a plotly plot [@plotly]; screenshots for the beatAML dataset are shown below.
The extraction process is performed and visualized in two ways, which can be chosen according to the user's preference: using the original input matrix (row-wise) or using the transposed matrix (column-wise). NIMAA extracts the largest submatrices with non-missing values, or with a specified proportion of missing values (set via the bar argument), in four predefined ways selected through the shape argument.
Here we extract two shapes of submatrices from the beatAML_incidence_matrix, namely square and rectangular with the maximum number of elements:
sub_matrices <- extractSubMatrix(
  x = beatAML_incidence_matrix,
  shape = c("Square", "Rectangular_element_max"), # the selected shapes of submatrices
  row.vars = "patient_id",
  col.vars = "inhibitor",
  plot_weight = TRUE,
  print_skim = FALSE
)
NB: To keep the size of the vignette small enough, we show screenshots.
knitr::include_graphics("Row_wise_arrangement.png")
knitr::include_graphics("Column_wise_arrangement.png")
This function also prints a value called binmatnest2.temperature, which indicates the nestedness of the input matrix. If the input matrix is highly nested (binmatnest2.temperature >= 1), it is recommended to split the input matrix into parts and follow the analysis in each part independently.
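If you want to sanity-check the nestedness yourself, one option is a related nestedness temperature from the vegan package, computed on the binary (presence/absence) version of the incidence matrix. This is an assumption for illustration only: vegan is not a NIMAA dependency, its temperature measure is not identical to binmatnest2, and the call may be slow on large matrices.

# hedged sketch: a related nestedness temperature via the vegan package (assumed installed)
presence <- 1 * !is.na(beatAML_incidence_matrix)  # 1 where an edge weight is observed, 0 otherwise
vegan::nestedtemp(presence)                       # nestedness temperature of the binary matrix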
The findCluster() function performs seven commonly used network clustering algorithms, with the option of preprocessing the input incidence matrix after projecting the bipartite network into unipartite networks. Preprocessing options such as normalization and thresholding help remove weak edges, that is, edges with low weights. It is also possible to use either binarized or original edge weights before performing community detection or clustering. Finally, all of the applied clustering methods can be compared and the best one chosen based on internal or external cluster validity measures, such as the average silhouette width and the Jaccard similarity index. In addition to being displayed in the console, the findings are provided as bar plots.
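As a small piece of intuition for the weak-edge removal, the toy sketch below (plain igraph on a made-up five-node ring; findCluster() does the equivalent internally on the projected network) deletes every edge whose weight falls below the median edge weight:

# toy sketch of the 'median threshold + delete' weak-edge removal (igraph only)
toy_g <- igraph::make_ring(5)
igraph::E(toy_g)$weight <- c(0.1, 0.9, 0.4, 0.8, 0.2)
thr <- stats::median(igraph::E(toy_g)$weight)   # threshold = median edge weight
toy_g <- igraph::delete_edges(toy_g, igraph::E(toy_g)[igraph::E(toy_g)$weight < thr])
igraph::E(toy_g)$weight                         # only the stronger edges remain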
To begin, let us examine how to use findCluster() to find clusters of patient_ids based on the Rectangular_element_max submatrix of the beatAML bipartite network, projected onto a patient unipartite similarity network.
cls1 <- findCluster(
  sub_matrices$Rectangular_element_max,
  part = 1,
  method = "all",            # all available clustering methods
  normalization = TRUE,      # normalize the input matrix
  rm_weak_edges = TRUE,      # remove the weak edges in the graph
  rm_method = 'delete',      # delete the weak edges instead of lowering their weights to 0
  threshold = 'median',      # use the median of edge weights as the threshold
  set_remaining_to_1 = TRUE  # set the weights of the remaining edges to 1
)
Additionally, if external features (prior knowledge) are provided by setting the extra_feature argument, the function compares the clustering results with these external features in terms of similarity.
For instance, let's generate some random features for the patients in the beatAML dataset. The following is an example:
external_feature <- data.frame(row.names = cls1$walktrap$names)
external_feature[, 'membership'] <- paste(
  'group',
  findCluster(sub_matrices$Rectangular_element_max,
              method = c('walktrap'),
              rm_weak_edges = TRUE,
              set_remaining_to_1 = FALSE,
              threshold = 'median',
              comparison = FALSE)$walktrap$membership
)
dset1 <- utils::head(external_feature)
knitr::kable(dset1, format = "html")
Clustering is done on the patient similarity network again, but this time the extra_feature argument is set to provide external validation for the cluster finding analysis.
cls <- findCluster(
  sub_matrices$Rectangular_element_max,
  part = 1,
  method = "all",
  normalization = TRUE,
  rm_weak_edges = TRUE,
  rm_method = 'delete',
  threshold = 'median',
  set_remaining_to_1 = TRUE,
  extra_feature = external_feature # add an extra feature reference here
)
As you can see, two new indices (jaccard_similarity and corrected_rand) are included as external cluster validity measures, representing the similarity between the clustering and the prior knowledge used as ground truth.
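To make these indices concrete, here is a minimal, hypothetical helper (not a NIMAA function) that computes a pairwise Jaccard similarity between two membership vectors: the number of node pairs grouped together in both partitions, divided by the number of pairs grouped together in at least one of them.

# hypothetical helper, not part of NIMAA: pairwise Jaccard similarity of two partitions
jaccard_partition <- function(m1, m2) {
  same1 <- outer(m1, m1, "==")   # TRUE where a pair of nodes shares a cluster in partition 1
  same2 <- outer(m2, m2, "==")   # TRUE where a pair of nodes shares a cluster in partition 2
  pairs <- upper.tri(same1)      # look at each unordered node pair once
  sum(same1[pairs] & same2[pairs]) / sum(same1[pairs] | same2[pairs])
}

# e.g. compare two of the clustering results obtained above on the same nodes
jaccard_partition(cls$leading_eigen$membership, cls$walktrap$membership)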
The following sections show how to perform the clustering analysis on the other part of the beatAML bipartite network, namely the inhibitors. In this case, we simply need to set part to 2.
cls2 <- findCluster(
  sub_matrices$Rectangular_element_max, # the same submatrix
  part = 2                              # set to 2 to use the other part of the bipartite network
)
In this section, we look at the clustering results, with a particular emphasis on cluster visualization and scoring measures.
The plotCluster() function makes an interactive network figure that shows nodes belonging to the same cluster in the same color and nodes belonging to different clusters in separate colors.
plotCluster(graph = cls2$graph, cluster = cls2$louvain)
NB: To keep the size of the vignette small enough, we do not output the interactive figure here. Instead, we show a screenshot.
knitr::include_graphics("plotCluster.png")
The Sankey diagram is used to depict the connections between clusters within each part of the bipartite network. The display is also interactive, and by grouping the nodes within each cluster as "summary" nodes, the visualClusterInBipartite() function emphasizes how the clusters of each part are connected together.
visualClusterInBipartite(
  data = beatAML_data,
  community_left = cls1$leading_eigen,
  community_right = cls2$fast_greedy,
  name_left = 'patient_id',
  name_right = 'inhibitor')
NB: To keep the size of the vignette small enough, we do not output the interactive figure here. Instead, we show a screenshot.
knitr::include_graphics("visualClusterInBipartite.png")
Based on the fpc package [@fpc] and @DBLP, the scoreCluster() function provides additional internal cluster validity measures such as entropy and coverage. The scoring is based on the weight fraction of all intra-cluster edges relative to the total weight of all edges in the network.
scoreCluster(community = cls2$infomap, graph = cls2$graph, dist_mat = cls2$distance_matrix)
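As a rough illustration of the coverage idea, the sketch below computes the fraction of total edge weight that stays inside clusters, using igraph directly. It is only meant to convey the concept, assumes the projected graph carries an edge weight attribute, and is not necessarily how scoreCluster() computes its measures.

# illustration of coverage: weight fraction of intra-cluster edges (igraph only)
w <- igraph::E(cls2$graph)$weight                      # edge weights of the projected network
between <- igraph::crossing(cls2$infomap, cls2$graph)  # TRUE for edges running between clusters
sum(w[!between]) / sum(w)                              # share of weight kept inside clusters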
The validateCluster() function calculates the similarity of a given clustering result to the provided ground truth. It provides external cluster validity measures, including corrected.rand and the Jaccard similarity.
validateCluster(community = cls$leading_eigen, extra_feature = external_feature, dist_mat = cls$distance_matrix)
The predictEdge() function uses the given imputation methods to impute the missing values in the input data matrix. The output is a list in which each element is an updated version of the input data matrix with no missing values, produced by a separate imputation method. The available imputation techniques are listed below:
median replaces the missing values with the median of each row (observation); see the short sketch after this list,
knn is the method from the bnstruct package [@bnstruct],
als and svd are methods from the softImpute package [@softImpute],
CA, PCA and FAMD are from the missMDA package [@missMDA],
the others are from the well-known mice package [@mice].
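As a tiny illustration of the row-median idea mentioned in the first item, here is a plain base-R sketch on a made-up matrix; it is independent of predictEdge() and only shows the concept.

# base-R sketch of row-median imputation on a toy matrix (concept only)
impute_row_median <- function(mat) {
  t(apply(mat, 1, function(r) {
    r[is.na(r)] <- stats::median(r, na.rm = TRUE)  # fill each row's NAs with that row's median
    r
  }))
}
toy <- matrix(c(1, NA, 3, 4, 5, NA), nrow = 2, byrow = TRUE)
impute_row_median(toy)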
imputations <- predictEdge(
  inc_mat = beatAML_incidence_matrix,
  method = c('svd', 'median', 'als', 'CA'))

# show the result format
summary(imputations)
The validateEdgePrediction() function compares the clustering results obtained after each imputation with a predefined benchmark to see how similar they are. The benchmark was established in the previous step by validating the clustering on submatrices with non-missing values. The function performs this analysis for a number of user-specified imputation methods, and a ranking plot shows how each method ranks. The following measures are used to compare the methods:
Jaccard similarity index
Dice similarity coefficient
Rand index
Minkowski measure (inverted)
Fowlkes–Mallows index
validateEdgePrediction(imputation = imputations,
                       refer_community = cls1$fast_greedy,
                       clustering_args = cls1$clustering_args)
Finally, the best imputation method can be chosen, allowing the label relationships to be re-examined or the nominal data mining to be re-run. In other words, users can use the imputed matrix as an input matrix and redo the previous steps to mine the hidden relationships in the original dataset. Note that this matrix has no missing values, so no submatrix extraction is required.
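For example, one of the imputed matrices could be fed back into the clustering step. In the sketch below the choice of the 'als' result is arbitrary and assumes the elements of the imputations list are named after the methods; in practice, pick the method that ranked best above.

# hedged sketch: redo the clustering on an imputed (complete) incidence matrix
best_imputed <- imputations$als   # arbitrary choice for illustration; use the best-ranked method
cls_imputed <- findCluster(
  best_imputed,
  part = 1,
  method = "all",
  normalization = TRUE,
  rm_weak_edges = TRUE,
  rm_method = 'delete',
  threshold = 'median',
  set_remaining_to_1 = TRUE
)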