In HBPMedical/CCC: A wrapper to run the 3C methodology

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

The package implements the "3C-strategy": Categorization, Clustering, Classification that had been developed for application in medical research, specifically , to refine a diagnosed diseases to more relevant classes.

C1 - Categorization

The analysis in the package assumes that the data has be prepoccessed as follows: highly correlated variables removed, missing values handled and varialbes were transformed for symmetry.

In the first step, features are categorized, according to experts knowledge, to three categories: + Clinical Measures (CM) - Disease symptoms, Patients' functional state and abilities.
+ Potential Biomarkers (PB) - Imaging, Proteins, Genetics etc.
+ Assigned Diagnosis (DX) - The patients' diagnoses.

Each feature category is registered in an auxilary file, which includes to columns: One named "varName", which is the variable/column name from the data file, and another column named "varCategory" which is the categorization of the variable (i.e.: Dx, CM or PB from the categorization done in stage C1).

A second file includes the data itself: data with Dx (disease diagnosis), CM (Clinical measurements) PB (Potential biomarkers). Each row indicates a patient.

data(c3_sample2)
data(c3_sample2_categories)
head(c3_sample2_categories) 
table(c3_sample2_categories[,"varCategory"])
head(c3_sample2)[, 1:5]

C2 - Clustering

This step includes: 1. Feature Selection: The initial screening of clinical features using the assigned diagnosis as target.
2. decision for number of clusters: The selected features are used to decide the appropriate number of clusters, using K-gap statistic. 3. Clustering

library(CCC)  
CM_AD_matrix <- get_xy_from_DATA_C2(c3_sample2, c3_sample2_categories)
  x <- CM_AD_matrix[[1]]  # CM matrix
  y <- CM_AD_matrix[[2]]  # DD vector

  C2_results <- C2(x[,1:50] , y, feature_selection_method="RF", num_clusters_method="Manhattan", clustering_method="Manhattan", plot.num.clus=TRUE, plot.clustering=TRUE, k=6)

C2_results

When k is missing, the K-gap statistic plot and summary are presented and teh user may choose the value.

C3 - Classification

Potential Biomarkers selection (supervised)
Using our newly assigned clusters from C2 we try to identify potential biomarkers using the newly created clusters as the target.
Same algorithms as for stage C2.
Clusters biomarkers classification (supervised)

Running the C3 function:
Finally, we will create a new response variable (named "newy") using the assigned diagnoses obtained in step C2.
We will also create a data frame that consists of PB variables from our initial data file.
We will use both of the above to create a new classification based on the PB variables.

Feature Selection method: Random Forest (Default)
Classification: Random Forest (Default)

# Creating PB matrix
PBx <- get_PBx_from_DATA_C3(c3_sample2, c3_sample2_categories)
newy <- unlist(C2_results[[3]])
C3_results <- C3(PBx, newy, feature_selection_method="RF", classification_method="RF") 
C3_results
table(newy, C3_results[[2]])